
"Convert" is ambiguous. When you read the text "<head>" from a template file, you probably want to represent it as the HTML "<head>", but when you read it from the database, you may or may not want to represent it as "&lt;head&gt;".

People started using the word "sanitize" exactly because it conveys the information that "you want to treat it differently, depending on where it comes from". We also use the words "dirty" (sometimes "tainted") and "clean", which keep their usual relation to "sanitize".

Now somebody wants to throw away a very concise and expressive jargon just because some people are giving bad advice on the Internet?



This has nothing to do with whether you read it from "a (template) file" or "the database"; it's only about what _format_ it is in. If the template contains HTML, then the conversion to HTML (for the HTML part) is the identity function, of course, and the same holds if the database contains HTML. If you want to use the same thing in a plain text email, you will have to convert it to plain text. If, on the other hand, the template file or the database contains plain text, the reverse applies: conversion to plain text is the identity function, and conversion to HTML is the usual replacement with entity references.
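To make that concrete, here is a rough Python sketch of the four conversions, keyed only on the format the data is in and the format you need. The function names are made up for illustration, and the HTML-to-text direction is a deliberately lossy assumption (it just drops tags and decodes entity references); a real converter would need to decide what to do with links, line breaks, and so on.

```python
import html
from html.parser import HTMLParser


def text_to_text(text: str) -> str:
    """Plain text -> plain text: the identity function."""
    return text


def html_to_html(markup: str) -> str:
    """HTML -> HTML: the identity function."""
    return markup


def text_to_html(text: str) -> str:
    """Plain text -> HTML: replace special characters with entity references."""
    return html.escape(text, quote=True)


class _TextExtractor(HTMLParser):
    """Collects the character data of an HTML fragment, ignoring the tags."""

    def __init__(self) -> None:
        super().__init__()  # convert_charrefs=True decodes entity references
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def html_to_text(markup: str) -> str:
    """HTML -> plain text (lossy sketch): drop tags, decode entity references."""
    extractor = _TextExtractor()
    extractor.feed(markup)
    extractor.close()
    return "".join(extractor.parts)
```

Note that nothing here asks where the string came from. The same "<head>" is passed through unchanged by html_to_html and becomes "&lt;head&gt;" under text_to_html, purely because of what format it is declared to be in.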

That you are using "dirty/tainted" and "clean" only shows how deep the confusion is. There is some justification to use those terms when talking about before and after validation, but other than that it's probably an indication of confusion (which also seems to be the common usage).

Take, for example, a general plain text field for optional free-form text. There is essentially nothing that could be validated (other than maybe that it's a valid UTF-8 string). Now, you want to generate a plain text email using the user input - how would you "sanitize" it?

There is nothing "better"/"cleaner" about any particular encoding, be it plain text, HTML, SQL, or any other. They are simply different encodings. You always have to use the correct one, not the "best"/"cleanest" one, and you always have to know what format the data you are processing is in so that you can convert it correctly.
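As a small illustration, here is the same free-form plain text value going to three different targets in Python. The sample string, the table name, and the surrounding markup are all invented for the example; the point is only that each target gets its own correct conversion and no step "cleans" the value.

```python
import html
import sqlite3

# Free-form plain text, exactly as the user typed it (hypothetical sample).
user_input = "O'Brien <bob@example.com> & co."

# Target: plain text email. Plain text -> plain text is the identity.
email_body = f"Your note:\n{user_input}\n"

# Target: HTML. Plain text -> HTML replaces special characters with entities.
html_fragment = f"<p>{html.escape(user_input)}</p>"

# Target: SQL. A bound parameter lets the driver handle the SQL representation,
# so the value is never spliced into the statement text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (body TEXT)")
conn.execute("INSERT INTO notes (body) VALUES (?)", (user_input,))
stored = conn.execute("SELECT body FROM notes").fetchone()[0]
```

The value that comes back out of the database is the original plain text, byte for byte, so it still needs the plain text to HTML conversion if it is later put into a page.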

This jargon is not at all concise, actually (some people mean "remove 'strange' stuff/clean it", others mean "escape it", ...), and it makes you think in ways that obscure the actual problem that you are solving: Conversion between data types/data representations.



