tech-userlevel archive


Re: [PATCH] replace 0xA0 to whitespace in plain text files (part 2)




On Sep 12, 2008, at 12:48 PM, Greg A. Woods; Planix, Inc. wrote:


On 12-Sep-08, at 5:05 AM, Alan Barrett wrote:

On Fri, 12 Sep 2008, Joachim König wrote:
But at least, we could make the UTF-8 encoding explicit by including
the BOM (byte order mark) at the beginning of such a file.  It is the
byte sequence 0xEF 0xBB 0xBF.

There are (IMHO good) arguments against including BOM in UTF-8.  For
example, see <http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucsutf>.

In my opinion those are really only arguments against _requiring_ use of a BOM on all UTF-8 files.

The first reason given ("On POSIX systems, the locale and not magic file type codes define the encoding of plain text files.") is either a mis-interpretation of strict POSIX requirements, or clear evidence that POSIX is so totally broken in this respect that it _must_ be ignored. The data must define its own encoding if multiple encodings are to be supported. The environment must not define the encoding of arbitrary data with arbitrary sources. See all the issues surrounding e-mail for one good example of how and why this must be.

Note also there's a difference here between data and programs w.r.t. what POSIX is saying about the locale settings. The locale tells the program in what language and encoding to present program output to the user. A static data file does not get to choose its language and presentation when the user looks at it.


Depends on what the file is used for. It's entirely valid for a given application to say "XXX conf file must be ASCII only".

For general text files which are read by anything, then yes, a BOM is likely valid. There are cases where BOMs don't work, however (which is why they are optional).

In general a BOM is only useful for UTF-16 anyway, since there it is needed to distinguish endianness. For UTF-8 it's pretty much superfluous.
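(Not from the thread, just an illustration of that point: U+FEFF can be serialized in two byte orders in UTF-16, so a leading BOM disambiguates them, while UTF-8 has exactly one serialization of it.  A rough C sketch of classifying a buffer by its leading bytes:)

#include <stddef.h>

/* Illustrative only: classify a buffer's leading bytes by BOM. */
const char *
bom_kind(const unsigned char *b, size_t n)
{
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return "UTF-8";      /* only one possible byte order */
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return "UTF-16BE";
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return "UTF-16LE";   /* also the prefix of a UTF-32LE BOM */
    return "no BOM";
}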

The second reason given ("Adding a UTF-8 signature at the start of a file would interfere with many established conventions such as the kernel looking for “#!” at the beginning of a plaintext executable to locate the appropriate interpreter.") obviously has some merit, since sadly the BOM is only recognized to be at the very beginning of a file or stream _and_ its value collides with that of the ZWNBSP character when it appears later in the data. Clearly the designers of these things didn't consider the possibility of one system having files of many different encodings, where users might do something like "sed 's/a/b/' * | awk '$0 ~ /blah/ {print}'" (i.e. what if "blah" here is non-ASCII and thus interpreted in the user's current locale?)


If the shell were updated to interpret UTF-8, the general practice of other applications is to do one of these:

1. Document that the file must be encoded as UTF-8 (and cannot carry a BOM when it is directly executable, due to legacy requirements like kernel exec support).

2. Add a flag indicating "this script is UTF-8 encoded" and otherwise assume it is ASCII, per legacy behaviour. This is often what happens when converting tools whose legacy input formats cannot be changed.

The kernel could also be modified to eat the BOM before looking for #! when doing exec, and then let the given application see it as well. Any of these can be supported.
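(As a rough sketch only -- not actual kernel code, and the function name is invented for illustration -- that check might look something like this:)

#include <string.h>

/*
 * Hypothetical sketch: skip a leading UTF-8 BOM (0xEF 0xBB 0xBF), if any,
 * before testing whether an exec image starts with "#!".
 */
static int
starts_with_shebang(const unsigned char *buf, size_t len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
    size_t off = 0;

    if (len >= 3 && memcmp(buf, bom, 3) == 0)
        off = 3;                /* eat the BOM */

    return len >= off + 2 && buf[off] == '#' && buf[off + 1] == '!';
}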


The third reason given ("Handling BOMs properly would add undesirable complexity even to simple programs like cat or grep that mix contents of several files into one. ") may have some merit too, though here the author also appears to be confused about some things. Obviously "cat" is a very poor choice of example, as he later admits directly. Also, if the author believes that POSIX allows only the environment to specify the encoding (as in the first reason), then, well, it seems inconsistent that he would believe it makes any sense to honour the encoding internally specified in each file being processed (especially when it is common and useful to combine the contents of many files into one stream).

I think somewhat clearer (and more protocol-independent, or at least protocol-aware) guidance is given right at the end of the "official" FAQ:

        <URL:http://www.unicode.org/faq/utf_bom.html#BOM>

I think the most important thing to keep in mind is that one must be aware of the current protocol for the data involved, and if there is none, implicitly or explicitly, then perhaps someone should think about defining a protocol to be used.

If I'm not mistaken, even Plan 9's exclusive use of UTF-2/UTF-8 assumed that all files on the system would be either ASCII (i.e. plain and true 7-bit US ASCII, _not_ Latin-1 or any other 8-bit ASCII extension) or UTF-2 encoded Unicode, and never anything else. Perhaps this is what Joachim meant in part, and he simply mis-spoke by saying "Latin-1" when he really meant 7-bit US ASCII?

In my opinion, and if I could start from scratch, then I would say that if one wishes to mix files with different encodings on one given system, then perhaps one should require that they all, always, be stored in 7-bit US-ASCII representations with appropriate MIME headers to specify their encoding and content-type. [0.5 :-)]
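(Purely illustrative, and not part of the original mail: the kind of MIME labelling being alluded to would be something like the following, with the body itself kept 7-bit clean.)

        Content-Type: text/plain; charset=ISO-8859-1
        Content-Transfer-Encoding: quoted-printable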


Where you get into headaches is with things like ISO-8859-1, Shift-JIS, etc. These legacy encodings don't provide any way to tell what they are. There are heuristics one can use to guess, but those require a decent amount of input to get it right. So in general, for file processing you have to give tools some way to know "transcode this from XXX" when they're working on anything except ASCII.
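(For example -- a minimal sketch of mine, not from the thread -- that "transcode this from XXX" hint is exactly what iconv(3) needs from the caller:)

#include <iconv.h>
#include <stddef.h>

/*
 * Sketch only: convert a buffer from a caller-named legacy encoding
 * ("SHIFT_JIS", "ISO-8859-1", ...) into UTF-8.  The caller has to supply
 * the source encoding name -- there is no reliable way to guess it.
 */
static int
to_utf8(const char *from_enc, char *in, size_t inlen,
    char *out, size_t outlen)
{
    iconv_t cd = iconv_open("UTF-8", from_enc);
    size_t r;

    if (cd == (iconv_t)-1)
        return -1;                      /* unknown encoding name */

    /* The const-ness of the source-buffer argument differs between
     * systems (e.g. NetBSD vs. glibc), hence the cast. */
    r = iconv(cd, (void *)&in, &inlen, &out, &outlen);
    iconv_close(cd);

    return r == (size_t)-1 ? -1 : 0;    /* -1 on invalid input */
}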

BTW: is there anything except 7 bits for ASCII? Isn't "7-bit ASCII" redundant then, i.e. as a standard?

James

