tech-userlevel archive


Re: [PATCH] replace 0xA0 to whitespace in plain text files (part 2)

On 12-Sep-08, at 5:05 AM, Alan Barrett wrote:

On Fri, 12 Sep 2008, Joachim König wrote:
But at least, we could make the UTF-8 encoding explicit by including
the BOM (byte order mark) at the beginning of such a file. It is the
byte sequence 0xEF 0xBB 0xBF.

There are (IMHO good) arguments against including BOM in UTF-8.  For
example, see <>.

In my opinion those are really only arguments against _requiring_ use of a BOM on all UTF-8 files.

The first reason given ("On POSIX systems, the locale and not magic file type codes define the encoding of plain text files.") is either a misinterpretation of strict POSIX requirements, or clear evidence that POSIX is so totally broken in this respect that it _must_ be ignored. The data must define its own encoding if multiple encodings are to be supported. The environment must not define the encoding of arbitrary data from arbitrary sources. See all the issues surrounding e-mail for one good example of how and why this must be. Note also that there's a difference here between data and programs w.r.t. what POSIX is saying about the locale settings. The locale tells the program in what language and encoding to present program output to the user. A static data file does not get to choose its language and presentation when the user looks at it.
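To make the distinction concrete, here is a minimal Python sketch of "the data defines its own encoding, with the environment only as a fallback". The function name guess_encoding and its fallback parameter are hypothetical, made up for illustration; only a UTF-8 BOM is checked, and a real tool would need a fuller protocol than this.

```python
import codecs
import locale


def guess_encoding(data, fallback=None):
    """Pick an encoding for `data` (bytes).

    If the data carries a UTF-8 BOM, trust the data itself.
    Otherwise fall back to the caller's choice, or to the locale.
    """
    if data.startswith(codecs.BOM_UTF8):
        # "utf-8-sig" decodes UTF-8 and drops one leading BOM.
        return "utf-8-sig"
    if fallback is not None:
        return fallback
    # The environment decides only when the data says nothing.
    return locale.getpreferredencoding(False)


signed = codecs.BOM_UTF8 + "caf\u00e9".encode("utf-8")
text = signed.decode(guess_encoding(signed, fallback="latin-1"))
```

Here the BOM wins over the (deliberately wrong) "latin-1" fallback, so the text decodes correctly regardless of the caller's environment.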

The second reason given ("Adding a UTF-8 signature at the start of a file would interfere with many established conventions such as the kernel looking for “#!” at the beginning of a plaintext executable to locate the appropriate interpreter.") obviously has some merit, since sadly the BOM is only recognized as such at the very beginning of a file or stream _and_ its value collides with that of the ZWNBSP character when it appears later in the data. Clearly the designers of these things didn't consider the possibility of one system holding files in many different encodings, where users might run something like "sed 's/a/b/' * | awk '$0 ~ /blah/ {print}'" (i.e. what happens if "blah" here is non-ASCII and is thus interpreted in the user's current locale?).
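The "#!" collision is easy to show at the byte level. The following Python sketch only mimics the check on the first two bytes; the helper name starts_with_shebang is hypothetical and this is not how any particular kernel implements it.

```python
import codecs

# U+FEFF encoded as UTF-8 is the three-byte signature 0xEF 0xBB 0xBF.
BOM = "\ufeff".encode("utf-8")


def starts_with_shebang(data):
    """Mimic an interpreter lookup that inspects only the first two bytes."""
    return data[:2] == b"#!"


script = b"#!/bin/sh\necho hello\n"
plain_ok = starts_with_shebang(script)        # True
bom_ok = starts_with_shebang(BOM + script)    # False: the file now starts with 0xEF 0xBB
```

With the signature prepended, the first two bytes are 0xEF 0xBB rather than "#!", so a check of this kind no longer matches.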

The third reason given ("Handling BOMs properly would add undesirable complexity even to simple programs like cat or grep that mix contents of several files into one.") may have some merit too, though here the author also appears to be confused about some things. Obviously "cat" is a very poor choice of example, as he later admits directly. Also, if the author believes that POSIX allows only the environment to specify the encoding (as in the first reason), then it seems inconsistent for him to believe that it makes any sense to honour the encoding internally specified in each file being processed (especially when it is common and useful to combine the contents of many files into one stream).
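For what it's worth, the complexity in question looks roughly like the Python sketch below. Both function names are hypothetical, and the "BOM-aware" variant assumes every input is UTF-8, which is exactly the kind of protocol assumption a byte-transparent tool like cat does not get to make.

```python
import codecs


def concat_naive(files):
    """Byte-for-byte concatenation, as a cat-like tool would do."""
    return b"".join(files)


def concat_bom_aware(files):
    """Drop a leading UTF-8 BOM from each input before joining."""
    out = []
    for data in files:
        if data.startswith(codecs.BOM_UTF8):
            data = data[len(codecs.BOM_UTF8):]
        out.append(data)
    return b"".join(out)


a = codecs.BOM_UTF8 + b"one\n"
b = codecs.BOM_UTF8 + b"two\n"

# The second file's signature survives mid-stream, where it reads as ZWNBSP.
naive_text = concat_naive([a, b]).decode("utf-8-sig")
aware_text = concat_bom_aware([a, b]).decode("utf-8")
```

In the naive result the first BOM is consumed by the decoder but the second remains in the middle of the text as U+FEFF; the BOM-aware variant yields clean text at the cost of no longer being a pure byte copy.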

I think somewhat clearer (and more protocol independent, or at least protocol aware) guidance is given right at the end of the "official" FAQ:


I think the most important thing to keep in mind is that one must be aware of the current protocol for the data involved, and if there is none, implicitly or explicitly, then perhaps someone should think about defining a protocol to be used.

If I'm not mistaken, even Plan 9's exclusive use of UTF-2/UTF-8 assumed that all files on the system would be either ASCII (i.e. plain and true 7-bit US ASCII, _not_ Latin-1 or any other 8-bit ASCII extension) or UTF-2 encoded Unicode, and never anything else. Perhaps this is part of what Joachim meant, and he simply misspoke by saying "Latin-1" when he really meant 7-bit US ASCII?

In my opinion, and if I could start from scratch, then I would say that if one wishes to mix files with different encodings on one given system, then perhaps one should require that they all, always, be stored in 7-bit US-ASCII representations with appropriate MIME headers to specify their encoding and content-type. [0.5 :-)]

                                        Greg A. Woods; Planix, Inc.

Attachment: PGP.sig
Description: This is a digitally signed message part
