tech-userlevel archive


Re: [PATCH] replace 0xA0 to whitespace in plain text files (part 2)




On Sep 12, 2008, at 12:48 PM, Greg A. Woods; Planix, Inc. wrote:


On 12-Sep-08, at 5:05 AM, Alan Barrett wrote:

On Fri, 12 Sep 2008, Joachim König wrote:
But at least, we could make the UTF-8 encoding explicit by including
the BOM (byte order mark) at the beginning of such a file.  It is the
byte sequence 0xEF 0xBB 0xBF.

There are (IMHO good) arguments against including BOM in UTF-8.  For
example, see <http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucsutf>.

In my opinion those are really only arguments against _requiring_ use of a BOM on all UTF-8 files.

The first reason given ("On POSIX systems, the locale and not magic file type codes define the encoding of plain text files.") is either a mis-interpretation of strict POSIX requirements, or clear evidence that POSIX is so totally broken in this respect that it _must_ be ignored. The data must define its own encoding if multiple encodings are to be supported. The environment must not define the encoding of arbitrary data with arbitrary sources. See all the issues surrounding e-mail for one good example of how and why this must be.

Note also there's a difference here between data and programs w.r.t. what POSIX is saying about the locale settings. The locale tells the program in what language and encoding to present program output to the user. A static data file does not get to choose its language and presentation when the user looks at it.


Depends on what the file is used for. It's entirely valid for a given application to say "XXX conf file must be ASCII only".

For general text files which are read by anything, then yes, a BOM is likely valid. There are cases where BOMs don't work, however (which is why they are optional).

In general a BOM is only useful for UTF-16 anyway, since there it is needed to distinguish endianness. For UTF-8 it's pretty much superfluous.
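(Not from the thread, just an illustration of that point: U+FEFF can be serialized in two byte orders in UTF-16, so a leading BOM disambiguates them, while UTF-8 has exactly one serialization of it.  A rough C sketch of classifying a buffer by its leading bytes:)

#include <stddef.h>

/* Illustrative only: classify a buffer's leading bytes by BOM. */
const char *
bom_kind(const unsigned char *b, size_t n)
{
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return "UTF-8";      /* only one possible byte order */
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return "UTF-16BE";
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return "UTF-16LE";   /* also the prefix of a UTF-32LE BOM */
    return "no BOM";
}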

The second reason given ("Adding a UTF-8 signature at the start of a file would interfere with many established conventions such as the kernel looking for “#!” at the beginning of a plaintext executable to locate the appropriate interpreter.") obviously has some merit, since sadly the BOM is only recognized to be at the very beginning of a file or stream _and_ its value collides with that of the ZWNBSP character when it appears later in the data. Clearly the designers of these things didn't consider the possibility of one system having files of many different encodings, where users might do something like "sed 's/a/b/' * | awk '$0 ~ /blah/ {print}'" (i.e. what if "blah" here is non-ASCII and thus interpreted in the user's current locale?)


If the shell were updated to interpret UTF-8, the general practice of other applications is to do one of these:

1. Document that the file must be encoded as UTF-8 (and cannot carry a BOM when it is directly executable, due to legacy requirements like kernel exec support).

2. Add a flag indicating "this script is UTF-8 encoded" and otherwise assume it is ASCII, per legacy behaviour. This is often what happens when converting tools whose legacy input formats cannot be changed.

The kernel could also be modified to eat the BOM before looking for #! when doing exec, and then let the given application see it as well. Any of these can be supported.
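(As a rough sketch only -- not actual kernel code, and the function name is invented for illustration -- that check might look something like this:)

#include <string.h>

/*
 * Hypothetical sketch: skip a leading UTF-8 BOM (0xEF 0xBB 0xBF), if any,
 * before testing whether an exec image starts with "#!".
 */
static int
starts_with_shebang(const unsigned char *buf, size_t len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
    size_t off = 0;

    if (len >= 3 && memcmp(buf, bom, 3) == 0)
        off = 3;                /* eat the BOM */

    return len >= off + 2 && buf[off] == '#' && buf[off + 1] == '!';
}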


The third reason given ("Handling BOMs properly would add undesirable complexity even to simple programs like cat or grep that mix contents of several files into one. ") may have some merit too, though here the author also appears to be confused about some things. Obviously "cat" is a very poor choice of example, as he later admits directly. Also, if the author believes that POSIX allows only the environment to specify the encoding (as in the first reason), then, well, it seems inconsistent that he would believe it makes any sense to honour the encoding internally specified in each file being processed (especially when it is common and useful to combine the contents of many files into one stream).

I think somewhat clearer (and more protocol-independent, or at least protocol-aware) guidance is given right at the end of the "official" FAQ:

        <URL:http://www.unicode.org/faq/utf_bom.html#BOM>

I think the most important thing to keep in mind is that one must be aware of the current protocol for the data involved, and if there is none, implicitly or explicitly, then perhaps someone should think about defining a protocol to be used.

If I'm not mistaken, even Plan 9's exclusive use of UTF-2/UTF-8 assumed that all files on the system would be either ASCII (i.e. plain and true 7-bit US ASCII, _not_ Latin-1 or any other 8-bit ASCII extension) or UTF-2 encoded Unicode, and never anything else. Perhaps this is what Joachim meant in part, and he simply mis-spoke by saying "Latin-1" when he really meant 7-bit US ASCII?

In my opinion, and if I could start from scratch, then I would say that if one wishes to mix files with different encodings on one given system, then perhaps one should require that they all, always, be stored in 7-bit US-ASCII representations with appropriate MIME headers to specify their encoding and content-type. [0.5 :-)]
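(Purely illustrative, and not part of the original mail: the kind of MIME labelling being alluded to would be something like the following, with the body itself kept 7-bit clean.)

        Content-Type: text/plain; charset=ISO-8859-1
        Content-Transfer-Encoding: quoted-printable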


Where you get into headaches is with things like ISO-8859-1, Shift-JIS, etc. These legacy encodings don't provide any way to tell what they are. There are heuristics one can use to guess, but those require a decent amount of input to get it right. So in general, for file processing you have to give tools some way to know "transcode this from XXX" when they're working on anything except ASCII.
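(For example -- a minimal sketch of mine, not from the thread -- that "transcode this from XXX" hint is exactly what iconv(3) needs from the caller:)

#include <iconv.h>
#include <stddef.h>

/*
 * Sketch only: convert a buffer from a caller-named legacy encoding
 * ("SHIFT_JIS", "ISO-8859-1", ...) into UTF-8.  The caller has to supply
 * the source encoding name -- there is no reliable way to guess it.
 */
static int
to_utf8(const char *from_enc, char *in, size_t inlen,
    char *out, size_t outlen)
{
    iconv_t cd = iconv_open("UTF-8", from_enc);
    size_t r;

    if (cd == (iconv_t)-1)
        return -1;                      /* unknown encoding name */

    /* The const-ness of the source-buffer argument differs between
     * systems (e.g. NetBSD vs. glibc), hence the cast. */
    r = iconv(cd, (void *)&in, &inlen, &out, &outlen);
    iconv_close(cd);

    return r == (size_t)-1 ? -1 : 0;    /* -1 on invalid input */
}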

BTW: is there anything except 7 bits for ASCII? Isn't "7-bit ASCII" redundant then, i.e. as a standard?

James

