Re: [PATCH] replace 0xA0 to whitespace in plain text files (part 2)

To: NetBSD Userlevel Technical discussion list <tech-userlevel%NetBSD.org@localhost>
Subject: Re: [PATCH] replace 0xA0 to whitespace in plain text files (part 2)
From: "Greg A. Woods; Planix, Inc." <woods%planix.ca@localhost>
Date: Fri, 12 Sep 2008 13:48:41 -0400


On 12-Sep-08, at 5:05 AM, Alan Barrett wrote:

On Fri, 12 Sep 2008, Joachim König wrote:

But at least, we could make the UTF-8 encoding explicit by including
the BOM (byte order mark) at the beginning of such a file. It is the
byte sequence 0xEF 0xBB 0xBF.


There are (IMHO good) arguments against including BOM in UTF-8.  For
example, see <http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucsutf>.

In my opinion those are really only arguments against _requiring_ use
of a BOM on all UTF-8 files.

The first reason given ("On POSIX systems, the locale and not magic
file type codes define the encoding of plain text files.") is either a
mis-interpretation of strict POSIX requirements, or clear evidence
that POSIX is so totally broken in this respect that it _must_ be
ignored. The data must define its own encoding if multiple encodings
are to be supported. The environment must not define the encoding of
arbitrary data with arbitrary sources. See all the issues surrounding
e-mail for one good example of how and why this must be. Note also
there's a difference here between data and programs w.r.t. what POSIX
is saying about the locale settings. The locale tells the program in
what language and encoding to present program output to the user. A
static data file does not get to choose its language and presentation
when the user looks at it.
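As a rough illustration of why the environment alone cannot be trusted
to name the encoding, here is a small Python sketch. The sample string
and the helper name decode_or_none are made up for this example; the
point is only that the same bytes mean different things (or nothing at
all) depending on which encoding the reader assumes.

```python
# Hypothetical sample: "a", NO-BREAK SPACE (U+00A0), "b", stored two ways.
as_utf8 = "a\u00a0b".encode("utf-8")      # bytes 61 C2 A0 62
as_latin1 = "a\u00a0b".encode("latin-1")  # bytes 61 A0 62


def decode_or_none(data, encoding):
    """Decode data with the assumed encoding; return None if it is invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# A reader in a Latin-1 locale sees two characters where one was meant.
utf8_read_as_latin1 = decode_or_none(as_utf8, "latin-1")

# A reader in a UTF-8 locale cannot decode the Latin-1 file at all,
# because a lone 0xA0 byte is not valid UTF-8.
latin1_read_as_utf8 = decode_or_none(as_latin1, "utf-8")
```

Nothing in either byte string says which reading is the intended one;
that information has to come from the data or from an agreed protocol.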

The second reason given ("Adding a UTF-8 signature at the start of a
file would interfere with many established conventions such as the
kernel looking for “#!” at the beginning of a plaintext executable to
locate the appropriate interpreter.") obviously has some merit, since
sadly the BOM is only recognized to be at the very beginning of a file
or stream _and_ its value collides with that of the ZWNBSP character
when it appears later in the data. Clearly the designers of these
things didn't consider the possibility of one system having files of
many different encodings and where users might do something like "sed
's/a/b/' * | awk '$0 ~ /blah/ {print}'" (i.e. what if "blah" here is
non-ASCII and thus interpreted in the user's current locale?)
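Both halves of that problem can be sketched in a few lines of Python.
This is only an illustration under simple assumptions: the function
starts_with_shebang is a stand-in for the kernel's check (it just
compares the first two bytes), and the script contents are invented.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def starts_with_shebang(data):
    """Stand-in for the kernel's test: the first two bytes must be '#!'."""
    return data[:2] == b"#!"


script = b"#!/bin/sh\necho hello\n"   # invented example script
script_with_bom = BOM + script

plain_ok = starts_with_shebang(script)           # '#!' is found
bom_ok = starts_with_shebang(script_with_bom)    # '#!' is hidden behind the BOM

# The same three bytes later in the data are not a signature any more:
# they decode to U+FEFF, i.e. ZERO WIDTH NO-BREAK SPACE.
mid_stream = (b"x" + BOM + b"y").decode("utf-8")
```

So a signature at the start hides the "#!", and the identical bytes
anywhere else are ordinary (if invisible) text.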

The third reason given ("Handling BOMs properly would add undesirable
complexity even to simple programs like cat or grep that mix contents
of several files into one. ") may have some merit too, though here the
author also appears to be confused about some things. Obviously "cat"
is a very poor choice of example, as he later admits directly. Also,
if the author believes that POSIX allows only the environment to
specify the encoding (as in the first reason), then, well, it seems
inconsistent that he would believe it makes any sense to honour the
encoding internally specified in each file being processed (especially
when it is common and useful to combine the contents of many files
into one stream).
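To make the concatenation case concrete, here is a minimal Python
sketch, assuming two small invented files that each begin with a UTF-8
BOM. The names naive_cat and bom_aware_cat are hypothetical helpers for
this example, not the behaviour of any real cat(1).

```python
import codecs

BOM = codecs.BOM_UTF8


def naive_cat(*files):
    """Byte-for-byte concatenation, with no knowledge of encodings."""
    return b"".join(files)


def bom_aware_cat(*files):
    """Drop a leading BOM from each input, then emit a single one."""
    bodies = [f[len(BOM):] if f.startswith(BOM) else f for f in files]
    return BOM + b"".join(bodies)


file_a = BOM + "first\n".encode("utf-8")
file_b = BOM + "second\n".encode("utf-8")

# "utf-8-sig" strips only a signature at the very start of the stream,
# so the second file's BOM survives as U+FEFF in the middle of the text.
naive_text = naive_cat(file_a, file_b).decode("utf-8-sig")
aware_text = bom_aware_cat(file_a, file_b).decode("utf-8-sig")
```

The second helper is short, but it only works because it has been told
that every input is UTF-8, which is exactly the protocol question.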

I think somewhat clearer (and more protocol independent, or at least
protocol aware) guidance is given right at the end of the "official"
FAQ:


        <URL:http://www.unicode.org/faq/utf_bom.html#BOM>

I think the most important thing to keep in mind is that one must be
aware of the current protocol for the data involved, and if there is
none, implicitly or explicitly, then perhaps someone should think
about defining a protocol to be used.

If I'm not mistaken even Plan 9's exclusive use of UTF-2/UTF-8 assumed
that all files on the system would be either ASCII (i.e. plain and
true 7-bit US ASCII, _not_ Latin-1 or any other 8-bit ASCII extension)
or UTF-2 encoded Unicode, and never anything else. Perhaps this is
what Joachim meant in part and simply mis-spoke by saying "Latin-1"
and really meant 7-bit US ASCII?

In my opinion, and if I could start from scratch, then I would say
that if one wishes to mix files with different encodings on one given
system, then perhaps one should require that they all, always, be
stored in 7-bit US-ASCII representations with appropriate MIME headers
to specify their encoding and content-type. [0.5 :-)]
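For what it's worth, such a self-describing 7-bit representation can be
sketched with Python's standard email package. This is only an
illustration of the idea, not a proposal for an on-disk format; the
sample text is invented.

```python
from email import message_from_string
from email.mime.text import MIMEText

original = "na\u00efve caf\u00e9"  # invented sample with non-ASCII characters

# Wrap the text with MIME headers that name its content-type and charset.
msg = MIMEText(original, "plain", "utf-8")
stored = msg.as_string()

# The stored form is pure 7-bit ASCII: headers plus an encoded body.
stored_is_ascii = stored.isascii()

# A reader needs no locale to get the text back; the headers say how.
parsed = message_from_string(stored)
charset = parsed.get_content_charset()
recovered = parsed.get_payload(decode=True).decode(charset)
```

The cost is obvious: the stored file is no longer readable with plain
text tools until something has decoded it.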


--
                                        Greg A. Woods; Planix, Inc.
                                        <woods%planix.ca@localhost>


Follow-Ups:
- Re: [PATCH] replace 0xA0 to whitespace in plain text files (part 2)
  - From: James Chacon

References:
- Re: [PATCH] replace 0xA0 to whitespace in plain text files (part 2)
  - From: Joachim König
- Re: [PATCH] replace 0xA0 to whitespace in plain text files (part 2)
  - From: Alan Barrett
