
Re: [PATCH] replace 0xA0 to whitespace in plain text files (part 2)




On 15-Sep-08, at 4:00 PM, Joachim König wrote:

> On Fri, Sep 12, 2008, Greg A. Woods wrote:
>
>> Perhaps this is what Joachim meant in part, and simply mis-spoke by
>> saying "Latin-1" when he really meant 7-bit US-ASCII?
>
> Latin-1 seems to be the most widely used 8-bit character encoding (at
> least it's the most used 8-bit encoding in Python), but it's not that
> important to me whether it's 7-bit ASCII or something else.  My main
> point was that there should be a default for the case where we detect
> a text file in which one of the bytes has the 8th bit set.  In the
> case of 7-bit ASCII that would be an error.

Indeed. That's exactly my point. Since you cannot know what charset or encoding is intended, an error is exactly the right response. Only 7-bit ASCII can be safely assumed to be universal (in POSIX, I think).
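To make that concrete, the test amounts to something like this minimal C sketch (the function name is just an invented example):

    #include <stdio.h>

    /*
     * Return 0 if every byte in the stream is 7-bit US-ASCII,
     * or -1 at the first byte with the 8th bit set.
     */
    static int
    check_7bit_ascii(FILE *fp)
    {
            int c;

            while ((c = getc(fp)) != EOF) {
                    if (c & 0x80)
                            return -1;  /* not 7-bit ASCII: refuse, don't guess */
            }
            return 0;
    }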

Of course Unix has always been a bit two-faced about this issue given that it has always intentionally treated text files and byte streams identically, to the occasional dismay of users of terminals with alternate character sets or such. :-)

> The BOM, OTOH, doesn't solve the problem everywhere, of course.

Indeed -- it requires one to make a whole new set of assumptions.
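Detecting the UTF-8 BOM itself is trivial -- it's just the fixed byte sequence 0xEF 0xBB 0xBF at the front of the stream -- it's the assumptions it carries that are the hard part. A rough sketch (assuming a seekable stream; rewind() does nothing useful on a pipe):

    #include <stdio.h>
    #include <string.h>

    /*
     * Return 1 and leave the stream positioned after the BOM if the
     * file starts with 0xEF 0xBB 0xBF; otherwise rewind and return 0.
     */
    static int
    has_utf8_bom(FILE *fp)
    {
            unsigned char buf[3];

            if (fread(buf, 1, sizeof(buf), fp) == sizeof(buf) &&
                memcmp(buf, "\xef\xbb\xbf", 3) == 0)
                    return 1;
            rewind(fp);
            return 0;
    }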

> The fact that a file is a text file encoded in a certain way is
> actually information about the file and should be stored somewhere in
> the metadata of the file, not in the byte stream itself.

That's the other thing -- Unix/POSIX systems, rightly or wrongly, don't have this sophisticated level of metadata about file content stored in the filesystem. Unix systems have the tradition, and the limitation, that the content of a file must be the sole source of information about its own encoding, charset, form, and function. There isn't even any distinction between binary and text content. The user must decide how to use file content, though the user is given tools such as "strings" to help examine and classify unknown content.

The ideal solution, given these limitations, is to maintain the illusion that there is effectively only one true encoding and charset (for "text" files, at least), just as Unix has always done, and as Plan 9 does anew with UTF-8 (where ASCII is a proper subset, keeping backwards compatibility). To do that, one might ensure that all tools which can fetch data from foreign systems convert it to the native system's encoding and charset where appropriate. Never store, or at least never try to use, foreign-encoded data.
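On POSIX-ish systems that convert-at-the-border step could be done with iconv(3); a minimal sketch (the sample string and buffer size are just illustrations, and some older iconv implementations declare the input pointer const):

    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
            char in[] = "caf\xe9";          /* "café" in ISO-8859-1 */
            char out[16];
            char *inp = in, *outp = out;
            size_t inleft = strlen(in), outleft = sizeof(out) - 1;
            iconv_t cd;

            if ((cd = iconv_open("UTF-8", "ISO-8859-1")) == (iconv_t)-1)
                    return 1;
            if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
                    return 1;
            *outp = '\0';                   /* now native UTF-8 */
            printf("%s\n", out);
            iconv_close(cd);
            return 0;
    }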

In that scenario, one can use UTF-8 Unicode and fall back on 7-bit ASCII, since 7-bit ASCII is a true subset of UTF-8. Just like Plan 9 -- they did it right over a decade ago.
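The subset property is easy to see in code: any byte below 0x80 is a complete one-byte UTF-8 sequence, so a pure-ASCII file validates unchanged. A simplified well-formedness check (it deliberately skips the overlong and surrogate corner cases):

    #include <stddef.h>

    static int
    is_valid_utf8(const unsigned char *s, size_t len)
    {
            size_t i = 0, n;

            while (i < len) {
                    if (s[i] < 0x80)                n = 1;  /* plain ASCII */
                    else if ((s[i] & 0xE0) == 0xC0) n = 2;
                    else if ((s[i] & 0xF0) == 0xE0) n = 3;
                    else if ((s[i] & 0xF8) == 0xF0) n = 4;
                    else return 0;                          /* bad lead byte */
                    if (i + n > len)
                            return 0;                       /* truncated sequence */
                    for (size_t j = 1; j < n; j++)
                            if ((s[i + j] & 0xC0) != 0x80)
                                    return 0;               /* bad continuation */
                    i += n;
            }
            return 1;
    }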

The _sanest_ alternatives would be to add content-specifying metadata to the filesystem, and all the tools necessary to make sure it's always set and used as correctly as possible; or to force all text files to be wrapped in MIME (and everything else to be treated as binary). Anyone for a MIME processor in STDIO? :-)
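As a sketch of the metadata route, extended attributes could carry such a label today -- here using Linux's setxattr(2)/getxattr(2) for illustration (NetBSD's extattr(2) interface differs, and both the file name and the "user.charset" attribute name are invented conventions, not any standard):

    #include <sys/xattr.h>
    #include <stdio.h>

    int
    main(void)
    {
            char buf[64];
            ssize_t n;

            /* Tag the (hypothetical) file, then read the tag back. */
            if (setxattr("notes.txt", "user.charset", "UTF-8", 5, 0) == -1)
                    return 1;
            n = getxattr("notes.txt", "user.charset", buf, sizeof(buf) - 1);
            if (n == -1)
                    return 1;
            buf[n] = '\0';
            printf("notes.txt charset: %s\n", buf);
            return 0;
    }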

--
                                        Greg A. Woods; Planix, Inc.
                                        <woods%planix.ca@localhost>





