
Re: [PATCH] replace 0xA0 to whitespace in plain text files (part 2)




On 15-Sep-08, at 4:00 PM, Joachim König wrote:

> On Fri, Sep 12, 2008, Greg A. Woods wrote:
>
>> Perhaps this is what Joachim meant in part, and simply mis-spoke by
>> saying "Latin-1" when he really meant 7-bit US-ASCII?
>
> Latin-1 seems to be the most widely used 8-bit character encoding (at
> least it's the most used 8-bit encoding in Python), but it's not that
> important to me whether it's 7-bit ASCII or something else.  My main
> point was that there should be a default for the case where we detect
> a text file in which one of the bytes has the 8th bit set.  In the
> case of 7-bit ASCII that would be an error.

Indeed. That's exactly my point. Since you cannot know what charset or encoding is intended, an error is exactly the right response. Only 7-bit ASCII can be safely assumed to be universal (in POSIX, I think).
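To make that concrete, the test amounts to something like this minimal C sketch (the function name is just an invented example):

    #include <stdio.h>

    /*
     * Return 0 if every byte in the stream is 7-bit US-ASCII,
     * or -1 at the first byte with the 8th bit set.
     */
    static int
    check_7bit_ascii(FILE *fp)
    {
            int c;

            while ((c = getc(fp)) != EOF) {
                    if (c & 0x80)
                            return -1;  /* not 7-bit ASCII: refuse, don't guess */
            }
            return 0;
    }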

Of course Unix has always been a bit two-faced about this issue given that it has always intentionally treated text files and byte streams identically, to the occasional dismay of users of terminals with alternate character sets or such. :-)

> The BOM, OTOH, doesn't solve the problem everywhere, of course.

Indeed -- it requires one to make a whole new set of assumptions.
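Detecting the UTF-8 BOM itself is trivial -- it's just the fixed byte sequence 0xEF 0xBB 0xBF at the front of the stream -- it's the assumptions it carries that are the hard part. A rough sketch (assuming a seekable stream; rewind() does nothing useful on a pipe):

    #include <stdio.h>
    #include <string.h>

    /*
     * Return 1 and leave the stream positioned after the BOM if the
     * file starts with 0xEF 0xBB 0xBF; otherwise rewind and return 0.
     */
    static int
    has_utf8_bom(FILE *fp)
    {
            unsigned char buf[3];

            if (fread(buf, 1, sizeof(buf), fp) == sizeof(buf) &&
                memcmp(buf, "\xef\xbb\xbf", 3) == 0)
                    return 1;
            rewind(fp);
            return 0;
    }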

> The fact that a file is a text file encoded in a certain way is
> actually information about the file and should be stored somewhere in
> the metadata of the file, not in the byte stream itself.

That's the other thing -- Unix/POSIX systems, rightly or wrongly, don't have this sophisticated level of metadata about file content stored in the filesystem. Unix systems have the tradition, and the limitation, that the content of a file must be the sole source of information about its own encoding, charset, form, and function. There isn't even any distinction between binary and text content. The user must decide how to use file content, though the user is given tools such as "strings" to help examine and classify unknown content.

The ideal solution, given these limitations, is to maintain the illusion that there is effectively only one true encoding and charset (for "text" files, at least), just as Unix has always done, and as Plan 9 does anew with UTF-8 (where ASCII is a proper subset, keeping backwards compatibility). To do that, one might ensure that all tools which can fetch data from foreign systems convert it to the native system's encoding and charset where appropriate. Never store, or at least never try to use, foreign-encoded data.
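On POSIX-ish systems that convert-at-the-border step could be done with iconv(3); a minimal sketch (the sample string and buffer size are just illustrations, and some older iconv implementations declare the input pointer const):

    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
            char in[] = "caf\xe9";          /* "café" in ISO-8859-1 */
            char out[16];
            char *inp = in, *outp = out;
            size_t inleft = strlen(in), outleft = sizeof(out) - 1;
            iconv_t cd;

            if ((cd = iconv_open("UTF-8", "ISO-8859-1")) == (iconv_t)-1)
                    return 1;
            if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
                    return 1;
            *outp = '\0';                   /* now native UTF-8 */
            printf("%s\n", out);
            iconv_close(cd);
            return 0;
    }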

In that scenario, one can use UTF-8 Unicode and fall back on 7-bit ASCII, since 7-bit ASCII is a true subset of UTF-8. Just like Plan 9 -- they did it right over a decade ago.
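The subset property is easy to see in code: any byte below 0x80 is a complete one-byte UTF-8 sequence, so a pure-ASCII file validates unchanged. A simplified well-formedness check (it deliberately skips the overlong and surrogate corner cases):

    #include <stddef.h>

    static int
    is_valid_utf8(const unsigned char *s, size_t len)
    {
            size_t i = 0, n;

            while (i < len) {
                    if (s[i] < 0x80)                n = 1;  /* plain ASCII */
                    else if ((s[i] & 0xE0) == 0xC0) n = 2;
                    else if ((s[i] & 0xF0) == 0xE0) n = 3;
                    else if ((s[i] & 0xF8) == 0xF0) n = 4;
                    else return 0;                          /* bad lead byte */
                    if (i + n > len)
                            return 0;                       /* truncated sequence */
                    for (size_t j = 1; j < n; j++)
                            if ((s[i + j] & 0xC0) != 0x80)
                                    return 0;               /* bad continuation */
                    i += n;
            }
            return 1;
    }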

The _sanest_ alternatives would be to add content-specifying metadata to the filesystem, and all the tools necessary to make sure it's always set and used as correctly as possible; or to force all text files to be wrapped in MIME (and everything else to be treated as binary). Anyone for a MIME processor in STDIO? :-)
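As a sketch of the metadata route, extended attributes could carry such a label today -- here using Linux's setxattr(2)/getxattr(2) for illustration (NetBSD's extattr(2) interface differs, and both the file name and the "user.charset" attribute name are invented conventions, not any standard):

    #include <sys/xattr.h>
    #include <stdio.h>

    int
    main(void)
    {
            char buf[64];
            ssize_t n;

            /* Tag the (hypothetical) file, then read the tag back. */
            if (setxattr("notes.txt", "user.charset", "UTF-8", 5, 0) == -1)
                    return 1;
            n = getxattr("notes.txt", "user.charset", buf, sizeof(buf) - 1);
            if (n == -1)
                    return 1;
            buf[n] = '\0';
            printf("notes.txt charset: %s\n", buf);
            return 0;
    }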

--
                                        Greg A. Woods; Planix, Inc.
                                        <woods%planix.ca@localhost>





