tech-userlevel archive


Re: [PATCH] replace 0xA0 to whitespace in plain text files (part 2)



On Fri, Sep 12, 2008 Greg A. Woods wrote:

Perhaps this is what Joachim meant in part and simply mis-spoke by saying 
"Latin-1" and really meant 7-bit US ASCII?

Latin-1 seems to be the most widely used 8-bit character encoding, or
at least the one I see most often in Python, but it's not that important
to me whether the default is 7-bit ASCII or something else. My main
point was that there should be a default for the case where we detect a
text file in which one of the bytes has the 8th bit set. If the default
were 7-bit ASCII, such a byte would be an error.
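
As a rough sketch of that idea in Python: the names has_high_bit and
decode_text, and the choice of Latin-1 as the fallback, are illustrative
assumptions and not part of the patch under discussion.

```python
def has_high_bit(data: bytes) -> bool:
    """Return True if any byte in data has the 8th bit set (>= 0x80)."""
    return any(b >= 0x80 for b in data)


def decode_text(data: bytes, default_encoding: str = "latin-1") -> str:
    """Decode pure 7-bit data as ASCII, otherwise use the default encoding.

    With default_encoding="ascii", a byte with the 8th bit set raises
    UnicodeDecodeError, which is the "error" case for 7-bit ASCII.
    """
    if not has_high_bit(data):
        return data.decode("ascii")
    return data.decode(default_encoding)
```

With the Latin-1 default, the byte 0xA0 from the subject line decodes to
U+00A0 (no-break space); with an ASCII default the same input is
rejected.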

The BOM, OTOH, doesn't solve the problem everywhere, of course. The
fact that a file is a text file encoded in a certain way is actually
information about the file, and it should be stored in the file's
metadata rather than in the byte stream itself. On most filesystems
this knowledge has to come from somewhere else: the application has to
correctly specify the mode when opening the file (e.g. 'rb' or 'r') and
do the decoding itself. The mode passed to fopen makes no difference on
Unix, only on Windows. The BOM is only a vehicle to help guess the
encoding and the byte order.
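
To illustrate how a BOM can serve as a hint when nothing else is known,
here is a minimal Python sketch. The function name sniff_bom and the
Latin-1 fallback are assumptions made for the example; the BOM constants
come from the standard codecs module.

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32-LE BOM (FF FE 00 00)
# starts with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes, default: str = "latin-1"):
    """Guess the encoding from a leading BOM.

    Returns (encoding_name, bom_length). Without a BOM, the caller's
    default is returned with length 0; this is a guess, not a proof.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return default, 0


def read_text(path: str, default: str = "latin-1") -> str:
    """Open in binary mode ('rb') and do the decoding explicitly."""
    with open(path, "rb") as f:
        data = f.read()
    encoding, skip = sniff_bom(data, default)
    return data[skip:].decode(encoding)
```

Opening with 'rb' keeps the bytes untouched on every platform, so the
decoding step is the same on Unix and on Windows.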

Joachim


