tech-userlevel archive


Re: wide characters and i18n

On Jul 15, 2010, at 13:42, David Holland wrote:

> The problem with UTF-8 in Unix is that it doesn't actually solve the
> labeling problem: given comprehensive adoption you no longer really
> need to know what kind of text any given file or string is, but you
> still need to know if the file contains text (UTF-8 encoded symbols)
> or binary (octets), because not all octet sequences are valid UTF-8.
> I don't see a viable way forward that doesn't involve labeling
> everything.
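David's point that not every octet sequence is valid UTF-8 is easy to demonstrate. A quick sketch (Python, purely illustrative; it leans on the strict decoder to enforce the RFC 3629 rules rather than hand-rolling a validator):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8. Python's strict decoder
    enforces the RFC 3629 rules: continuation bytes must follow lead
    bytes, and overlong forms, surrogates, and code points above
    U+10FFFF are all rejected."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8(b"plain ASCII"))           # True: ASCII is a UTF-8 subset
print(looks_like_utf8("naïve".encode("utf-8")))  # True: é encoded as 0xC3 0xA9
print(looks_like_utf8(b"na\xefve"))              # False: same word in ISO-8859-1
```

The last case is exactly the labeling problem: a perfectly good ISO-8859-1 file is an invalid UTF-8 file, and nothing in the bytes themselves tells you which one you have.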

If your goal is to reach deterministic file-content nirvana, yes, that's the 
way to get there, but I'd argue it's an awful lot of work to deal with the M x 
N software problem I mentioned (we'd have to add a type field to inodes, which 
would reopen a very old debate about whether UNIX files should be just bags of 
bytes; the changes required for the full M x N case are pretty pervasive and 
invasive), and the easy counter-argument in an open source OS community is: 
"OK, who's going to write and test all that code?"

The Plan 9 people didn't shoot for a utopia - as is often their wont, they 
improved the situation a whole lot (Unicode/UTF-8 is a lot more expressive and 
encompassing of the possible space of human communications than ASCII or 
ISO-8859-1) with a relatively modest effort, and it's "good enough" for a much 
wider range of applications than the previous default of ASCII or ISO-8859-1 
(does sort(1) even work right with ISO-8859-1? The man page in NetBSD 5.0 is 
silent on that question, but given where the diacritical characters are in the 
ISO-8859-1 codeset space, I bet it doesn't collate properly with a straight 
byte-numerical sort).
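That collation hunch is easy to check without running sort(1) itself. A sketch (Python, illustrative; the word list is made up) of how byte-numerical ordering treats ISO-8859-1 accented letters, which sit in the 0xC0-0xFF range above every unaccented letter:

```python
# Compare words by their ISO-8859-1 byte values, which is what a
# straight byte-numerical sort (sort(1) in the "C" locale) does.
words = ["zebra", "été", "apple", "école"]
byte_sorted = sorted(words, key=lambda w: w.encode("latin-1"))

# é is 0xE9 in ISO-8859-1, well above 'z' (0x7A), so every accented
# word lands after the entire unaccented alphabet:
print(byte_sorted)  # → ['apple', 'zebra', 'école', 'été']
```

A French speaker would expect école and été between apple and zebra; byte order puts them after everything. (UTF-8 doesn't fix collation either, to be fair, but that's a locale problem, not an encoding problem.)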

The more I ponder this, the more I think that:

1. the ASCII default status quo isn't good enough any more (and I'm sure our 
users in South and East Asia, not to mention Eastern Europe, would agree),

2. Unicode/UTF-8 as a new default offers backward compatibility while expanding 
the character space quite broadly, and without anywhere near as much work (or 
as much paradigm shift, i.e. breaking "Unix files are a bag of bytes") on our 
part,

3. the "change the base software default" approach can allow us to examine and 
call out our software's implicit assumptions (e.g. "I'm operating on ASCII" or 
"I need to parse these bytes semantically") so that if/when we decide to make a 
run at the bigger "let's handle all character sets" M x N problem, we'll know 
much better what needs to be done.

4. we even have "late mover" advantage - the Plan 9 paper describes what they 
did, and there's standards work (hopefully sane) that we can draw on if we 
deem it worthwhile.
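The backward-compatibility claim in point 2 is worth making concrete, since it's what makes UTF-8 the low-pain default. A minimal illustration (Python, purely illustrative):

```python
# Any ASCII file is already byte-for-byte valid UTF-8, so existing
# 7-bit data and tools need no conversion at all:
ascii_text = "Unix files are a bag of bytes\n"
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# And UTF-8 never reuses the 0x00-0x7F range inside a multibyte
# sequence, so byte-oriented tools that scan for ASCII delimiters
# ('/', '\n', '\0') keep working unmodified on UTF-8 data:
assert all(b >= 0x80 for b in "é".encode("utf-8"))  # é → 0xC3 0xA9

print("ASCII round-trips through UTF-8 unchanged")
```

That second property (no ASCII bytes inside multibyte sequences) is the one the Plan 9 folks leaned on: the kernel's path parsing, and most of the userland, didn't have to change at all.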

Think of it as a stepwise refinement in the direction of character set 
processing nirvana. My concern is that if we scope the problem too large by 
trying to do everything, we'll never get it done, with lots of sturm und drang 
in the process.

        Erik <>
