Current-Users archive


Re: IDN hostname resolution in NetBSD



On Fri, May 28, 2010 at 12:50:34AM -0400, Geoff Adams wrote:
 > UTF-8, though, is a fantastic thing. It's very well thought-out and
 > has several technical features that make it really useful. Combined
 > with the feature that US-ASCII, the ISO 8859-* character sets
 > (including 8859-1, aka Latin 1), and Unicode all have the same
 > first 128 code points, UTF-8's encoding of the first 128 code points
 > of Unicode using the exact same bytes means that any US-ASCII text
 > can be correctly interpreted as UTF-8. This makes it very easy to
 > take many systems that were implemented only understanding US-ASCII
 > and convert them to full Unicode support without breaking backward
 > compatibility. In some cases, code doesn't have to change at all.

If that were actually true, UTF-8 might actually be a good thing.

In practice, UTF-8 breaks Unix, because it violates fundamental
assumptions.

The most serious of these is the assumption that there is no need to
distinguish between binary files and text files because both are just
octet streams. Unfortunately, in a UTF-8 world, some octet streams
cannot legally be handled as text files. Coping with this requires
either sacrificing functionality or doing massive rewrites.
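
As a rough illustration of the point about octet streams, here is a
minimal sketch in Python (the language and the helper name
is_valid_utf8 are assumptions made for this example; nothing in the
thread specifies them). It checks whether a byte string can be decoded
as strict UTF-8. Pure US-ASCII passes, but Latin-1 text and arbitrary
binary data do not, so a tool that insists on valid UTF-8 input can no
longer treat every file as text.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the octets decode as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Pure US-ASCII: every byte is below 0x80, so it is also valid UTF-8.
ascii_text = b"plain US-ASCII text\n"

# Latin-1 text: the e-acute becomes the single byte 0xE9, which UTF-8 reads
# as the start of a three-byte sequence that never arrives.
latin1_text = "caf\u00e9\n".encode("latin-1")

# Arbitrary binary data: 0xFF and 0xFE never appear in well-formed UTF-8.
binary_blob = bytes([0x00, 0xFF, 0xFE, 0x80])

results = {
    "ascii": is_valid_utf8(ascii_text),
    "latin1": is_valid_utf8(latin1_text),
    "binary": is_valid_utf8(binary_blob),
}
print(results)
```

A program that must keep working on such input has to either reject
it, substitute replacement characters (losing data), or fall back to
treating the stream as raw bytes, which is the kind of trade-off
described above.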

Other severe problems arise when trying to use locales, because the
Unix locale system was grossly misdesigned; and for some reason it was
deemed desirable to conflate locale handling with character sets, so
you more or less can't use UTF-8 without getting sucked up in the
brokenness.

-- 
David A. Holland
dholland%netbsd.org@localhost

