tech-userlevel archive


Re: wide characters and i18n



On Wed, Jul 14, 2010 at 07:38:42PM -0700, Erik Fair wrote:
 > Theoretically, the POSIX locale stuff is supposed to handle things
 > beyond that, but it's a more complicated and subtle problem than those
 > POSIX committees really thought about.

Indeed.

 > I commend this well written paper to your attention:
 > 
 > http://plan9.bell-labs.com/sys/doc/utf.html
 > 
 > which discusses what the Plan 9 people (Rob Pike, Ken Thompson,
 > et al.) did about the software problem (and what they did about
 > it), and explicitly what they decided to punt on. A precis: "we
 > replaced the ASCII assumption with Unicode/UTF-8 because UTF-8 is a
 > proper superset of ASCII (i.e. backward compatible) and also
 > subsumes pretty much all other interesting character sets (with
 > some warts) so we can translate into it without (much) semantic
 > information loss."

The problem with UTF-8 in Unix is that it doesn't actually solve the
labeling problem: given comprehensive adoption you no longer really
need to know what kind of text any given file or string is, but you
still need to know if the file contains text (UTF-8 encoded symbols)
or binary (octets), because not all octet sequences are valid UTF-8.
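
A minimal sketch of that last point, in Python. The helper name
is_valid_utf8 is made up for illustration, and strict decoding is
only one way to run the check:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        # Strict decoding raises on any malformed sequence.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Well-formed text: ASCII plus one two-byte sequence (U+00E9).
text_bytes = "h\u00e9llo".encode("utf-8")

# Arbitrary octets: 0xFF can never appear in UTF-8, and 0xC0 0xAF is
# an overlong encoding of "/", which a strict decoder must reject.
binary_bytes = b"\xff\xfe\x00\x01"
overlong_bytes = b"\xc0\xaf"

# Octets that are binary in intent but still decode cleanly, because
# every byte below 0x80 is valid UTF-8 on its own.
lucky_binary = b"\x00\x01\x02\x7f"

for sample in (text_bytes, binary_bytes, overlong_bytes, lucky_binary):
    print(sample, is_valid_utf8(sample))
```

The first and last samples pass and the middle two fail. The last
one is the awkward case: a validity check can reject some binary
data, but passing it does not show that the contents were meant as
text.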

I don't see a viable way forward that doesn't involve labeling
everything.

-- 
David A. Holland
dholland%netbsd.org@localhost

