tech-userlevel archive

Re: wide characters and i18n



Joerg Sonnenberger <joerg%britannica.bec.de@localhost> wrote:

> On Wed, Jul 14, 2010 at 07:38:42PM -0700, Erik Fair wrote:
> > I commend this well written paper to your attention:
> > 
> > http://plan9.bell-labs.com/sys/doc/utf.html
> 
> ...which is also simplistic in the assumption and problems faced. If you
> want to know about the issues with I18N and Unicode in specific, don't
> ask Americans. Don't ask Europeans either, they only have slightly more
> exposure to the problems.

I suppose you shouldn't ask Australians either, although I'm
Australian and have been mixed up with I18N issues on and off
including for Asian languages over the last 20 years or so,
and have got to see some of the problems first hand.

Since the Plan9 URL has been mentioned, I hope it's not too
off topic to say that I concur that that paper is too
simplistic about the advantages of Unicode and UTF-8, and that
the very same problems are present in Google's new Go
language, several of whose designers participated in the Plan9
work.

For anyone who's not interested in the gory details of this
sort of stuff, please stop reading now.  It only gets uglier:
the world is a complex place, and my Japanese friends have
even more objections to Unicode as "one size fits all" than I do,
which I won't attempt to explain here, even if I were sure I
remembered them all.

For anyone who is interested in why s/ASCII/Unicode/ isn't
quite enough to write applications for worldwide use (even
worldwide use in a single language, or even worldwide use
only in English!) here are a few points I find left out of
most discussions of Unicode.

The first two are points on which I disagree specifically with
the Plan 9 paper:

1. the decision not to address Unicode combining characters
2. the idea that the use of Unicode is sufficient excuse not to
   provide any of the functionality of locales

#1 means applications dealing with arbitrary Unicode data
(whether UTF-8 or not) must handle normalisation before even
being able to compare two strings for equality. (This is
progress?)
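
To make the equality problem concrete, here is a minimal sketch in
Python.  The helper name same_text() and the choice of NFC as the
common form are my own assumptions for illustration; any one form
would do as long as both sides use it.

```python
import unicodedata

def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after bringing both to one normalisation form.

    same_text is a hypothetical helper; NFC is an arbitrary default.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

precomposed = "caf\u00e9"   # 'e with acute' as a single code point
combining = "cafe\u0301"    # plain 'e' followed by COMBINING ACUTE ACCENT

print(precomposed == combining)            # False: different code points
print(same_text(precomposed, combining))   # True: equal once normalised
```

Both strings render identically, yet a naive comparison says they
differ.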

Even English has _some_ characters with accents, although they
are rare and English speakers have seemingly become very
tolerant of their loss in the computer age, so this isn't
"just" a problem for European languages.  (Never mind the
rudeness of arbitrarily dropping accents from characters in
people's names.)

For #2, the glaring breakages in almost any application are
threefold:

    a) how do you sort anything?

       Even presuming English-only I'd like dictionary order
       sometimes, and other times ASCII for consistency with
       other applications or printed material, if it has used
       ASCII order.

       Non-English languages of course have their own rules
       which should be respected, and given the number of
       languages in the world and variations in local
       preferences it is only practical to allow _users_ to
       define collation order if no pre-existing order matches
       their preference or has been created for their
       language.

    b) how can you (ever) localise error messages?

       It would be a reasonable argument to say that an error
       message catalogue can be implemented independently of
       POSIX style locales, but localisation of an application
       certainly requires translation of error messages and
       indeed most of a typical application's user interface.

    c) how do you handle varying date formats?

       If I had a dollar (anyone's dollar -- Australian,
       Canadian, Singaporean, USD, whatever) for each time
       I've seen a date and had to stop and evaluate whether
       it was more likely MM/DD/YY or DD/MM/YY I imagine I
       could have retired long since.
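
A small Python sketch of points (a) and (c).  The word list and the
date string are made up, and case folding is only a crude stand-in
for dictionary order, not real collation; locale-aware sorting (for
example via locale.strxfrm) depends on which locales are installed,
so I have left it out here.

```python
from datetime import datetime

words = ["banana", "Zebra", "apple", "Cherry"]

# (a) Plain code point order puts every capital before every
# lower-case letter; folding case gives something closer to what an
# English dictionary would do.
code_point_order = sorted(words)
folded_order = sorted(words, key=str.casefold)
print(code_point_order)   # ['Cherry', 'Zebra', 'apple', 'banana']
print(folded_order)       # ['apple', 'banana', 'Cherry', 'Zebra']

# (c) The same string is a valid date under both conventions, so
# nothing in the data itself says which reading was intended.
stamp = "03/04/10"
as_month_first = datetime.strptime(stamp, "%m/%d/%y").date()
as_day_first = datetime.strptime(stamp, "%d/%m/%y").date()
print(as_month_first)     # 2010-03-04
print(as_day_first)       # 2010-04-03
```

Neither sort order is "wrong", and neither date reading is either;
the application needs to be told which one the user expects.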

3. An issue of current day importance (although not relevant
   to Plan9, as it was an operating system) is how file
   systems handle Unicode.

For #3 Unix -- in theory -- isn't too bad: most of its file
systems will take a series of bytes, disallowing only '/'
(which is represented as itself in UTF-8, so not typically a
problem) and '\0' (which UTF-8 avoids, so not a problem
either).

Where problems arise is where file systems (such as the
default file system on OS X) transform file names: the file
name you passed as valid UTF-8 to open() or creat() may not be
the same series of bytes you get back when you use readdir()
to examine the files in the directory.  This makes for
"interesting times" for any software which wants to store a
list of file names and then access them.
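
A sketch of the mismatch and one defensive workaround, in Python.
The directory listing is simulated rather than read from disk, and
the assumption that the file system hands back a decomposed (NFD
style) name is for illustration only; find_stored_name() is a
hypothetical helper, not anything an operating system provides.

```python
import unicodedata
from typing import List, Optional

def find_stored_name(wanted: str, listing: List[str]) -> Optional[str]:
    """Return the entry in listing that matches wanted after normalisation."""
    key = unicodedata.normalize("NFC", wanted)
    for name in listing:
        if unicodedata.normalize("NFC", name) == key:
            return name
    return None

# The name the application passed when creating the file.
created = "r\u00e9sum\u00e9.txt"

# Simulated directory listing: same name, but decomposed by the
# (assumed) file system before being stored.
listing = [unicodedata.normalize("NFD", created)]

print(created in listing)                  # False: the bytes differ
stored = find_stored_name(created, listing)
print(stored is not None)                  # True: found via normalisation
```

The application then has to remember the name as returned by the
listing, not the name it originally asked for.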

> Itojun mentioned some of the issues in
> ftp://ftp.itojun.org/pub/paper/itojun-freenix2001-presen.ps.gz

Recommended.

My personal expectation is that -- like it or not -- Unicode
in the form of UTF-8 will be (if it isn't already) "the new
ASCII", but I _do_ wish that language (and operating system)
designers and vendors would:

i.   specify the normal form of "their" UTF-8 strings
     (and perhaps allow programmers to override the default)

ii.  provide support for conversion to and from "foreign"
     UTF-8 normalisation forms

iii. handle -- as gracefully as possible -- the existing file
     system file name issues, and vendors should be encouraged
     (severely, if that's what it takes) to allow file names
     in _any_ Unicode encoding, and provide means to read
     those file names "as written" (presumably: "as bytes,
     trust me, I know what I'm doing") as well as "in my
     preferred encoding" and with a choice of errors or "best
     effort" conversion where file names are unrepresentable
     (e.g. invalid UTF-8 sequence, code point doesn't fit into
     UTF-16, etc).
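
As a rough illustration of the "errors or best effort" choice in
(iii), here is a Python sketch.  decode_name() and its best_effort
flag are names I have invented; the point is only that the caller,
not the library, picks the policy.  (Python itself will return raw
bytes names if os.listdir() is given a bytes path, which is roughly
the "as written" access I am asking for.)

```python
def decode_name(raw: bytes, best_effort: bool = False) -> str:
    """Decode a file name given as bytes, assuming UTF-8 is preferred.

    Strict mode raises UnicodeDecodeError on invalid input; best-effort
    mode substitutes U+FFFD for each undecodable sequence.
    """
    return raw.decode("utf-8", errors="replace" if best_effort else "strict")

good = "na\u00efve.txt".encode("utf-8")
bad = b"caf\xe9.txt"    # a Latin-1 byte, not a valid UTF-8 sequence

print(decode_name(good))                    # decodes cleanly
print(decode_name(bad, best_effort=True))   # 'caf' + U+FFFD + '.txt'

try:
    decode_name(bad)                        # strict: refuses to guess
except UnicodeDecodeError as exc:
    print("unrepresentable:", exc.reason)
```

Best-effort output is lossy, so it is only fit for display; the
original bytes are still needed to open the file.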

Which still leaves open the problem of locales and issues of
multi-lingual documents and applications where a single
Unicode glyph really should be represented differently
depending upon what language it is being used for, but I did
say at the start of this too-lengthy message that the issues
get ugly.

The problems are hard; naïve (that's "naive" with a diaeresis
above the 'i', in case it was garbled en-route to you)
solutions will always be incomplete.  Sweeping the
incompleteness under the carpet with the words "Well, it works
for me" is ... unimpressive.

Giles

