tech-userlevel archive


Re: wide characters and i18n



Ken Hornstein <kenh%pobox.com@localhost> wrote:

> I try to understand, I really do ... I've been trying to understand for
> approximately 10 years now.  But every time I try to read something written
> by someone who understands what is going on, I get lost, and I have never
> really seen anyone explain the answers to some basic questions:
> 
> - How, exactly, are UTF-8 and Unicode related? 
> - What exactly is a "code point"?

http://www.unicode.org/reports/tr17/ addresses both.
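
In short: a code point is the number Unicode assigns to an abstract
character, and UTF-8 is one of several ways to serialize that number
as bytes.  A minimal Python sketch of the distinction (the variable
names are only illustrative, nothing here comes from the report
itself):

```python
# One abstract character: LATIN SMALL LETTER A WITH DIAERESIS.
ch = "\u00e4"

# Its code point is just an integer in the Unicode code space.
code_point = ord(ch)

# Encoding forms turn that integer into a byte sequence.
as_utf8 = ch.encode("utf-8")         # two bytes
as_utf16be = ch.encode("utf-16-be")  # one 16-bit unit, big-endian

print(f"U+{code_point:04X}", as_utf8.hex(), as_utf16be.hex())
```

The same code point yields different byte sequences depending on the
encoding form, which is why "Unicode" and "UTF-8" are not synonyms.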


> - What, exactly, do people mean by "normalization" in this context?

E.g. the equivalence between the single character "a with diaeresis"
(U+00E4) and the two-character sequence "a" followed by "combining
diaeresis" (U+0061 U+0308).

http://unicode.org/reports/tr15/ has all the details.
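
As a rough illustration, here is a small Python sketch using the
standard unicodedata module.  NFC and NFD are two of the normalization
forms defined in that report; the variable names are mine:

```python
import unicodedata

composed = "\u00e4"     # single code point: a with diaeresis
decomposed = "a\u0308"  # "a" followed by combining diaeresis

# Compared code point by code point, the two strings differ ...
same_raw = composed == decomposed

# ... but they are canonically equivalent, so normalizing both to
# the same form makes them compare equal.
nfc = unicodedata.normalize("NFC", decomposed)  # composes
nfd = unicodedata.normalize("NFD", composed)    # decomposes

print(same_raw, nfc == composed, nfd == decomposed)
```

Which form to pick is an application decision; the point is only that
both sides of a comparison should use the same one.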


> - How do locales interoperate with UTF-8/Unicode?

They are orthogonal.

Your word for Friday, the format used to print a date, and your
culturally expected collation order all exist independently of any
coded character set.  So you have a ru_RU.KOI8-R locale, a
ru_RU.ISO8859-5 locale and a ru_RU.UTF-8 locale, each with its own
charset-specific encoding of the word for Friday and the appropriate
numeric tables to make two strings (as encoded in the locale's
charset) collate in the expected order.

If the (abstract) character set of your locale is covered by Unicode,
which is true for many locales, you have an *internal implementation
option*: write your locale definitions using Unicode to refer to
your characters, and then mass-produce all the other locales by
converting from Unicode to each locale's coded charset.

E.g. you can write in the template locale definition (expressed in
Unicode) that the abbreviated name for Friday is \u043f\u0442 and then
derive the actual values for the koi8-r, iso8859-5 and utf-8 locales
by doing the equivalent of iconv -f utf-16 -t $(locale_charset).  I
think this is what glibc does.

Alternatively you can just write all separate locale definitions in
their native charset.

-uwe


