tech-userlevel archive


Re: wide characters and i18n



> [...] I have never really seen anyone explain the answers to some
> basic questions:

> - How, exactly, are UTF-8 and Unicode related?
> - What exactly is a "code point"?

Unicode is a character set: a mapping between small(ish) integers and
"characters" - which here means some kind of abstraction of the
marks-on-paper that non-computer writing uses so heavily.  (I say
"abstraction" because there is a sense in which, for example, all
lowercase "j" characters are the same regardless of which font, size,
etc. is used; it is that abstract common entity that I mean by
"character" here.)

The characters are things like "Latin lowercase j" or "Devanagari ta" or
"Greek uppercase xi".  The integers are in the range 0 to 0x10FFFF (a
little over a million).  Treating them as 0 to 65535 is a workable
approximation for purposes of this discussion, though anything that
wants to hold an arbitrary code point in one unit needs more than 16
bits; most of what I have to write here is independent of the exact
range.

A code point is one of those integers.
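
As a rough illustration (Python purely for convenience; nothing here
depends on the language), the standard unicodedata module will show the
integer behind a character.  The three sample characters are my own
picks to match the names used above, not anything special.

```python
import unicodedata

# Sample characters chosen for illustration only:
# Latin lowercase j, Devanagari ta, Greek uppercase xi.
samples = ["j", "\u0924", "\u039e"]

for ch in samples:
    cp = ord(ch)              # character -> code point (a plain int)
    assert chr(cp) == ch      # code point -> character
    print(f"U+{cp:04X}  {unicodedata.name(ch)}")
```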

UTF-8 is a way of encoding a character stream - well, really, a code
point stream, but the distinction between characters and the code
points representing them is often blurred - into an octet stream.  A
stream of Unicode code points is, conceptually, a stream of these
small(ish) integers.  Since there are more than 256 of them, they can't
be mapped to an octet stream as trivially as 8-bit sets like ISO-8859-1
or KOI-8 can (or 7-bit sets like ASCII).  UTF-8 has a variety of
interesting and important properties, some nice and some less nice, but
those aren't terribly relevant to your question, so I'll leave them for
another time.
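
For the curious, here is a minimal sketch of the code-point-to-octets
step.  The function name utf8_encode is made up for this example, and
the sketch skips checks a real encoder needs (it does not reject
surrogates, for one); it is only meant to show that one code point
turns into one to four octets.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single code point as UTF-8 (sketch; no surrogate check)."""
    if cp < 0:
        raise ValueError("code point out of range")
    if cp < 0x80:                      # 7 bits: one octet, same as ASCII
        return bytes([cp])
    if cp < 0x800:                     # up to 11 bits: two octets
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # up to 16 bits: three octets
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                 # up to 21 bits: four octets
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")


# Cross-check against Python's built-in encoder for a few code points.
for cp in (0x6A, 0xE9, 0x924, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```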

> - What, exactly, do people mean by "normalization" in this context?

Unicode has something called combining characters.  These are a little
like dead accent keys on keyboards, except that the accent comes after
the letter it modifies rather than before - the idea being that you can
represent an e-acute as the two-character sequence <lowercase e>
<combining acute accent>.  (You wouldn't usually do so in this
particular case, because there is an e-acute character already, but may
have to if you want something like a circumflex over a dollar sign.)

Normalization is the process of finding such cases where a character
may be represented more than one way and converting them to some
uniform representation (so that all e-acutes, for example, are
represented the same way).  Which representation you pick is not
important - well, it's plenty important from many points of view, but
it's not important from the point of view of explaining the concept of
normalization.
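
A small sketch of the idea, using Python's standard unicodedata module.
NFC and NFD are two of the normalization forms Unicode defines; picking
them here is just for illustration.  Note that in Unicode the combining
mark follows the base character it modifies.

```python
import unicodedata

precomposed = "\u00e9"     # e-acute as a single code point
decomposed = "e\u0301"     # lowercase e, then combining acute accent

# Same character to a human, different code point sequences to a program.
assert precomposed != decomposed

# NFC prefers precomposed forms; NFD prefers decomposed ones.
nfc = [unicodedata.normalize("NFC", s) for s in (precomposed, decomposed)]
nfd = [unicodedata.normalize("NFD", s) for s in (precomposed, decomposed)]
assert nfc[0] == nfc[1] == precomposed
assert nfd[0] == nfd[1] == decomposed

# There is no precomposed dollar-with-circumflex, so even NFC leaves
# this one as two code points.
dollar_hat = "$\u0302"
assert unicodedata.normalize("NFC", dollar_hat) == dollar_hat
```

Whichever form you settle on, the point is to apply it before comparing
or hashing strings, so that both spellings of e-acute end up equal.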

> - How do locales interoperate with UTF-8/Unicode?

They are mostly orthogonal.  Locales are things like "does money get
printed with $ or £ or ¥ or Rs or what" and "does the number 1234567
get textified as 1,234,567 or 1 234 567 or 123 4567 or what" (to pick
two of the simplest examples).  Unicode and UTF-8 are relevant when you
try to represent those alternatives, but they aren't relevant to
picking which alternative to use (well, except that you presumably
aren't interested in supporting alternatives that call for characters
you don't have, an issue which mostly goes away with Unicode).
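
To make the orthogonality concrete, here is a toy sketch.  The
CONVENTIONS table, its keys, and the pairing of currency symbols with
grouping styles are all invented for this example; they are not real
locale definitions, and a real program would use the platform's locale
machinery instead.  Picking the convention is a table lookup; the
encoding only shows up at the very end, when the text becomes octets.

```python
# Hypothetical conventions: name -> (currency symbol, separator, group size).
CONVENTIONS = {
    "comma-3": ("$", ",", 3),
    "space-3": ("\u00a3", " ", 3),   # pound sign
    "space-4": ("\u00a5", " ", 4),   # yen sign
}


def group_digits(n: int, sep: str, size: int) -> str:
    """Group the digits of a non-negative integer, counting from the right."""
    digits = str(n)
    groups = []
    while digits:
        groups.append(digits[-size:])
        digits = digits[:-size]
    return sep.join(reversed(groups))


def format_money(n: int, convention: str) -> str:
    """Choosing a convention involves no encoding at all."""
    symbol, sep, size = CONVENTIONS[convention]
    return symbol + group_digits(n, sep, size)


text = format_money(1234567, "space-3")   # a string of code points
octets = text.encode("utf-8")             # representation chosen at output
```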

> - And, most importantly: what do I, as a programmer, need to do to
>   make my application work with all of the above?

That varies drastically depending on what your application does and
what its target audience is.  I can't outline it all here; much of this
thread has been about various aspects of that very question.

> I'm not saying anyone should feel obligated to answer these questions
> (but, hey, if you have a good reference, I'd be glad to read it), but
> I'm trying to illustrate the information gap that prevents some
> people from participating in these discussions in a meaningful way.

I don't have a good reference to point you at.  Most of the above is
not stuff I got out of a reference; it's stuff picked up from many
assorted places over the years.  It's possibly relevant that one of my
better friends actually cares about Unicode, has put a significant
amount of his own time into working with the various bodies involved,
and such.  He taught me a nontrivial amount of the above.

> I try to be a good international citizen, I really do ... but in a
> practical sense it seems to be _so_ complicated

It is.  The world is a complicated place.  Trying to build software
that can deal with even a significant fraction of that complexity is
not a simple task.

> that I basically just punt and end up doing what I always do ... and
> it seems that as long as I'm 8-bit clean, that makes me and most of
> the Europeans happy enough (although it tends to piss off Japanese
> and Chinese users, and I'm sorry about that).

The "encoded characters are 1-to-1 with octets" assumption is a common
one, and, yes, it does tend to tick off those who use larger character
sets.  I have a mailing-list acquaintance living in Japan who routinely
omits the f when writing about shift-JIS. :)  This is part of the
reason that I wrote, upthread, that I think that if you want to use
Unicode more than trivially you should just bite the bullet and stop
working with octets except as an unfortunate I/O evil.  UTF-8 is a
valiant attempt to deal with the impedance mismatch between Unicode
character strings and octet strings, but it can't cure it.
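
A short sketch of what goes wrong, and of the "octets only at the I/O
boundary" approach.  The helper name first_chars and the sample string
are made up for this example.

```python
def first_chars(raw: bytes, n: int, encoding: str = "utf-8") -> bytes:
    """Keep the first n characters: decode on input, work on the
    string, encode again on output."""
    text = raw.decode(encoding)
    return text[:n].encode(encoding)


text = "\u65e5\u672c\u8a9e"      # three characters (the word "nihongo")
raw = text.encode("utf-8")       # nine octets

# Characters and octets are not 1-to-1.
assert len(text) == 3
assert len(raw) == 9

# Cutting the octet string at an arbitrary offset splits a character.
try:
    raw[:4].decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False
assert not split_ok

# Cutting the character string is safe.
assert first_chars(raw, 2).decode("utf-8") == text[:2]
```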

This is not to say that I am not guilty too.  I write a lot of code
with English strings and 8-bit chars and the assumption that a char
represents exactly one character.  I'm not happy about it either; when
I write my own OS (hah, right) I intend to do it righter.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse%rodents-montreal.org@localhost
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

