tech-userlevel archive

Re: wide characters and i18n

On Fri, 16 Jul 2010, Ken Hornstein wrote:
> But every time I try to read something written
> by someone who understands what is going on, I get lost, and I have never
> really seen anyone explain the answers to some basic questions:

The "Terminology" section of the Wikipedia article on "Character encoding"
is not great, but it may help.

> - How, exactly, are UTF-8 and Unicode related? 

Unicode is a lot of things, but for the purposes of contrasting Unicode
with UTF-8, think of Unicode as a mapping from integers to characters,
where the integers run from 0 to 0x10FFFF and so fit in 21 bits; UTF-8
is then a set of rules for representing each of those integers as a
sequence of one to four 8-bit bytes or octets.
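
To make the "set of rules" concrete, here is a rough Python sketch of the
integer-to-bytes step.  The function name utf8_encode is my own invention
for illustration; in practice you would just call str.encode("utf-8").

```python
def utf8_encode(cp):
    """Encode one Unicode code point (an integer) as UTF-8 bytes.

    Illustrative sketch only; the name is hypothetical.
    """
    # Reject values outside the Unicode range and the surrogate block,
    # which UTF-8 does not encode.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point: %r" % (cp,))
    if cp < 0x80:
        # 7 bits: one byte, identical to ASCII
        return bytes([cp])
    if cp < 0x800:
        # up to 11 bits: two bytes, 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # up to 16 bits: three bytes, 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # up to 21 bits: four bytes, 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

For example, utf8_encode(0x41) gives the single byte 0x41, while
utf8_encode(0x20AC) (the euro sign) gives the three bytes E2 82 AC.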

> - What exactly is a "code point"?

A code point is an integer, which maps to a character in a coded
character set.  For example, the code point for the letter "A" in the
ASCII coded character set is 65 or 0x41.  For all characters that appear
in the ASCII repertoire, their code points in ASCII and in Unicode are
identical (modulo quibbles about <hyphen> versus <minus sign> versus
<hyphen-minus>, and <apostrophe> versus <left single quote>).
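
If it helps, the same point can be poked at from Python, where ord() and
chr() convert between a character and its Unicode code point.  This is
only a sketch; the helper name ascii_matches_unicode is made up for the
example.

```python
import unicodedata

def ascii_matches_unicode():
    """Check that every ASCII code point decodes to the Unicode
    character with the same code point (hypothetical helper)."""
    for cp in range(128):
        ch = bytes([cp]).decode("ascii")
        if ord(ch) != cp:
            return False
    return True

# The code point for "A" is 65, or 0x41, in both ASCII and Unicode.
code_point_of_A = ord("A")

# The "quibbles": Unicode names the ASCII characters 0x2D and 0x27
# HYPHEN-MINUS and APOSTROPHE, and gives the other hyphens, minus
# signs and quotes their own code points elsewhere.
name_2d = unicodedata.name(chr(0x2D))
name_27 = unicodedata.name(chr(0x27))
```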

> - What, exactly, do people mean by "normalization" in this context?

Do you represent <capital letter FOO with accent BAR> as a single
character, or as the two-character sequence <capital letter
FOO><combining accent BAR>?  What about <capital letter FOO with
accent BAR and accent BAZ>?  Is <ligature "ffi"> equivalent to <letter
"f"><letter "f"><letter "i">?  There are various types of normalisation
rules giving different answers to these and other questions.
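
As a rough illustration, Python's unicodedata.normalize() implements the
four standard normalisation forms (NFC, NFD, NFKC, NFKD).  The particular
characters below are just ones I picked to stand in for FOO, BAR and BAZ.

```python
import unicodedata

# <capital letter E with acute> as one character ...
composed = "\u00c9"
# ... or as <capital letter E><combining acute accent>.
decomposed = "E\u0301"

# NFD splits into base letter plus combining marks; NFC recombines.
nfd_of_composed = unicodedata.normalize("NFD", composed)
nfc_of_decomposed = unicodedata.normalize("NFC", decomposed)

# A letter with two accents: <capital A with circumflex and acute>
# decomposes to the base letter followed by both combining marks.
two_accents_nfd = unicodedata.normalize("NFD", "\u1ea4")

# The "ffi" ligature is left alone by the canonical forms (NFC, NFD)
# but is replaced by three letters under the compatibility forms.
ligature = "\ufb03"
ligature_nfc = unicodedata.normalize("NFC", ligature)
ligature_nfkc = unicodedata.normalize("NFKC", ligature)
```

So whether two strings "are the same" depends on which form you
normalise to before comparing them.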

--apb (Alan Barrett)
