tech-userlevel archive


Re: wide characters and i18n



On Fri, 16 Jul 2010 08:17:32 -0400
Ken Hornstein <kenh%pobox.com@localhost> wrote:


> - How, exactly, are UTF-8 and Unicode related? 
> - What exactly is a "code point"?
> - What, exactly, do people mean by "normalization" in this context?
> - How do locales interoperate with UTF-8/Unicode?
> - And, most importantly: what do I, as a programmer, need to do to
> make my application work with all of the above?  I read the posted
> Plan 9 link, and I guess that in some cases I need to deal with
> "Runes" (if I was programming on Plan 9), but it's still not exactly
> clear.

Have a look at the O'Reilly book "Unicode Explained" if you want to
know what Unicode is. You may need to read it a few times to fully
understand it; I'm still in the process of reading it myself.

From what I understand:

UTF-8 is an encoding, i.e. a particular way to represent Unicode
characters as a sequence of octets (bytes). There are other encodings,
such as UTF-16 and UTF-32; they all represent the same Unicode
characters but encode them differently, as sequences of 16-bit or
32-bit units respectively.

UTF-8 is a variable-length encoding, meaning that some characters are
represented as 1 byte, some as 2 bytes, and so on up to 4 bytes. Many
people like UTF-8 because ASCII characters are encoded exactly as in
plain ASCII, so it does not break older software, and because the
encoding is independent of byte order, i.e. it's just a sequence of
bytes.

With UTF-16 and UTF-32 you need to know whether the data was encoded
in big- or little-endian byte order; with UTF-8 you don't.
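
For example, a tiny C program (just a sketch) shows that the ASCII
letters stay single bytes while an accented letter takes two, and that
the whole thing is nothing more than a byte sequence:

    /* Prints the individual bytes of a UTF-8 string.  The escape
     * "\xc3\xa9" is the UTF-8 encoding of U+00E9 (e with acute). */
    #include <stdio.h>

    int
    main(void)
    {
        const unsigned char s[] = "abc\xc3\xa9";
        size_t i;

        for (i = 0; s[i] != '\0'; i++)
            printf("byte %zu: 0x%02x\n", i, s[i]);
        return 0;
    }

Running it prints 0x61 0x62 0x63 for "abc" and then 0xc3 0xa9 for the
accented e, in the same order on any machine.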

I think a code point is the unique number assigned to each character
(or location) in Unicode. In ASCII you have code points from 0 to 127;
Unicode has many more (up to U+10FFFF).
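
As far as I can tell, the number of UTF-8 bytes a character needs
follows directly from where its code point lies, roughly like this
(utf8_len is just a name I made up for the illustration):

    /* The standard UTF-8 length ranges; surrogates (U+D800..U+DFFF)
     * are not valid code points and are not treated specially here. */
    #include <stdint.h>

    static int
    utf8_len(uint32_t cp)
    {
        if (cp <= 0x7F)        /* U+0000..U+007F: the ASCII range */
            return 1;
        if (cp <= 0x7FF)       /* U+0080..U+07FF */
            return 2;
        if (cp <= 0xFFFF)      /* U+0800..U+FFFF */
            return 3;
        if (cp <= 0x10FFFF)    /* U+10000..U+10FFFF, the top of Unicode */
            return 4;
        return -1;             /* not a code point at all */
    }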

I think locales are independent of Unicode, i.e. locales can support
other character sets, not just Unicode. The locale machinery was
developed before Unicode became widespread.
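
One way to see that, if I'm not mistaken, is that the locale (not the
program) decides which encoding multibyte strings use, and
nl_langinfo(CODESET) will tell you what it is:

    #include <langinfo.h>
    #include <locale.h>
    #include <stdio.h>

    int
    main(void)
    {
        setlocale(LC_ALL, "");   /* adopt the user's locale settings */
        printf("codeset: %s\n", nl_langinfo(CODESET));
        /* prints e.g. "UTF-8" in a UTF-8 locale, "ISO8859-1" in a
         * Latin-1 locale */
        return 0;
    }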

If you program in C on Unix, then I think using wchar_t is the most
sensible approach. The wide-character routines in the C library take
care of string/character comparison and of converting between
multibyte strings and wchar_t.
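
Something along these lines is what I have in mind (only a sketch, and
it assumes the program runs in a UTF-8 locale):

    /* Convert a multibyte (here UTF-8) string to wide characters,
     * then count and index whole characters instead of bytes. */
    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <wchar.h>

    int
    main(void)
    {
        const char *mb = "caf\xc3\xa9";   /* "café" encoded in UTF-8 */
        wchar_t wide[64];
        size_t n;

        setlocale(LC_ALL, "");            /* conversion follows the locale */
        n = mbstowcs(wide, mb, sizeof(wide) / sizeof(wide[0]));
        if (n == (size_t)-1) {
            fprintf(stderr, "invalid multibyte sequence\n");
            return 1;
        }
        printf("%zu characters, %zu bytes\n", n, strlen(mb));
        /* wide[3] now holds the single character U+00E9, even though
         * it took two bytes in the multibyte string. */
        return 0;
    }

In a UTF-8 locale this prints "4 characters, 5 bytes".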

Some people use UTF-8 internally in their programs, but I'm not sure
how easy it is to handle UTF-8 strings, because each character can be
1 to 4 bytes long. You cannot simply extract the Nth character with
'character = utf8_string[N]' because of the variable-length
encoding.
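
If you do want to index into a UTF-8 string, my understanding is that
you have to walk it and count only the lead bytes (everything that is
not of the form 10xxxxxx); utf8_index below is just a made-up name for
such a helper:

    #include <stddef.h>

    /* Return a pointer to the start of the nth (0-based) character,
     * or NULL if the string has fewer characters than that. */
    static const char *
    utf8_index(const char *s, size_t n)
    {
        while (*s != '\0') {
            if (((unsigned char)*s & 0xC0) != 0x80) {  /* lead byte */
                if (n == 0)
                    return s;
                n--;
            }
            s++;
        }
        return NULL;
    }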

