tech-userlevel archive


Re: wide characters and i18n

On Fri, 16 Jul 2010 14:53:44 +0100
Sad Clouds wrote:

> Utf-8 is a variable length encoding, meaning that some characters are
> represented as 1-byte, some as 2-bytes, and so on. The reason why many
> people like utf-8 is because ascii characters are encoded in the same
> way, which does not break older software and because utf-8 encoding is
> independent of byte-order, i.e. it's just a sequence of bytes.

Other reasons why UTF-8 is convenient: C strings can still be
NUL-terminated ('\0', 0), because no character other than U+0000 is
ever encoded with a zero byte, and the representation is compact for
languages that use few or no non-ASCII characters (although it can be
considered bloated for some other languages, unfortunately).
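
To illustrate the NUL-termination point, here is a minimal sketch. The
names has_embedded_nul(), utf8_sample and utf32_sample are made up for
this example. It checks a byte range for zero bytes, once over a UTF-8
string and once over the first UTF-32 code unit of the same word.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the first n bytes of buf contain
 * a zero byte, 0 otherwise. */
int has_embedded_nul(const void *buf, size_t n)
{
    return memchr(buf, 0, n) != NULL;
}

/* The word "naive" spelled with i-diaeresis (U+00EF), encoded as UTF-8.
 * The non-ASCII character takes two bytes: 0xC3 0xAF. */
const char utf8_sample[] = "na\xC3\xAFve";

/* The same word as UTF-32 code units, terminated by a zero unit. */
const uint32_t utf32_sample[] = { 0x6E, 0x61, 0xEF, 0x76, 0x65, 0 };
```

With these definitions, strlen(utf8_sample) counts six bytes for five
characters and has_embedded_nul(utf8_sample, strlen(utf8_sample)) is 0,
so the usual str*() functions keep working on the byte level. By
contrast, has_embedded_nul(utf32_sample, sizeof utf32_sample[0]) is 1
whatever the byte order, which is why UTF-32 data cannot be handed to
functions that expect a NUL-terminated char string.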

> Some people use utf-8 internally in their programs, but I'm not sure
> how easy it is to handle utf-8 strings, because each character could be
> 1, 2, or 3 bytes in length. You can not simply extract Nth character
> with 'character = utf8_string[N]' because of the variable length
> encoding.

I've seen functions such as utf8_strlen() and the like in some code,
but I also prefer working with a UCS-4/UTF-32 host-endian
representation internally, and only using UTF-8 as a convenient
external representation.
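
As a rough sketch of both approaches (not taken from any particular
code base): utf8_strlen() below is one possible way such a function
could be written, and utf8_to_utf32() is a hypothetical name for a
decoder producing host-endian UTF-32. Note that a UTF-8 sequence can be
up to four bytes long. The decoder checks the sequence structure only;
it does not reject overlong forms or surrogate code points, which a
production decoder should do.

```c
#include <stddef.h>
#include <stdint.h>

/* Count code points in a NUL-terminated UTF-8 string by counting every
 * byte that is not a continuation byte (10xxxxxx).
 * Assumes the input is valid UTF-8. */
size_t utf8_strlen(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Decode NUL-terminated UTF-8 into host-endian UTF-32, terminated by a
 * zero code unit. dst_len is the capacity of dst in code units.
 * Returns the number of code points written (terminator excluded), or
 * (size_t)-1 on a malformed sequence or if dst is too small. */
size_t utf8_to_utf32(const char *src, uint32_t *dst, size_t dst_len)
{
    const unsigned char *p = (const unsigned char *)src;
    size_t n = 0;

    if (dst_len == 0)
        return (size_t)-1;

    while (*p != 0) {
        uint32_t cp;
        int extra;      /* number of continuation bytes expected */

        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else
            return (size_t)-1;  /* stray continuation or invalid lead byte */
        p++;

        while (extra-- > 0) {
            /* A terminating NUL is not a continuation byte, so a
             * truncated sequence is caught here as well. */
            if ((*p & 0xC0) != 0x80)
                return (size_t)-1;
            cp = (cp << 6) | (*p & 0x3F);
            p++;
        }

        if (n + 1 >= dst_len)   /* keep room for the terminator */
            return (size_t)-1;
        dst[n++] = cp;
    }
    dst[n] = 0;
    return n;
}
```

Once the text is in a uint32_t array, the Nth character really is
dst[N], at the cost of four bytes per character.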
