Re: wide characters and i18n

To: tech-userlevel%netbsd.org@localhost
Subject: Re: wide characters and i18n
From: Matthew Mondor <mm_lists%pulsar-zone.net@localhost>
Date: Fri, 16 Jul 2010 11:30:04 -0400

On Fri, 16 Jul 2010 14:53:44 +0100
Sad Clouds <cryintothebluesky%googlemail.com@localhost> wrote:

> Utf-8 is a variable length encoding, meaning that some characters are
> represented as 1-byte, some as 2-bytes, and so on. The reason why many
> people like utf-8 is because ascii characters are encoded in the same
> way, which does not break older software and because utf-8 encoding is
> independent of byte-order, i.e. it's just a sequence of bytes.

Other reasons why UTF-8 is convenient are that C strings can still be
NUL ('\0', 0) terminated, since no multi-byte sequence contains a zero
byte, and that the representation is compact for languages which use
few or no non-ASCII characters (although it can be considered bloated
for some other languages, unfortunately, such as CJK text where most
characters take three bytes).
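
To make the size trade-off concrete, here is a rough sketch of how many
bytes UTF-8 needs per code point. The function name utf8_encoded_len()
is made up for illustration; the ranges are the standard UTF-8 ones
(1 to 4 bytes per code point).

```c
#include <stdint.h>

/* Hypothetical helper: number of bytes UTF-8 needs for one code point,
 * or 0 if cp is not a valid Unicode scalar value. */
int utf8_encoded_len(uint32_t cp)
{
    if (cp < 0x80)
        return 1;               /* ASCII, same byte as in plain ASCII */
    if (cp < 0x800)
        return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogates are not encodable */
    if (cp < 0x10000)
        return 3;
    if (cp <= 0x10FFFF)
        return 4;
    return 0;                   /* beyond the Unicode range */
}
```

So ASCII text costs one byte per character against four in UTF-32,
while something like U+20AC (the euro sign) costs three.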

> Some people use utf-8 internally in their programs, but I'm not sure
> how easy it is to handle utf-8 strings, because each character could be
> 1, 2, or 3 bytes in length. You can not simply extract Nth character
> with 'character = utf8_string[N]' because of the variable length
> encoding.

I've seen functions such as utf8_strlen() and the like in some code,
but I also prefer working with a UCS-4/UTF-32 host-endian
representation internally, and using UTF-8 only as a convenient
external representation.
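
As a minimal sketch of both approaches (not taken from any particular
library): a utf8_strlen() that counts code points in place, and a
decoder that converts UTF-8 into a host-endian UTF-32 array. The name
utf8_to_utf32(), its signature and its error convention are my own
assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count code points in a NUL-terminated UTF-8
 * string by counting every byte that is not a continuation byte
 * (10xxxxxx).  Assumes the input is already valid UTF-8. */
size_t utf8_strlen(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Hypothetical helper: decode NUL-terminated UTF-8 into host-endian
 * UTF-32.  Writes at most max code points to out and returns the
 * number written, or (size_t)-1 on malformed input or if out is too
 * small. */
size_t utf8_to_utf32(const char *s, uint32_t *out, size_t max)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    assert(s != NULL && out != NULL);
    while (*p != 0) {
        uint32_t cp, min;
        int extra;

        if (*p < 0x80)                { cp = *p;        extra = 0; min = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; min = 0x80; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; min = 0x800; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; min = 0x10000; }
        else
            return (size_t)-1;  /* stray continuation or invalid lead byte */
        p++;
        while (extra-- > 0) {
            if ((*p & 0xC0) != 0x80)    /* also catches a premature NUL */
                return (size_t)-1;
            cp = (cp << 6) | (*p++ & 0x3F);
        }
        /* Reject overlong forms, surrogates and out-of-range values. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;
        if (n == max)
            return (size_t)-1;
        out[n++] = cp;
    }
    return n;
}
```

Once the text is in the uint32_t array, out[N] gives the Nth code point
directly, which is the kind of indexing that does not work on the UTF-8
bytes themselves.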
-- 
Matt

