Subject: Re: utf-8 and userland
To: None <tech-userlevel@NetBSD.org>
From: Dave Huang <khym@azeotrope.org>
List: tech-userlevel
Date: 03/13/2004 17:32:29
On Sat, Mar 13, 2004 at 06:03:00PM -0500, James K. Lowden wrote:
> Last I heard, the ANSI definition of "multibyte character" for mbtowc(3)
> was something other than UTF-8.  How does mbtowc(3) know its input is
> UTF-8?  And what is its output then, UCS-2?  

http://www.opengroup.org/onlinepubs/007908799/xsh/mbtowc.html says
that "The behaviour of this function is affected by the LC_CTYPE
category of the current locale." That's how it tells... if LC_CTYPE is
en_US.UTF-8, mbtowc converts from UTF-8. If it's zh_TW.Big5, it
converts from Big5.
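
A minimal sketch of that locale dependence, counting the characters in a
multibyte string with mbtowc(3). The helper name count_mb_chars is
hypothetical, and the locale names in the usage note are assumptions that
may not be installed on a given system:

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: count the characters in the multibyte string s,
 * interpreting the bytes according to the current LC_CTYPE locale.
 * Returns (size_t)-1 if s holds a sequence that is invalid in that locale.
 */
size_t count_mb_chars(const char *s)
{
    size_t count = 0;
    size_t left = strlen(s);

    (void)mbtowc(NULL, NULL, 0);        /* reset any shift state */
    while (left > 0) {
        wchar_t wc;
        int len = mbtowc(&wc, s, left); /* bytes used by one character */

        if (len < 0)
            return (size_t)-1;          /* invalid in the current locale */
        if (len == 0)
            break;                      /* embedded NUL; not expected here */
        s += len;
        left -= (size_t)len;
        count++;
    }
    return count;
}
```

After a successful setlocale(LC_CTYPE, "en_US.UTF-8"), the two bytes
"\xc3\xa9" should count as one character. The function itself never
names an encoding; only the locale decides how the bytes are read.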

The output is a wide character, which is an implementation-defined
type. I don't know exactly what NetBSD's libc uses for wide
characters, but it looks to me like UCS-4. However, the Citrus
Project's web page at http://citrus.bsdclub.org/ mentions that, "...
design contraints of the class 'Encoding must be ISO 2022' or
'Encoding must be UCS4' are not acceptible." I don't know if that has
any bearing on whether wchar_t is a UCS-4 character or not :)
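
One way to ask an implementation about this is the C99 macro
__STDC_ISO_10646__, which is defined only when wchar_t values are
promised to be ISO 10646 code points. A small sketch follows; the
function names are hypothetical, and an undefined macro only means that
no promise is made, not that wchar_t is something other than UCS-4:

```c
#include <limits.h>
#include <stdio.h>
#include <wchar.h>

/* Returns 1 if the implementation promises that wchar_t holds
 * ISO 10646 code points, 0 if it makes no such promise. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Print the width of wchar_t and whether the promise above is made. */
void describe_wchar(void)
{
    printf("wchar_t: %zu bits, ISO 10646 promised: %s\n",
           sizeof(wchar_t) * CHAR_BIT,
           wchar_is_iso10646() ? "yes" : "no");
}
```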

> Not technically, no.  But in practice, UTF-8 is much more attractive
> because can encode everything dreamt up thus far.  It's been adopted by
> XML and IMAP, just to name two, not to mention that it's the default Red
> Hat installation.  FWIW, I think it's header for ubiquity.  

By definition, wide characters can encode anything the system is
capable of handling. It's often more convenient to have all characters be
a fixed width--moving a pointer to the next character is as simple as
"p++"; moving to the previous character is as simple as "p--".
However, using 32 bits per character can bloat your files. Hence the C
library gives you a choice of space-efficient multibyte characters and
fixed-width wide characters. Both encodings can represent the same
characters, so there's no advantage/disadvantage in that arena.
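
To illustrate the pointer arithmetic, here is a sketch comparing "move to
the previous character" in the two forms. It assumes the multibyte
encoding is UTF-8, where continuation bytes have the bit pattern
10xxxxxx; other multibyte encodings would need different logic. Both
function names are hypothetical:

```c
#include <stddef.h>
#include <wchar.h>

/* Wide string: the previous character is always one element back. */
const wchar_t *wide_prev(const wchar_t *p)
{
    return p - 1;
}

/*
 * UTF-8 string (assumed encoding): step back one byte, then keep going
 * while we are on a continuation byte (10xxxxxx), never moving before
 * start. Returns start unchanged if p is already at the beginning.
 */
const char *utf8_prev(const char *start, const char *p)
{
    if (p == start)
        return start;
    do {
        p--;
    } while (p > start && ((unsigned char)*p & 0xC0) == 0x80);
    return p;
}
```

The wide version is a single decrement, while the UTF-8 version has to
inspect bytes and needs to know where the buffer begins.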
-- 
Name: Dave Huang         |  Mammal, mammal / their names are called /
INet: khym@azeotrope.org |  they raise a paw / the bat, the cat /
FurryMUCK: Dahan         |  dolphin and dog / koala bear and hog -- TMBG
Dahan: Hani G Y+C 28 Y++ L+++ W- C++ T++ A+ E+ S++ V++ F- Q+++ P+ B+ PA+ PL++