tech-userlevel: Re: utf-8 and userland

Subject: Re: utf-8 and userland
To: Dave Huang <khym@azeotrope.org>
From: None <itojun@iijlab.net>
List: tech-userlevel
Date: 03/14/2004 13:30:19

>On Sat, Mar 13, 2004 at 06:03:00PM -0500, James K. Lowden wrote:
>> Last I heard, the ANSI definition of "multibyte character" for mbtowc(3)
>> was something other than UTF-8.  How does mbtowc(3) know its input is
>> UTF-8?  And what is its output then, UCS-2?  
>
>http://www.opengroup.org/onlinepubs/007908799/xsh/mbtowc.html says
>that "The behaviour of this function is affected by the LC_CTYPE
>category of the current locale." That's how it tells... if LC_CTYPE is
>en_US.UTF-8, mbtowc converts from UTF-8. If it's zh_TW.Big5, it
>converts from Big5.
>
>The output is a wide character, which is an implementation-defined
>type. I don't know exactly what NetBSD's libc uses for wide
>characters, but it looks to me like UCS-4. However, the Citrus
>Project's web page at http://citrus.bsdclub.org/ mentions that, "...
>design constraints of the class 'Encoding must be ISO 2022' or
>'Encoding must be UCS4' are not acceptable." I don't know if that has
>any bearing on whether wchar_t is a UCS-4 character or not :)

	wchar_t has to be handled as opaque data; i.e. you should not assume
	any particular encoding.  If you need to test character properties,
	iswprint() and the other isw*() functions are available.

itojun
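
To illustrate the points made in this thread, here is a minimal sketch in C. It decodes a multibyte string with mbtowc(3) under whatever LC_CTYPE is currently in effect, and classifies each resulting wchar_t with iswprint(3) only, never looking at its numeric value. The helper name count_printable is hypothetical and is not part of any libc; treat this as an illustration, not as how NetBSD's libc works internally.

```c
#include <locale.h>   /* setlocale, for callers */
#include <stdlib.h>   /* mbtowc */
#include <string.h>   /* strlen */
#include <wchar.h>    /* wchar_t, wint_t */
#include <wctype.h>   /* iswprint */

/*
 * Hypothetical helper: count the printable characters in the multibyte
 * string s, decoded according to the LC_CTYPE category of the current
 * locale.  Returns -1 if s contains an invalid multibyte sequence.
 */
long
count_printable(const char *s)
{
	size_t left = strlen(s);
	long count = 0;
	wchar_t wc;
	int n;

	/* Reset the conversion state, which matters for stateful encodings. */
	(void)mbtowc(NULL, NULL, 0);

	while (left > 0) {
		n = mbtowc(&wc, s, left);
		if (n < 0)
			return -1;	/* invalid or incomplete sequence */
		if (n == 0)
			break;		/* reached a null character */

		/*
		 * wc is opaque: ask the library about it instead of
		 * comparing it against UCS-4 (or any other) code points.
		 */
		if (iswprint((wint_t)wc))
			count++;

		s += n;
		left -= (size_t)n;
	}
	return count;
}
```

A program starts in the "C" locale, so a caller that wants its input decoded per the user's environment (for example as UTF-8 under en_US.UTF-8) has to call setlocale(LC_CTYPE, "") before calling the helper. The same code then works unchanged for Big5 or any other encoding the locale database supports, which is the reason for not assuming anything about the contents of wchar_t.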