tech-userlevel: Re: utf-8 and userland

Subject: Re: utf-8 and userland
To: None <tech-userlevel@NetBSD.org>
From: James K. Lowden <jklowden@schemamania.org>
List: tech-userlevel
Date: 03/13/2004 18:03:00

On Fri, 12 Mar 2004 19:30:52 -0600, Dave Huang <khym@azeotrope.org> wrote:
> On Fri, Mar 12, 2004 at 08:02:45PM -0500, James K. Lowden wrote:
> > On Sat, 13 Mar 2004, Noriyuki Soda <soda@sra.co.jp> wrote:
> > > Yes, you can use iswprint(3) by converting the multibyte characters
> > > to wide characters.
> > 
> > I don't see how that can be right.  iswprint(3) takes a wint_t
> > argument; the UTF-8 character will be a sequence of 1-4 bytes.  Even
> > if you redefine the argument, how is ls(1) supposed to know where the
> > character boundaries are?  
> 
> I think he means you can use iswprint(3) _after_ converting the
> multibyte characters to wide characters, not "by converting...". I.e.,
> use mbtowc(3) to convert from UTF-8 to a wide character first, then
> use iswprint(3) to check the result. mbtowc knows where the boundaries
> are.

Last I heard, the ANSI definition of "multibyte character" for mbtowc(3)
was something other than UTF-8.  How does mbtowc(3) know its input is
UTF-8?  And what is its output then, UCS-2?  
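
For illustration, here is a minimal sketch of the convert-then-test
approach described in the quoted text: walk a multibyte string with
mbtowc(3) and check each resulting wide character with iswprint(3).
The function name count_printable is hypothetical and not from any
ls(1) source.  The sketch assumes the caller has already selected a
locale, since mbtowc(3) decodes according to the current LC_CTYPE
setting; whether a given system ships a UTF-8 locale is an assumption
to verify locally.  The encoding of the resulting wchar_t values is
implementation-defined.

```c
#include <stdlib.h>   /* mbtowc */
#include <string.h>   /* strlen */
#include <wchar.h>    /* wchar_t, wint_t */
#include <wctype.h>   /* iswprint */

/*
 * Hypothetical helper: count the printable characters in the multibyte
 * string s, decoding it according to the current LC_CTYPE locale.
 * Returns -1 if s contains an invalid or incomplete multibyte sequence.
 */
int count_printable(const char *s)
{
    size_t left = strlen(s);   /* bytes remaining, excluding the NUL */
    int printable = 0;

    /* Reset any shift state left over from earlier conversions. */
    (void)mbtowc(NULL, NULL, 0);

    while (left > 0) {
        wchar_t wc;
        /* mbtowc finds the character boundary: it returns the number
           of bytes consumed by the next multibyte character. */
        int len = mbtowc(&wc, s, left);

        if (len < 0)
            return -1;         /* invalid or truncated sequence */
        if (len == 0)
            break;             /* NUL character; defensive only */

        if (iswprint((wint_t)wc))
            printable++;

        s += len;
        left -= (size_t)len;
    }
    return printable;
}
```

A caller would typically run setlocale(LC_CTYPE, "") once at startup so
that the conversion follows the user's environment.  Without that call
the program stays in the "C" locale, where only plain ASCII input is
portable.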

I came to this late in the game, and I'm only familiar with iconv, which I
view as a generalized implementation of mbtowc & Co.  

> A wide character is just a character that's bigger than a byte. While
> they may predate Unicode, they're not obsolete or superseded by UTF-8.

Not technically, no.  But in practice, UTF-8 is much more attractive
because it can encode everything dreamt up thus far.  It's been adopted by
XML and IMAP, just to name two, not to mention that it's the default in
Red Hat installations.  FWIW, I think it's headed for ubiquity.  

--jkl