Subject: Re: utf-8 and userland
To: None <tech-userlevel@NetBSD.org>
From: Dave Huang <khym@azeotrope.org>
List: tech-userlevel
Date: 03/12/2004 19:30:52

On Fri, Mar 12, 2004 at 08:02:45PM -0500, James K. Lowden wrote:
> On Sat, 13 Mar 2004, Noriyuki Soda <soda@sra.co.jp> wrote:
> > Yes, you can use iswprint(3) by converting the multibyte characters
> > to wide characters.
> 
> I don't see how that can be right.  iswprint(3) takes a wint_t argument;
> the UTF-8 character will be a sequence of 1-4 bytes.  Even if you redefine
> the argument, how is ls(1) supposed to know where the character boundaries
> are?  

I think he means you can use iswprint(3) _after_ converting the
multibyte characters to wide characters, not "by converting...". I.e.,
use mbtowc(3) to convert from UTF-8 to a wide character first, then
use iswprint(3) to check the result. mbtowc knows where the boundaries are.
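
To make that concrete, here is a minimal sketch of the loop. The helper
name mask_unprintable() is made up for illustration and is not what
ls(1) actually does; it also assumes the caller has already run
setlocale(LC_CTYPE, "") under a UTF-8 locale, since mbtowc(3) decodes
according to the current locale. Invalid sequences are handled by
consuming one byte and printing '?', which is just one possible policy.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: copy src into dst, replacing every character
 * that is invalid or not printable in the current locale with '?'.
 * Output is truncated to fit dstsize and always NUL-terminated
 * (when dstsize > 0).  Returns the number of characters replaced.
 */
size_t
mask_unprintable(const char *src, char *dst, size_t dstsize)
{
	size_t replaced = 0, used = 0, left = strlen(src);
	wchar_t wc;

	if (dstsize == 0)
		return 0;
	(void)mbtowc(NULL, NULL, 0);	/* reset any shift state */

	while (left > 0) {
		/* mbtowc finds the character boundary for us. */
		int n = mbtowc(&wc, src, left);
		int ok = (n > 0 && iswprint((wint_t)wc));

		if (n <= 0) {		/* invalid sequence: skip one byte */
			n = 1;
			(void)mbtowc(NULL, NULL, 0);
		}

		size_t out = ok ? (size_t)n : 1;
		if (used + out >= dstsize)
			break;		/* no room left; truncate */

		if (ok) {
			memcpy(dst + used, src, (size_t)n);
		} else {
			dst[used] = '?';
			replaced++;
		}
		used += out;
		src += n;
		left -= (size_t)n;
	}
	dst[used] = '\0';
	return replaced;
}
```

The caller never has to know how many bytes a character occupies: the
return value of mbtowc says how far to advance, and iswprint only ever
sees one whole wide character at a time.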

> It's my understanding that "wide characters" refer to a class of encodings
> that predate Unicode and UTF-8.  New times, new features....

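A wide character is just a wchar_t: a fixed-size type big enough to
hold any character in the locale's character set, rather than a
variable number of bytes. While wide characters may predate Unicode,
they're not obsolete or superseded by UTF-8. UTF-8 is a multibyte
encoding; wide characters are what you convert it to for processing.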

Semi-offtopic, but tcsh's builtin ls-F handles UTF-8 properly.
However, command-line editing doesn't work right. I posted a bug report
to the tcsh-bugs mailing list, but I think it was ignored...
-- 
Name: Dave Huang         |  Mammal, mammal / their names are called /
INet: khym@azeotrope.org |  they raise a paw / the bat, the cat /
FurryMUCK: Dahan         |  dolphin and dog / koala bear and hog -- TMBG
Dahan: Hani G Y+C 28 Y++ L+++ W- C++ T++ A+ E+ S++ V++ F- Q+++ P+ B+ PA+ PL++