Subject: Re: CVS commit: src/sys/dev/usb
To: Bill Studenmund <wrstuden@netbsd.org>
From: Tom Spindler <dogcow@babymeat.com>
List: tech-kern
Date: 02/26/2007 15:45:30
> >   Please note that fs/unicode.h does not handle UTF-16 surrogates
> > correctly.  What's worse, the API does not allow this to be fixed.
> > 
> >   (Unicode defines more characters than fit in a 16 bit int.  In
> > UTF-16, a character with a code above 0xffff is represented as two
> > surrogate values.  In UTF-8, it is encoded as a 5 byte sequence.
> > Encoding/decoding one 16 bit value at a time does not allow for this
> > conversion to be done correctly.)

Huh? UTF-8 encodes 0x10000-0x10ffff in four bytes, not five.
CESU-8, on the other hand, encodes each surrogate pair as six bytes
(three per surrogate), but its usage is discouraged; see
http://unicode.org/faq/utf_bom.html#30
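
To make the byte counts concrete, here is a rough sketch in plain C. It
is not the fs/unicode.h code; the names utf8_encode, utf16_combine and
cesu8_encode_pair are made up for illustration. It shows why a converter
has to see both surrogates at once to produce the four-byte UTF-8 form,
and what the six-byte CESU-8 form looks like by comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode one Unicode scalar value as UTF-8.
 * Returns the number of bytes written (1-4), or 0 if cp is not a
 * valid scalar value (a lone surrogate or above 0x10ffff).
 */
size_t
utf8_encode(uint32_t cp, uint8_t out[4])
{
	if (cp < 0x80) {
		out[0] = (uint8_t)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (uint8_t)(0xC0 | (cp >> 6));
		out[1] = (uint8_t)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return 0;	/* surrogates are not scalar values */
		out[0] = (uint8_t)(0xE0 | (cp >> 12));
		out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (uint8_t)(0x80 | (cp & 0x3F));
		return 3;
	}
	if (cp <= 0x10FFFF) {
		out[0] = (uint8_t)(0xF0 | (cp >> 18));
		out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (uint8_t)(0x80 | (cp & 0x3F));
		return 4;
	}
	return 0;
}

/*
 * Hypothetical helper: combine a UTF-16 high/low surrogate pair into
 * the code point it represents.  This needs both 16 bit values at once.
 */
uint32_t
utf16_combine(uint16_t hi, uint16_t lo)
{
	assert(hi >= 0xD800 && hi <= 0xDBFF);
	assert(lo >= 0xDC00 && lo <= 0xDFFF);
	return 0x10000 +
	    (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}

/*
 * Hypothetical helper: CESU-8 encodes each surrogate separately as a
 * three-byte sequence, giving six bytes per pair.  Always returns 6.
 */
size_t
cesu8_encode_pair(uint16_t hi, uint16_t lo, uint8_t out[6])
{
	const uint16_t unit[2] = { hi, lo };
	size_t i;

	for (i = 0; i < 2; i++) {
		out[3 * i + 0] = (uint8_t)(0xE0 | (unit[i] >> 12));
		out[3 * i + 1] = (uint8_t)(0x80 | ((unit[i] >> 6) & 0x3F));
		out[3 * i + 2] = (uint8_t)(0x80 | (unit[i] & 0x3F));
	}
	return 6;
}
```

For example, the pair 0xd83d 0xde00 combines to 0x1f600, which
utf8_encode turns into f0 9f 98 80, while cesu8_encode_pair on the same
pair gives ed a0 bd ed b8 80.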