Subject: Re: CVS commit: src/sys/dev/usb
To: Bill Studenmund <wrstuden@netbsd.org>
From: Dieter Baron <dillo@danbala.tuwien.ac.at>
List: tech-kern
Date: 03/01/2007 09:13:56
hi,

> >   Please note that fs/unicode.h does not handle UTF-16 surrogates
> > correctly.  What's worse, the API does not allow this to be fixed.
> >
> >   (Unicode defines more characters than fit in a 16 bit int.  In
> > UTF-16, a character with a code above 0xffff is represented as two
> > surrogate values.  In UTF-8, it is encoded as a 4 byte sequence.
> > Encoding/decoding one 16 bit value at a time does not allow for this
> > conversion to be done correctly.)
>
> Please feel free to suggest ways that this should be fixed. Patches are
> best!
>
> We all would like better unicode handling, and AFAIK no one is wedded to
> the existing interface.

  Okay, here is what I currently use in the HFS+ implementation
(netbsd-soc.cvs.sf.net:/cvsroot/netbsd-soc hfs/hfsp/unicode.[ch]):

#define UNICODE_DECOMPOSE    0x01  /* convert to decomposed NF */
#define UNICODE_PRECOMPOSE   0x02  /* convert to precomposed NF */

size_t utf8_to_utf16(uint16_t *out, size_t outlen,
		     const char *in, size_t inlen,
		     int flags, int *errcountp);

  Converts the UTF-8 string IN to UTF-16 and stores at most OUTLEN
words in OUT.  FLAGS may be one of the above to convert the string to
normal form during conversion.  If ERRCOUNTP is non-NULL, the number
of unconvertible input bytes is stored there.  The function returns
the number of words needed to hold the complete converted string.  OUT
may be NULL, in which case the converted string is discarded.
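
  To make the calling convention concrete, here is a rough sketch of a
stand-in with the same signature.  It is not the code from the HFS+
tree: the name sketch_utf8_to_utf16 is made up for illustration, FLAGS
is accepted but ignored (no normalization), and counting every byte of
a rejected sequence as one error is my assumption about what
"unconvertible input bytes" should mean.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for utf8_to_utf16(); FLAGS is ignored.
 * Returns the number of 16-bit words the full result needs, and
 * stores at most OUTLEN of them in OUT (which may be NULL).
 */
size_t
sketch_utf8_to_utf16(uint16_t *out, size_t outlen,
		     const char *in, size_t inlen,
		     int flags, int *errcountp)
{
	const unsigned char *p = (const unsigned char *)in;
	size_t i = 0, n = 0;
	int errs = 0;

	assert(in != NULL || inlen == 0);
	(void)flags;

	while (i < inlen) {
		unsigned char c = p[i];
		uint32_t cp, min;
		size_t len, k;

		/* Classify the lead byte; MIN is the smallest code
		 * point that legitimately needs LEN bytes. */
		if (c < 0x80) {
			cp = c; len = 1; min = 0;
		} else if ((c & 0xE0) == 0xC0) {
			cp = c & 0x1F; len = 2; min = 0x80;
		} else if ((c & 0xF0) == 0xE0) {
			cp = c & 0x0F; len = 3; min = 0x800;
		} else if ((c & 0xF8) == 0xF0) {
			cp = c & 0x07; len = 4; min = 0x10000;
		} else {
			errs++; i++; continue;	/* stray or invalid byte */
		}
		if (len > inlen - i) {
			errs++; i++; continue;	/* truncated sequence */
		}
		for (k = 1; k < len; k++) {
			if ((p[i + k] & 0xC0) != 0x80)
				break;
			cp = (cp << 6) | (p[i + k] & 0x3F);
		}
		/* Reject bad continuation, overlong forms, out-of-range
		 * values and encoded surrogates. */
		if (k < len || cp < min || cp > 0x10FFFF ||
		    (cp >= 0xD800 && cp <= 0xDFFF)) {
			errs++; i++; continue;
		}
		i += len;

		if (cp >= 0x10000) {
			/* Needs a surrogate pair: two output words. */
			cp -= 0x10000;
			if (out != NULL && n < outlen)
				out[n] = (uint16_t)(0xD800 | (cp >> 10));
			n++;
			if (out != NULL && n < outlen)
				out[n] = (uint16_t)(0xDC00 | (cp & 0x3FF));
			n++;
		} else {
			if (out != NULL && n < outlen)
				out[n] = (uint16_t)cp;
			n++;
		}
	}
	if (errcountp != NULL)
		*errcountp = errs;
	return n;
}
```

  Because the return value is the size needed rather than the size
written, a caller can pass OUT == NULL first to learn the length,
allocate, and then call again with a real buffer.  Note that this
sketch may store only the first half of a surrogate pair if OUTLEN
runs out in between; a real implementation would have to decide what
to do there.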

size_t utf16_to_utf8(char *out, size_t outlen,
		     const uint16_t *in, size_t inlen,
		     int flags, int *errcountp);

  This has the same calling convention as utf8_to_utf16, but converts
from UTF-16 to UTF-8.
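
  The reason this direction needs to see more than one 16 bit value at
a time is the surrogate pair: both halves must be combined into one
code point before any UTF-8 byte can be produced.  A minimal sketch of
that step follows; the helper names surrogates_to_codepoint and
codepoint_to_utf8 are hypothetical and not part of the interface
above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combine a valid high/low surrogate pair into one code point. */
uint32_t
surrogates_to_codepoint(uint16_t hi, uint16_t lo)
{
	assert(hi >= 0xD800 && hi <= 0xDBFF);
	assert(lo >= 0xDC00 && lo <= 0xDFFF);
	return 0x10000 +
	    (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}

/*
 * Encode CP as UTF-8 into OUT and return the number of bytes used
 * (1 to 4).  CP is assumed to be a valid code point that is not
 * itself a surrogate.
 */
size_t
codepoint_to_utf8(uint32_t cp, unsigned char out[4])
{
	assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	/* Everything above 0xffff takes four bytes. */
	out[0] = (unsigned char)(0xF0 | (cp >> 18));
	out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
	out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
	out[3] = (unsigned char)(0x80 | (cp & 0x3F));
	return 4;
}
```

  For example, the pair 0xD834 0xDD1E combines to u+1D11E, which
encodes as the four bytes F0 9D 84 9E.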


  A note on Unicode normal form: Unicode allows some sequences of
characters to be represented by multiple, equivalent forms. For
example, the character é can be represented as the single Unicode
character u+00E9 (latin small letter e with acute), or as the two
Unicode characters u+0065 and u+0301 (the letter "e" plus a combining
acute symbol).  HFS+ requires file names to be stored in decomposed
form (u+0065 u+0301), while, IIUC, NTFS requires them to be in
pre-composed form (u+00E9).  [I have not yet implemented
(de)composition.]
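
  To illustrate what decomposition amounts to for a single character,
here is a toy table lookup.  This is only a sketch under strong
assumptions: the four-entry table and the name decompose_char are
invented for the example, a real table would be generated from the
Unicode character database, and full normalization also has to
reorder combining marks, which is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One precomposed character and its two-character decomposition. */
struct decomp {
	uint16_t composed;
	uint16_t base;
	uint16_t mark;
};

/* Illustrative excerpt only; not a complete table. */
static const struct decomp decomp_table[] = {
	{ 0x00E8, 0x0065, 0x0300 },	/* e + combining grave */
	{ 0x00E9, 0x0065, 0x0301 },	/* e + combining acute */
	{ 0x00F1, 0x006E, 0x0303 },	/* n + combining tilde */
	{ 0x00FC, 0x0075, 0x0308 },	/* u + combining diaeresis */
};

/*
 * Store the decomposed form of C in OUT and return its length in
 * words (1 if C has no entry in the table, 2 otherwise).
 */
size_t
decompose_char(uint16_t c, uint16_t out[2])
{
	size_t i;

	assert(out != NULL);
	for (i = 0; i < sizeof(decomp_table) / sizeof(decomp_table[0]); i++) {
		if (decomp_table[i].composed == c) {
			out[0] = decomp_table[i].base;
			out[1] = decomp_table[i].mark;
			return 2;
		}
	}
	out[0] = c;
	return 1;
}
```

  The point for the conversion API is that the output can be longer
than the input, which is one more reason to return the needed length
rather than assume a fixed ratio.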


  I'm not sure how to handle invalid input.

    -) fs/unicode.h assumes invalid UTF-8 sequences to be ISO 8859-1
       (Latin 1).  NB: ISO 8859-1 text has a very low likelihood of being
       valid UTF-8.

    -) What about UTF-16 surrogates that are not paired?

    -) What about overlong UTF-8 encodings (encoding a character in
       more bytes than necessary)?  The standard forbids decoding
       these, and they are unlikely to be meant as ISO 8859-1.

    -) How do we want the file systems to deal with invalid input?
       Drop the offending bytes/words, signal an error (EINVAL)?
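
  To show the two options from the last item side by side, here is a
sketch for the unpaired-surrogate case.  The policy enum and the name
utf16_sanitize are hypothetical; nothing like this exists in the
current code, and which policy is right is exactly the open question.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical policy switch for invalid input. */
enum bad_input_policy {
	BAD_INPUT_DROP,		/* silently skip the offending word */
	BAD_INPUT_FAIL		/* give up and return EINVAL */
};

/*
 * Copy IN to OUT, treating surrogates that are not part of a
 * high/low pair according to POLICY.  OUT must have room for INLEN
 * words.  Returns 0 and stores the number of words written in
 * *OUTLENP, or returns EINVAL under BAD_INPUT_FAIL.
 */
int
utf16_sanitize(uint16_t *out, size_t *outlenp,
	       const uint16_t *in, size_t inlen,
	       enum bad_input_policy policy)
{
	size_t i = 0, n = 0;

	assert(out != NULL && outlenp != NULL);
	while (i < inlen) {
		uint16_t w = in[i];

		if (w >= 0xD800 && w <= 0xDBFF && i + 1 < inlen &&
		    in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
			/* Well-formed pair: keep both halves. */
			out[n++] = w;
			out[n++] = in[i + 1];
			i += 2;
		} else if (w >= 0xD800 && w <= 0xDFFF) {
			/* Lone high or low surrogate. */
			if (policy == BAD_INPUT_FAIL)
				return EINVAL;
			i++;
		} else {
			out[n++] = w;
			i++;
		}
	}
	*outlenp = n;
	return 0;
}
```

  Dropping keeps lookups working on damaged names but lets two
different inputs map to the same name; failing is stricter but can
make existing files unreachable.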


					yours,
					dillo