Subject: Re: CVS commit: src/sys/dev/usb
To: Tom Spindler <dogcow@babymeat.com>
From: Dieter Baron <dillo@danbala.tuwien.ac.at>
List: tech-kern
Date: 03/02/2007 14:29:23
On Thu, Mar 01, 2007 at 09:17:48AM -0800, Tom Spindler wrote:
> > #define UNICODE_DECOMPOSE    0x01  /* convert to decomposed NF */
> > #define UNICODE_PRECOMPOSE   0x02  /* convert to precomposed NF */
> 
> To be excessively pedantic: I'd also indicate here whether
> you're normalizing to NFD, NFC, NFKD, or NFKC, as defined in
> http://www.unicode.org/reports/tr15/ .

  Possibly, and also add the variant used by HFS+, if it is not
exactly equivalent to any of the above.  However, since these
conversion tables are large, maybe we should limit ourselves to the
variants we actually use.

> > size_t utf8_to_utf16(uint16_t *out, size_t outlen,
> > 		     const char *in, size_t inlen,
> > 		     int flags, int *errcountp);
> 
> Why bother specifying inlen? If you copy out at most outlen chars...

  IN need not be NUL-terminated, in which case OUT will not be either.
Neither the pathname component passed to VOP_LOOKUP, nor the on-disk
structure of at least HFS+ are NUL-terminated, so it makes little
sense to require NUL-termination in the conversion routine.

> To be extra whingy, I'd suggest you order the args more like strncpy -
> e.g. (out, in, inlen, flags, errcountp)

  Omitting outlen?  Actually, I don't care much about the order of
arguments.  I patterned it after memcpy (dst, src, len) but since
outlen and inlen need not be the same, both must be specified.


  To answer a question from your other mail: lengths are counted in
units of the type used: bytes for UTF-8 (type char), 16-bit words for
UTF-16 (type uint16_t).
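
  To illustrate the units, here is a minimal sketch.  The helper name
utf16_units_for() is hypothetical and not part of the proposed
interface, and it assumes IN is well-formed UTF-8.  It takes a length
in bytes and returns a length in 16-bit words, and it never looks for
a terminating NUL.

```c
#include <stddef.h>

/*
 * Hypothetical helper: number of uint16_t words needed to hold the
 * UTF-16 form of in[0..inlen).  inlen is in bytes; the result is in
 * 16-bit words.  Assumes well-formed UTF-8 and no NUL terminator.
 */
static size_t
utf16_units_for(const char *in, size_t inlen)
{
	size_t i, units = 0;

	for (i = 0; i < inlen; i++) {
		unsigned char c = (unsigned char)in[i];

		if ((c & 0xc0) == 0x80)
			continue;	/* continuation byte, already counted */
		/* 4-byte sequences (lead >= 0xf0) need a surrogate pair */
		units += (c >= 0xf0) ? 2 : 1;
	}
	return units;
}
```

  For the five bytes of "caf" followed by 0xc3 0xa9, this gives four
words; for a single four-byte sequence it gives two.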


> >   I'm not sure how to handle invalid input.
> > 
> >     -) fs/unicode.h assumes invalid UTF-8 sequences to be ISO 8859-1
> >        (Latin 1).  NB: ISO 8859-1 text has a very low likelihood of being
> >        valid UTF-8.
> 
> I think this is not reasonable.

  Our current implementation in fs/unicode.h does this, however.

  Remember, this is (among others) used to convert file names to be
stored on HFS+ partitions.  So if a user moves a file with a name
encoded in ISO 8859-1 (which is not uncommon, I gather) onto the HFS+
partition, she either gets an EINVAL, or the name is converted to UTF-8
and the file is moved.  If it is moved back onto an ffs partition, the
file name will remain UTF-8 encoded.  Which is better?  Maybe add a
sysctl knob and let the user decide?
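
  To make the second outcome concrete, a small sketch of what happens
to one ISO 8859-1 byte under the fallback.  The function name
latin1_byte_to_utf8() is hypothetical; it only relies on the fact that
ISO 8859-1 byte values equal the code points U+0000..U+00FF.

```c
#include <stddef.h>

/*
 * Hypothetical helper: re-encode one ISO 8859-1 byte as UTF-8.
 * Returns the number of bytes written to out (1 or 2).
 */
static size_t
latin1_byte_to_utf8(unsigned char c, unsigned char out[2])
{
	if (c < 0x80) {
		out[0] = c;		/* ASCII is unchanged */
		return 1;
	}
	out[0] = (unsigned char)(0xc0 | (c >> 6));
	out[1] = (unsigned char)(0x80 | (c & 0x3f));
	return 2;
}
```

  The single byte 0xe9 (e with acute accent) comes back as the two
bytes 0xc3 0xa9, so the name the user sees after the move is no longer
byte-identical to the one she started with.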

> >     -) What about UTF-16 surrogates that are not paired?
> 
> [4] says "Therefore a converter must treat this as an error." I'm
> inclined to agree.

  Okay, so do I.  The question was how to treat the error.  If it is a
file name stored on disk, do we drop it and make it impossible to
access that file?  Or do we simply UTF-8 encode the single surrogate,
returning invalid UTF-8 that can, however, be used to access the file?
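
  Whichever policy we pick, the converter first has to notice the
condition.  A minimal sketch of the detection, with the hypothetical
name count_unpaired_surrogates(); the result could feed something like
errcountp, but that wiring is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: count surrogates in in[0..inlen) that are not
 * part of a high/low pair.  inlen is in 16-bit words.
 */
static size_t
count_unpaired_surrogates(const uint16_t *in, size_t inlen)
{
	size_t i, n = 0;

	for (i = 0; i < inlen; i++) {
		if (in[i] >= 0xd800 && in[i] <= 0xdbff) {
			/* high surrogate: must be followed by a low one */
			if (i + 1 < inlen &&
			    in[i + 1] >= 0xdc00 && in[i + 1] <= 0xdfff)
				i++;	/* valid pair, skip the low half */
			else
				n++;
		} else if (in[i] >= 0xdc00 && in[i] <= 0xdfff) {
			n++;		/* low surrogate without a high one */
		}
	}
	return n;
}
```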

> >     -) What about overlong UTF-8 encodings (encoding a character in
> >        more bytes than necessary)?  The standard forbids these to be
> >        decoded, and they are unlikely to be meant as ISO 8859-1?
> 
> Given the sloppiness in canonicalization I've seen, I'd say silently
> accept it - or print a warning if debugging is turned on or the like.

  Overlong encoding has nothing to do with canonicalization; it's when
a Unicode character is encoded into more bytes than necessary.  For
example, U+002F (slash) is encoded as the single byte 0x2f.  Using the
two bytes 0xc0 0xaf is an overlong encoding.

  The UTF-8 specification explicitly forbids decoding them, since they
may bypass character checks made by Unicode-unaware routines (e.g.
the pathname splitting on '/').  If we decoded the overlong
encoding above, we would create a file with '/' as part of its name.
Problematic at best.
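
  A sketch of a decoder that refuses overlong forms, by checking the
decoded value against the smallest code point that needs that many
bytes.  The name utf8_decode_strict() and its signature are
hypothetical, not the proposed interface.  It deliberately does not
reject encoded surrogates, since that policy is still open.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode one character from in[0..inlen).
 * Returns the number of bytes consumed, or 0 if the sequence is
 * truncated, malformed, or overlong.
 */
static size_t
utf8_decode_strict(const unsigned char *in, size_t inlen, uint32_t *cp)
{
	/* smallest code point that legitimately needs len bytes */
	static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
	size_t i, len;
	uint32_t c;

	if (inlen == 0)
		return 0;
	if (in[0] < 0x80) {
		len = 1; c = in[0];
	} else if ((in[0] & 0xe0) == 0xc0) {
		len = 2; c = in[0] & 0x1f;
	} else if ((in[0] & 0xf0) == 0xe0) {
		len = 3; c = in[0] & 0x0f;
	} else if ((in[0] & 0xf8) == 0xf0) {
		len = 4; c = in[0] & 0x07;
	} else {
		return 0;	/* stray continuation or invalid lead byte */
	}
	if (len > inlen)
		return 0;	/* truncated */
	for (i = 1; i < len; i++) {
		if ((in[i] & 0xc0) != 0x80)
			return 0;
		c = (c << 6) | (in[i] & 0x3f);
	}
	if (c < min_cp[len] || c > 0x10ffff)
		return 0;	/* overlong, or beyond the Unicode range */
	*cp = c;
	return len;
}
```

  With this check, 0xc0 0xaf is rejected, while the single byte 0x2f
still decodes to a slash.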

> >     -) How do we want the file systems to deal with invalid input?
> >        Drop the offending bytes/words, signal an error (EINVAL)?
> 
> If possible, I'd like the following to happen: if we're writing our
> own filenames/whatever, raise an error; if it's on the media, we can't
> really do much about it so emit a warning at most. Dunno how practical
> this is, however.

  Not very, I'm afraid.  If we read a HFS+ directory, file names are
converted from UTF-16 to UTF-8.  If the calling program tries to open
one of the files, the UTF-8 string we created is converted back to
UTF-16.  In order to find the file, this round-trip conversion must be
the identity function (on case-sensitive HFS+ volumes, file names are
compared UTF-16 word by word).

  On invalid UTF-8 input from the user, we can be more pedantic; we
just have to accept invalid UTF-8 we could create for invalid on-disk
UTF-16.
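
  A sketch of the round-trip requirement for a single 16-bit word,
including a lone surrogate.  All three function names are
hypothetical, surrogate pairs are left out for brevity, and the
decoder only understands the forms the encoder produces.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one 16-bit word as if it were a code point; surrogates are
 * not rejected, so the output may be invalid UTF-8. */
static size_t
unit_to_utf8(uint16_t u, unsigned char out[3])
{
	if (u < 0x80) {
		out[0] = (unsigned char)u;
		return 1;
	}
	if (u < 0x800) {
		out[0] = (unsigned char)(0xc0 | (u >> 6));
		out[1] = (unsigned char)(0x80 | (u & 0x3f));
		return 2;
	}
	out[0] = (unsigned char)(0xe0 | (u >> 12));
	out[1] = (unsigned char)(0x80 | ((u >> 6) & 0x3f));
	out[2] = (unsigned char)(0x80 | (u & 0x3f));
	return 3;
}

/* Inverse of unit_to_utf8() for the 1- to 3-byte forms it emits. */
static int
utf8_to_unit(const unsigned char *in, size_t len, uint16_t *u)
{
	if (len == 1 && in[0] < 0x80)
		*u = in[0];
	else if (len == 2)
		*u = (uint16_t)(((in[0] & 0x1f) << 6) | (in[1] & 0x3f));
	else if (len == 3)
		*u = (uint16_t)(((in[0] & 0x0f) << 12) |
		    ((in[1] & 0x3f) << 6) | (in[2] & 0x3f));
	else
		return 0;
	return 1;
}

/* Nonzero if UTF-16 -> UTF-8 -> UTF-16 gives back the same word. */
static int
roundtrips(uint16_t u)
{
	unsigned char buf[3];
	uint16_t back;
	size_t n = unit_to_utf8(u, buf);

	return utf8_to_unit(buf, n, &back) && back == u;
}
```

  Here the lone surrogate 0xd800 becomes 0xed 0xa0 0x80, which is not
valid UTF-8 but converts back to 0xd800, so the file stays reachable.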

> A teensy implementation issue: I think it might not be a bad thing to
> emit BOMs in both converted UTF-8 and UTF-16 strings - although I don't
> know how well other OS's filesystem code would handle that. :-/

  Not at all, see above.  Also, BOMs make no sense in UTF-8, and since
we create UTF-16 in host byte order, BOMs make little sense there
either.

					yours,
					dillo