Subject: Re: CVS commit: src/sys/dev/usb
To: Dieter Baron <dillo@danbala.tuwien.ac.at>
From: Tom Spindler <dogcow@babymeat.com>
List: tech-kern
Date: 03/01/2007 09:17:48
> #define UNICODE_DECOMPOSE    0x01  /* convert to decomposed NF */
> #define UNICODE_PRECOMPOSE   0x02  /* convert to precomposed NF */

To be excessively pedantic: I'd also indicate here whether
you're normalizing to NFD, NFC, NFKD, or NFKC, as defined in
http://www.unicode.org/reports/tr15/ . See my commentary on the
invalid input stuff, too.
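
Purely for illustration - these names are mine, not anything that's
actually in fs/unicode.h - flags that spell out the TR15 form would
look something like:

#define UNICODE_NFD	0x01	/* canonical decomposition */
#define UNICODE_NFC	0x02	/* canonical decomposition + composition */
#define UNICODE_NFKD	0x04	/* compatibility decomposition */
#define UNICODE_NFKC	0x08	/* compatibility decomposition + composition */

That way it's explicit whether compatibility mappings (ligatures,
fullwidth forms, and the like) get folded in or not.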

> size_t utf8_to_utf16(uint16_t *out, size_t outlen,
> 		     const char *in, size_t inlen,
> 		     int flags, int *errcountp);

Why bother specifying inlen?  If you only ever copy out at most outlen
chars, the second length seems redundant.  To be extra whingy, I'd
suggest you order the args more like strncpy(3) -
e.g. (out, in, inlen, flags, errcountp).
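
Just to sketch the ordering I mean (not a counter-proposal for the
final prototype - the argument names are the ones quoted above):

size_t utf8_to_utf16(uint16_t *out, const char *in, size_t inlen,
		     int flags, int *errcountp);

strncpy(3) takes dst, src, len in that order, which is the shape most
people's fingers already expect.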

> HFS+ requires file names to be stored in decomposed
> form (u+0065 u+0301), while, IIUC, NTFS requires them to be in
> pre-composed form (u+0E9).  [I have not yet implemented
> (de)composition.]

From what teh intarweb tells me (references at end), HFS uses
FCD[1] ("Fast C or D"), which is a superset[2] of NFD. :-/ Similarly,
when Windows uses NTFS it happens to use precomposed characters - but
NTFS itself doesn't actually notice or specify which form to use.[3]
(Worse, it's only defined to use chars in the BMP, esp. for
sorting/comparison purposes.)
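
For concreteness, the e-acute example in both forms - these are just
the standard encodings, written out for illustration:

/* precomposed: u+00E9 */
static const uint16_t eacute_pre_utf16[] = { 0x00e9 };
static const char     eacute_pre_utf8[]  = "\xc3\xa9";

/* decomposed (what HFS+ stores): u+0065 u+0301 */
static const uint16_t eacute_dec_utf16[] = { 0x0065, 0x0301 };
static const char     eacute_dec_utf8[]  = "\x65\xcc\x81";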

>   I'm not sure how to handle invalid input.
> 
>     -) fs/unicode.h assumes invalid UTF-8 sequences to be ISO 8859-1
>        (Latin 1).  NB: ISO 8859-1 text has a very low likelihood of being
>        valid UTF-8.

I don't think treating invalid UTF-8 as Latin 1 is a reasonable
assumption.

>     -) What about UTF-16 surrogates that are not paired?

[4] says "Therefore a converter must treat this as an error." I'm
inclined to agree.
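
A minimal sketch of what "treat it as an error" could look like when
walking UTF-16 - my own code, not anything from the committed sources:

#include <stddef.h>
#include <stdint.h>

static int
utf16_check_surrogates(const uint16_t *s, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (s[i] >= 0xd800 && s[i] <= 0xdbff) {
			/* high surrogate: must be followed by a low one */
			if (i + 1 >= n ||
			    s[i + 1] < 0xdc00 || s[i + 1] > 0xdfff)
				return -1;	/* unpaired high surrogate */
			i++;			/* consume the low half */
		} else if (s[i] >= 0xdc00 && s[i] <= 0xdfff) {
			return -1;		/* stray low surrogate */
		}
	}
	return 0;
}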
 
>     -) What about overlong UTF-8 encodings (encoding a character in
>        more bytes than necessary)?  The standard forbids these to be
>        decoded, and they are unlikely to be meant as ISO 8859-1?

Given the sloppiness in canonicalization I've seen, I'd say silently
accept it - or print a warning if debugging is turned on or the like.
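
(The classic offender is 0xc0 0xaf, an overlong encoding of '/'.
Detecting overlong forms is cheap once you've decoded a sequence;
a sketch, not the committed code:

static int
utf8_is_overlong(uint32_t cp, int nbytes)
{
	switch (nbytes) {
	case 2:	return cp < 0x80;	/* 2-byte forms start at u+0080 */
	case 3:	return cp < 0x800;	/* 3-byte forms start at u+0800 */
	case 4:	return cp < 0x10000;	/* 4-byte forms start at u+10000 */
	default: return 0;
	}
}

so even if we accept them silently, we can still count or warn about
them when debugging is on.)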
 
>     -) How do we want the file systems to deal with invalid input?
>        Drop the offending bytes/words, signal an error (EINVAL)?

If possible, I'd like the following to happen: if we're writing our
own filenames/whatever, raise an error; if it's on the media, we can't
really do much about it so emit a warning at most. Dunno how practical
this is, however.
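
Roughly what I have in mind - utf8_check() and the 'writing' flag are
made-up names, purely to illustrate the split:

static int
check_name(const char *name, size_t namelen, int writing)
{
	if (utf8_check(name, namelen) != 0) {
		if (writing)
			return EINVAL;	/* our own name: refuse it */
		/* already on the media: warn at most, then carry on */
		printf("bad UTF-8 in on-disk name\n");
	}
	return 0;
}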

A teensy implementation issue: I think it might not be a bad thing to
emit BOMs in both converted UTF-8 and UTF-16 strings - although I don't
know how well other OSes' filesystem code would handle that. :-/
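
For reference, the BOM is u+FEFF; hypothetical constants, if we went
that way:

static const char     utf8_bom[]  = "\xef\xbb\xbf";	/* u+FEFF in UTF-8 */
static const uint16_t utf16_bom   = 0xfeff;		/* native-endian UTF-16 */

The UTF-8 one in particular tends to trip up software that isn't
expecting it, which is part of why I'm unsure about other OSes.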

references:
[1] http://www.mindspring.com/~markus.scherer/unicode/fcd.html
[2] http://www.limewire.org/pipermail/gui-dev/2003-January/001118.html
[3] http://blogs.msdn.com/michkap/archive/2006/09/24/769540.aspx
[4] http://www.unicode.org/unicode/faq/utf_bom.html#39