Subject: Re: CVS commit: src/sys/dev/usb
To: Dieter Baron <dillo@danbala.tuwien.ac.at>
From: Tom Spindler <dogcow@babymeat.com>
List: tech-kern
Date: 03/01/2007 09:17:48
> #define UNICODE_DECOMPOSE 0x01 /* convert to decomposed NF */
> #define UNICODE_PRECOMPOSE 0x02 /* convert to precomposed NF */
To be excessively pedantic: I'd also indicate here whether
you're normalizing to NFD, NFC, NFKD, or NFKC, as defined in
http://www.unicode.org/reports/tr15/ . See my commentary on the
invalid input stuff, too.
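A minimal sketch of what naming the form in the flags could look like. The
flag names and the helper are hypothetical, and the choice of the canonical
forms (NFD/NFC) rather than the compatibility forms (NFKD/NFKC) is an
assumption, not something the original header states.

```c
/* Hypothetical flag names that spell out the TR15 normalization form. */
#define UNICODE_NORMALIZE_NFD 0x01 /* canonical decomposition */
#define UNICODE_NORMALIZE_NFC 0x02 /* canonical decomposition, then canonical composition */

/*
 * Hypothetical helper: returns 1 if at most one normalization form is
 * requested in flags, 0 if both are set at once.
 */
int
unicode_norm_flags_valid(int flags)
{
	int both = UNICODE_NORMALIZE_NFD | UNICODE_NORMALIZE_NFC;

	return (flags & both) != both;
}
```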
> size_t utf8_to_utf16(uint16_t *out, size_t outlen,
> const char *in, size_t inlen,
> int flags, int *errcountp);
Why bother specifying inlen? If you copy out at most outlen chars...
To be extra whingy, I'd suggest you order the args more like strncpy -
e.g. (out, in, inlen, flags, errcountp)
> HFS+ requires file names to be stored in decomposed
> form (u+0065 u+0301), while, IIUC, NTFS requires them to be in
> pre-composed form (u+00E9). [I have not yet implemented
> (de)composition.]
From what teh intarweb tells me (references at end), HFS uses
FCD[1] ("Fast C or D") which is a superset[2] of NFD. :-/ Similarly,
when Windows uses NTFS, it happens to use precomposed characters - but
NTFS itself doesn't actually notice or specify which to use.[3] (Worse,
it's only defined to use chars in the BMP, esp. for sorting/comparison
purposes.)
> I'm not sure how to handle invalid input.
>
> -) fs/unicode.h assumes invalid UTF-8 sequences to be ISO 8859-1
> (Latin 1). NB: ISO 8859-1 text has a very low likelihood of being
> valid UTF-8.
I think this is not reasonable.
> -) What about UTF-16 surrogates that are not paired?
[4] says "Therefore a converter must treat this as an error." I'm
inclined to agree.
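A minimal sketch of how a converter might spot an unpaired surrogate so it
can treat it as an error. The function name and its "return the index"
convention are assumptions for illustration, not part of the proposed API.

```c
#include <stddef.h>
#include <stdint.h>

static int
is_high_surrogate(uint16_t u)
{
	return u >= 0xD800 && u <= 0xDBFF;
}

static int
is_low_surrogate(uint16_t u)
{
	return u >= 0xDC00 && u <= 0xDFFF;
}

/*
 * Hypothetical helper: returns the index of the first unpaired surrogate
 * in s[0..len), or len if every surrogate is part of a high/low pair.
 */
size_t
utf16_first_unpaired(const uint16_t *s, size_t len)
{
	size_t i = 0;

	while (i < len) {
		if (is_high_surrogate(s[i])) {
			/* A high surrogate must be followed by a low one. */
			if (i + 1 < len && is_low_surrogate(s[i + 1])) {
				i += 2;
				continue;
			}
			return i;
		}
		/* A low surrogate with no preceding high one is unpaired. */
		if (is_low_surrogate(s[i]))
			return i;
		i++;
	}
	return len;
}
```

A caller would compare the result against len and bump the error count (or
fail) when they differ.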
> -) What about overlong UTF-8 encodings (encoding a character in
> more bytes than necessary)? The standard forbids these to be
> decoded, and they are unlikely to be meant as ISO 8859-1?
Given the sloppiness in canonicalization I've seen, I'd say silently
accept it - or print a warning if debugging is turned on or the like.
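A minimal sketch of the "accept it, but notice it" approach: decode one
UTF-8 sequence, accept overlong forms, and report them through a flag so the
caller can warn when debugging is on. The function name and signature are
hypothetical. The sketch deliberately does not reject encoded surrogates or
values above U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decodes one sequence from in[0..inlen).
 * Returns the number of bytes consumed, or 0 if the sequence is
 * malformed or truncated. On success *cp holds the code point and
 * *overlong is 1 if more bytes were used than necessary.
 */
size_t
utf8_decode_lenient(const unsigned char *in, size_t inlen,
    uint32_t *cp, int *overlong)
{
	/* Smallest code point that needs an n-byte encoding. */
	static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
	size_t n, i;
	uint32_t c;

	if (inlen == 0)
		return 0;
	if (in[0] < 0x80) {
		n = 1; c = in[0];
	} else if ((in[0] & 0xE0) == 0xC0) {
		n = 2; c = in[0] & 0x1F;
	} else if ((in[0] & 0xF0) == 0xE0) {
		n = 3; c = in[0] & 0x0F;
	} else if ((in[0] & 0xF8) == 0xF0) {
		n = 4; c = in[0] & 0x07;
	} else {
		return 0;	/* stray continuation byte or invalid lead */
	}
	if (n > inlen)
		return 0;	/* truncated sequence */
	for (i = 1; i < n; i++) {
		if ((in[i] & 0xC0) != 0x80)
			return 0;	/* not a continuation byte */
		c = (c << 6) | (in[i] & 0x3F);
	}
	*cp = c;
	*overlong = c < min_for_len[n];
	return n;
}
```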
> -) How do we want the file systems to deal with invalid input?
> Drop the offending bytes/words, signal an error (EINVAL)?
If possible, I'd like the following to happen: if we're writing our
own filenames/whatever, raise an error; if it's on the media, we can't
really do much about it so emit a warning at most. Dunno how practical
this is, however.
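A minimal sketch of that policy, assuming the converter's error count is
available and the file system knows where the name came from. The enum, the
function name, and the warn flag are all hypothetical. Whether a file system
can actually distinguish the two cases this cleanly is the open question
above.

```c
#include <errno.h>

/* Hypothetical: where the name being converted originated. */
enum name_origin {
	NAME_FROM_CALLER,	/* we are about to write it ourselves */
	NAME_FROM_MEDIA		/* it is already on disk */
};

/*
 * Hypothetical helper: returns an errno value (0 on success).
 * Invalid names from the caller are rejected with EINVAL; invalid
 * names found on the media are tolerated, with *warnp set to 1 so
 * the caller can log a warning.
 */
int
invalid_name_action(enum name_origin origin, int errcount, int *warnp)
{
	*warnp = 0;
	if (errcount == 0)
		return 0;
	if (origin == NAME_FROM_CALLER)
		return EINVAL;
	*warnp = 1;
	return 0;
}
```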
A teensy implementation issue: I think it might not be a bad thing to
emit BOMs in both converted UTF-8 and UTF-16 strings - although I don't
know how well other OSes' filesystem code would handle that. :-/
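For concreteness, a minimal sketch of emitting a BOM (U+FEFF) at the start
of an output buffer. Both function names are hypothetical, and the UTF-16
variant assumes host byte order.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: writes the UTF-8 encoding of U+FEFF.
 * Returns the number of bytes written (3), or 0 if it does not fit.
 */
size_t
utf8_put_bom(unsigned char *out, size_t outlen)
{
	if (outlen < 3)
		return 0;
	out[0] = 0xEF;
	out[1] = 0xBB;
	out[2] = 0xBF;
	return 3;
}

/*
 * Hypothetical helper: writes U+FEFF as one UTF-16 code unit in host
 * byte order. Returns the number of units written (1), or 0 if it
 * does not fit.
 */
size_t
utf16_put_bom(uint16_t *out, size_t outlen)
{
	if (outlen < 1)
		return 0;
	out[0] = 0xFEFF;
	return 1;
}
```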
references:
[1] http://www.mindspring.com/~markus.scherer/unicode/fcd.html
[2] http://www.limewire.org/pipermail/gui-dev/2003-January/001118.html
[3] http://blogs.msdn.com/michkap/archive/2006/09/24/769540.aspx
[4] http://www.unicode.org/unicode/faq/utf_bom.html#39