tech-userlevel archive


Re: Unicode programming



>> - Internally to your programs, do you use UTF-8 as your representation?
>>   UTF-16?  UTF-32?  I know, this depends on what you're doing; I'm just
>>   trying to get a sense of what is common.
>
>So far I've used UTF-32/UCS-4 for internal representation and UTF-8 as
>external representation only in my own software.  A complication exists
>if invalid UTF-8 input sequences are possible (an example is IRC or
>random user-provided data), in which case there are various possible
>solutions:

Yeah, I've been thinking about that as well.  I get the impression
that the "correct" answer is to simply raise an error and reject the invalid
sequence.  In this particular application at least it is unlikely
that something else will be mislabeled as UTF-8.
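
For what it's worth, here is a minimal sketch of the "error and reject"
approach, in Python only because it is short. The function names
(decode_strict_utf8, is_valid_utf8) are made up for illustration, and this
is just one way to do it, not a recommendation for any particular program:

```python
def decode_strict_utf8(raw: bytes) -> str:
    """Decode raw bytes as UTF-8, rejecting any invalid sequence."""
    # errors="strict" is already the default; it is spelled out here to
    # make the policy explicit. A malformed sequence raises
    # UnicodeDecodeError instead of being replaced or skipped.
    return raw.decode("utf-8", errors="strict")


def is_valid_utf8(raw: bytes) -> bool:
    """Return True only if raw is well-formed UTF-8."""
    try:
        decode_strict_utf8(raw)
    except UnicodeDecodeError:
        return False
    return True
```

With this, Latin-1 input such as b"caf\xe9" is rejected outright rather
than silently turned into something else, which seems to be the behaviour
people call "correct".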

>A problematic example is filenames in file systems which allow
>arbitrary bytes (like FFS).  I tend to encounter both LATIN-1 and UTF-8
>filenames in French, but filenames are not tagged with an encoding.
>When you control the file creation and the remote protocol lets you
>know the encoding, I guess that you could tag filenames either using a
>MIME message header-like format (i.e. =?UTF-8?B?<...>=?=) or using an
>extended attribute or custom metadata format, but there is no definite
>standard to tag Unicode filenames with their encoding.  Some file
>systems expect valid UTF-16 or UTF-8 strings, though.

Yeah, even worse I found out that Mac OS X rewrites all filenames to UTF-8
using Normalization Form D, which I find particularly unfriendly (I like
Solaris's solution better; the original UTF-8 is preserved, but you cannot
create two files with different Unicode sequences that normalize to the
same thing).
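
To make the normalization issue concrete, here is a small sketch using
Python's standard unicodedata module. The helper name same_after_nfd is
hypothetical, and this only illustrates the comparison; it says nothing
about how either file system actually implements it:

```python
import unicodedata

composed = "\u00e9"     # "é" as a single precomposed code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT


def same_after_nfd(a: str, b: str) -> bool:
    """Compare two strings after converting both to Normalization Form D."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)
```

The two strings differ as code point sequences (and as UTF-8 bytes), but
same_after_nfd(composed, decomposed) is true, so under the Solaris
behaviour described above they could not both exist as names in one
directory, while a file system that rewrites to NFD would store the
decomposed form for either.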

>An area which I found slightly challenging was allowing the user to
>search within Unicode data.

Sigh, I don't even want to think about that right now.  Baby steps ...

--Ken

