Subject: Re: Unicode support in iso9660.
To: None <tech-kern@netbsd.org>
From: der Mouse <mouse@Rodents.Montreal.QC.CA>
List: tech-kern
Date: 11/22/2004 13:20:16
>> (1) it assumes that all filenames are, fundamentally, encoded
>>    character sequences;
> The GTK applications do need to display the filenames using X11.  How
> to do that without assuming that filenames are encoded character
> sequences?

By going the other way, by treating them as octet sequences which, for
display purposes, need to be represented as character sequences.

What method of converting the octet sequences to character sequences is
best depends on factors which a program cannot really determine without
human help - for one thing, the implicit comparative behind "best" is
largely human-driven.
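
A minimal sketch of that split, in Python, might look like the following.
The function name display_name and its encoding parameter are made up for
illustration; the point is only that the octets stay the name, and the
character form is a rendering picked by a human-supplied setting.

```python
def display_name(raw: bytes, encoding: str = "utf-8") -> str:
    """Render an octet-sequence filename as characters, for display only.

    `encoding` stands in for the human's choice (a locale setting, a
    command-line option, ...).  Octets that do not decode under it are
    shown as \\xNN escapes rather than being dropped or raising an error.
    """
    return raw.decode(encoding, errors="backslashreplace")


# The name itself is still the bytes object; only the rendering changes.
name = b"caf\xe9"
as_latin1 = display_name(name, "latin-1")  # decodes to four characters
as_utf8 = display_name(name)               # 0xE9 is not valid UTF-8 here
```

The same octets yield different character strings depending on the
encoding chosen, which is why the program cannot pick the "best" one
without outside help.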

For filenames that fundamentally *are* character sequences (most
keyboard-entered filenames, for example), ideally, you want to use the
inverse of the mapping that was used to convert those characters to
octets.  UTF-8 works fairly well for most of these, actually; the only
problems with it are (a) it assumes that everyone has Unicode fonts -
don't forget that a lot of filesystem programs don't use X directly -
and (b) not all keyboard-entered names are fundamentally characters; in
some cases the idea is "type something that will give me the octet
sequence this other piece of code generated".
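
To make the "same mapping in both directions" point concrete, here is a
small Python sketch.  The helper names to_octets and to_characters are
invented for the example, and the choice of UTF-8 and Latin-1 is only an
assumption to show what happens when the two directions disagree.

```python
def to_octets(typed: str, encoding: str) -> bytes:
    """Characters -> octets, as done when the name was created."""
    return typed.encode(encoding)


def to_characters(octets: bytes, encoding: str) -> str:
    """Octets -> characters, as done when the name is displayed."""
    return octets.decode(encoding)


typed = "caf\u00e9"                       # what the human typed
stored = to_octets(typed, "utf-8")        # what the filesystem holds

same = to_characters(stored, "utf-8")     # matching mapping: round-trips
wrong = to_characters(stored, "latin-1")  # mismatched mapping: garbage
```

With the matching mapping the typed characters come back unchanged; with
the mismatched one the display shows two characters where the human
typed one.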

Problem (b) is relatively common, but tends to be little-noticed
because most filenames that aren't fundamentally characters get encoded
in characters, and people tend to confuse the underlying name with its
encoding as characters.  For example, mail messages in my mailbox are,
fundamentally, named with small integers.  These get encoded in octet
sequences for use by the filesystem - but because of the probability
that humans will want to work with them with character-based display
and input tools, the encoding chosen is to represent the number in base
10 and then convert each abstract digit to the corresponding ASCII
digit (i.e., add 48).  But this is a matter of convenience only, and when
typing a name to refer to such a file, it is more important that it
match what the program generated than that the sequence of characters
make any sense to a human.  (Sendmail queue files are another case;
there, the abstract name is more opaquely encoded, and the resulting
sequence of octets generally makes little sense as characters, though
they are chosen from a set that is unlikely to do bad things when taken
as an encoded character sequence.)
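
The mailbox encoding described above can be sketched in Python roughly as
follows.  The function names are hypothetical and this is not taken from
any actual mail software; it only spells out "base 10, then add 48 to
each digit" and the matching decode.

```python
def encode_message_number(n: int) -> bytes:
    """Small integer -> octet sequence: base-10 digits, each plus 48."""
    if n < 0:
        raise ValueError("message numbers are non-negative")
    digits = []
    while True:
        n, d = divmod(n, 10)
        digits.append(d + 48)       # abstract digit -> ASCII digit octet
        if n == 0:
            break
    return bytes(reversed(digits))  # most significant digit first


def decode_message_number(name: bytes) -> int:
    """Octet sequence -> small integer; rejects non-digit octets."""
    if not name:
        raise ValueError("empty name")
    value = 0
    for octet in name:
        d = octet - 48
        if not 0 <= d <= 9:
            raise ValueError("not a message name: %r" % (name,))
        value = value * 10 + d
    return value
```

Note that nothing here involves characters at all: the name is the
integer, the filesystem sees octets, and the fact that those octets
happen to display as digits is the convenience.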

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B