tech-kern: Re: Unicode support in iso9660.

Subject: Re: Unicode support in iso9660.
To: None <tech-kern@netbsd.org>
From: der Mouse <mouse@Rodents.Montreal.QC.CA>
List: tech-kern
Date: 11/22/2004 07:04:00
>>> It seems that all gtk2+ using applications [...] are supposed to
>>> assume that all path names are in UTF-8.
>> That seems rather severely broken, for any application that reads
>> filenames from the filesystem, [...]
> They are supposed to be used on encoding-agnostic filesystems - linux
> filesystems are not different in this respect.  I do not see anything
> wrong with it - even if the filesystem is encoding-agnostic, the
> filenames are encoded somehow,

You're falling into the same trap the people who started this off were:
you're thinking of filenames as character sequences, which have to be
encoded into octet sequences.

Certainly some filenames are.  Any chosen by a human for some kind of
human-perceptible meaning, for example, almost certainly will be.

But not all will be; I've remarked that I've written code that uses
filenames as a place to associate data with a file's contents (by
encoding various numbers in bases such as 16, 84, or 254), where the
conceptual file name is a large integer, with characters involved only
for entities that insist on interpreting octets as encoded characters.
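
To make the "filename as a large integer" idea concrete, here is a
minimal sketch in Python.  It is not the code referred to above; the
names DIGITS, VALUE, int_to_name and name_to_int are hypothetical, and
it assumes that base 254 means "every octet except NUL and '/'", the
two octets a Unix pathname component cannot contain.

```python
# Hypothetical sketch: a non-negative integer stored as a filename.
# Assumption: the 254 digits are all octets except NUL (0x00) and
# '/' (0x2F), which cannot appear in a Unix pathname component.
DIGITS = bytes(b for b in range(256) if b not in (0x00, 0x2F))
VALUE = {octet: i for i, octet in enumerate(DIGITS)}


def int_to_name(n: int) -> bytes:
    """Encode n as a big-endian base-254 octet sequence."""
    if n < 0:
        raise ValueError("n must be non-negative")
    out = bytearray()
    while True:
        n, digit = divmod(n, 254)
        out.append(DIGITS[digit])
        if n == 0:
            break
    out.reverse()
    return bytes(out)


def name_to_int(name: bytes) -> int:
    """Decode an octet sequence produced by int_to_name."""
    n = 0
    for octet in name:
        n = n * 254 + VALUE[octet]
    return n

# Caveat: a real scheme would also have to avoid the reserved names
# "." and "..", for example by adding a fixed prefix octet.
```

With this digit set, int_to_name(253) is the single octet 0xFF, which
can never occur in valid UTF-8, so a tool that insists on decoding
names as UTF-8 has no meaningful character to show for it.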

> so all applications should agree on a common convention, and UTF-8
> was considered the best choice for such convention.

By one particular toolkit developer (or developer group).  One can
hardly use "agree" to refer to one group deciding what's Right and then
charging off and doing it in the hope others will do likewise.
Otherwise, we've agreed on using info files for documentation, because
that's what the FSF has decided on...and we've agreed to speak
Norwegian, because that's what the Norwegian government has chosen.

>> [What should programs do upon finding a] pathname component
>> containing an octet sequence which is not UTF-8?  I can't think of
>> _any_ action which isn't wrong in a substantial fraction of the
>> cases.
> Either you don't have such filenames in your filesystem, and the
> situation you describe can occur only as a result of user error, or
> you have them, in that case you don't want to use UTF-8 for filenames
> at all,

This is all very well on a single-user system, where a single person
can decide by fiat what shall and shall not be present.

But suppose I want my filenames to be charset-agnostic octet sequences,
and another user thinks filenames should be character sequences encoded
in UTF-8.  What happens when sie runs something that trips over one of
my files?

"User error"?  Which user made an error, and what was it?

> (Having some filenames in UTF-8 and others not does not make much
> sense, as you can't tell the encoding for a particular filename.)

Sure I can.

If it's mine it may well not be a character sequence at all, but an
octet sequence, with characters coming into it only secondarily.

If it's that other user's, it's UTF-8.

Maybe if it's someone else's, it's 8859-7, or whatever.

A program probably can't tell (though it may be able to rule out some
cases).  But why should a program need to?  File names as character
sequences make sense only (a) when being fed back to the entity that
created them, or a closely related one (eg, a human names a file, a
human reads the name), in which case it's somebody else's problem to
make sure the input and output environments use compatible
encodings, or deal with the fallout, or (b) as an intermediate
representation used to transfer an octet sequence from one place/time
to another (as when printing something out for later input, possibly
elsewhere).
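
The point that a program "may be able to rule out some cases" can be
sketched as follows.  This is only an illustration in Python; the
function name could_be_utf8 and the sample names are hypothetical.
Strict decoding can show that a name is not UTF-8, but a successful
decode proves nothing, since the same octets are usually valid in
some single-byte charset as well.

```python
def could_be_utf8(name: bytes) -> bool:
    """Return False only when UTF-8 can be ruled out for these octets."""
    try:
        name.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Sample names, written with escapes so this text stays plain ASCII.
koi8_name = "\u0444\u0430\u0439\u043b".encode("koi8_r")   # Cyrillic word
latin1_name = "caf\u00e9".encode("iso8859_1")              # e-acute, 8859-1
utf8_name = "caf\u00e9".encode("utf-8")                    # e-acute, UTF-8

ruled_out = [n for n in (koi8_name, latin1_name, utf8_name)
             if not could_be_utf8(n)]

# The UTF-8 name is also a perfectly legal 8859-1 octet sequence; it
# just means something else there (two characters instead of one).
as_latin1 = utf8_name.decode("iso8859_1")
```

Here ruled_out holds the KOI8-R and 8859-1 names, while utf8_name
passes the check and yet also decodes without error as 8859-1.  The
check narrows the possibilities; it does not identify the encoding.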

Yes, this way of looking at it pushes a significant fraction of the
burden off onto humans.  However, nothing else I've seen both supports
human levels of flexibility in encoding (such as having KOI-8 and
8859-1 file names in the same directory, some for use in one
environment and some in another, relying on the human ability to tell
what makes sense and what doesn't) and permits use of filenames that
fundamentally _aren't_ character sequences.
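
One way a program can stay out of the encoding question entirely is
to carry names as opaque octets, converting to text only for display.
The sketch below assumes Python and uses its "surrogateescape" error
handler (PEP 383), which maps undecodable octets to placeholder code
points and back without loss.  The helper names to_text and to_octets
are hypothetical.

```python
def to_text(name: bytes) -> str:
    """Octets -> str; undecodable octets become lone surrogates."""
    return name.decode("utf-8", errors="surrogateescape")


def to_octets(text: str) -> bytes:
    """Inverse of to_text; restores the original octets exactly."""
    return text.encode("utf-8", errors="surrogateescape")


# A mixed "directory": a UTF-8 name, a KOI8-R name, and a name that
# is not a character sequence at all.
names = [
    b"caf\xc3\xa9",
    "\u0444\u0430\u0439\u043b".encode("koi8_r"),
    b"\xff\xfe\x01\x80",
]

# Every name survives the trip through the str type unchanged.
round_tripped = [to_octets(to_text(n)) for n in names]
```

The program never needs to know which charset, if any, a given name
was written in; it only promises to hand back the same octets it was
given.  Deciding what they mean is still left to the human.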

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B