tech-kern: Re: Unicode support in iso9660.

Subject: Re: Unicode support in iso9660.
To: der Mouse <mouse@Rodents.Montreal.QC.CA>
From: Pavel Cahyna <pavel.cahyna@st.cuni.cz>
List: tech-kern
Date: 11/22/2004 15:19:25

On Mon, 22 Nov 2004 12:04:00 +0000, der Mouse wrote:

>> so all applications should agree on a common convention, and UTF-8 was
>> considered the best choice for such convention.
> 
> By one particular toolkit developer (or developer group).  One can hardly
> use "agree" to refer to one group deciding what's Right and then charging
> off and doing it in the hope others will do likewise. Otherwise, we've

There is a need for one common encoding, and UTF-8 is a natural
candidate, so I do not see anything wrong with that choice. And GTK+ offers
you the needed flexibility if you don't like it.
 
>>> [What should programs do upon finding a] pathname component containing
>>> an octet sequence which is not UTF-8?  I can't think of _any_ action
>>> which isn't wrong in a substantial fraction of the cases.
>> Either you don't have such filenames in your filesystem, and the
>> situation you describe can occur only as a result of user error, or you
>> have them, in that case you don't want to use UTF-8 for filenames at
>> all,
> 
> This is all very well on a single-user system, where a single person can
> decide by fiat what shall and shall not be present.
> 
> But suppose I want my filenames to be charset-agnostic octet sequences,
> and another user thinks filenames should be character sequences encoded
> in UTF-8.  What happens when sie runs something that trips over one of
> my files?
> 
> "User error"?  Which user made an error, and what was it?

Both made the error of poor communication: you can't expect another user's
files to display correctly if you haven't agreed on a common encoding. If
you need to look into other people's $HOME, you are probably working
together on some common project, so there should be no problem in choosing
a common encoding.

>> (Having some filenames in UTF-8 and others not does not make much
>> sense, as you can't tell the encoding for a particular filename.)
> 
> Sure I can.
> 
> If it's mine it may well not be a character sequence at all, but an
> octet sequence, with characters coming into it only secondarily.

Well, that's your decision. I do not propose any changes that would disable
that possibility, but IMHO it should not influence the design of commonly
used applications.
 
> If it's that other user's, it's UTF-8.
> 
> Maybe if it's someone else's, it's 8859-7, or whatever.

OK, on multiuser systems we can at least say that the encoding should be
chosen per $HOME, if not system-wide.

> A program probably can't tell (though it may be able to rule out some
> cases).  But why should a program need to?  

To display them properly.
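
As a rough illustration of what "ruling out some cases" and "displaying
properly" could mean in practice, here is a minimal sketch. It assumes the
program receives the file name as raw octets; the helper names
is_valid_utf8 and display_name are made up for this example and are not
taken from GTK+ or any other toolkit. Strict decoding can prove a name is
not UTF-8, but it cannot tell which legacy encoding was meant, so the
fallback only escapes the offending octets instead of guessing.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the octet sequence is well-formed UTF-8."""
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def display_name(raw: bytes) -> str:
    """Return a printable form of a file name given as raw octets.

    Hypothetical policy: show valid UTF-8 as text, and escape anything
    else rather than guessing at an unknown legacy encoding.
    """
    if is_valid_utf8(raw):
        return raw.decode("utf-8")
    # Invalid octets are rendered as \xNN escapes; nothing is lost or guessed.
    return raw.decode("utf-8", errors="backslashreplace")


if __name__ == "__main__":
    # "caf" + e-acute encoded as UTF-8, then as ISO 8859-1.
    print(display_name(b"caf\xc3\xa9"))
    print(display_name(b"caf\xe9"))
```

Note that this can only rule UTF-8 out. The same two octets can be
perfectly valid in both KOI8-R and 8859-1, so no such test distinguishes
those from each other.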

> File names as character
> sequences make sense only (a) when being fed back to the entity that
> created them, or a closely related one (eg, a human names a file, a
> human reads the name), in which case it's somebody else's problem to
> make sure the input and output environments to use compatible encodings,
> or deal with the fallout,

So I should choose a font with the appropriate encoding in file-selector
widgets every time I encounter a filename in another encoding? This
sucks - people expect the file-selector widgets and so on to "just work".

> Yes, this way of looking at it pushes a significant fraction of the
> burden off onto humans.  However, nothing else I've seen both supports
> human levels of flexibility in encoding (such as having KOI-8 and 8859-1
> file names in the same directory, some for use in one environment and
> some in another, relying on the human ability to tell what makes sense

Why have flexibility in encoding at all? One encoding that can represent
all possible characters should be enough.
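
A small sketch of that point, assuming Python's standard codecs (the
helper name encodable and the sample string are invented for the example):
a name mixing Cyrillic and accented Latin letters fits in UTF-8, but in
neither KOI8-R nor 8859-1.

```python
def encodable(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Cyrillic "Privet" followed by "cafe" with an e-acute, written with
# escapes so the source itself stays plain ASCII.
mixed = "\u041f\u0440\u0438\u0432\u0435\u0442, caf\u00e9"

if __name__ == "__main__":
    for enc in ("utf-8", "koi8_r", "latin-1"):
        print(enc, encodable(mixed, enc))
```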

> and what doesn't) and permits use of filenames that fundamentally
> _aren't_ character sequences.

OK, in that case don't expect applications other than the one that
created them to do anything meaningful with them.

Bye	Pavel