Subject: Re: Unicode support in iso9660.
To: None <tech-kern@NetBSD.org>
From: der Mouse <mouse@Rodents.Montreal.QC.CA>
List: tech-kern
Date: 11/21/2004 18:10:33
>> eg, in 8859-1, any string containing an octet in the 0x80-0x9f range
>> is invalid,
> Well, 0x80-0x9f _are_ valid characters, just not printable.

Oh!  Okay, I've been confused about 8859.  What are their meanings?
Did ISO adopt the ANSI X3.41/X3.64 meanings (and if so what did they do
with 80, 81, 82, 83, 98, 99, and 9a)?

> As for the file length overflow case, while possible, it's rare
> enough to not be worth considering IMO.

We obviously have drastically different ideas of how a filesystem
should be specified.  "A pathname component is limited to 255 bytes" is
reasonable to me.  "A pathname component is limited to some number of
characters between 63 and 255 depending on exactly what the characters
are and possibly other factors such as the current locale" is not.
(Quite aside from other aspects of the shift from bytes to characters.)
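
To make the 63-to-255 spread concrete, here is a minimal sketch.  It
assumes the on-disk limit stays at 255 bytes and that names are encoded
as UTF-8 (one to four bytes per character); max_chars is a made-up
helper for illustration, not anything from a real kernel.

```python
NAME_MAX = 255  # byte limit per pathname component (assumed, as above)

def max_chars(ch: str, limit: int = NAME_MAX) -> int:
    """Return how many copies of ch fit in `limit` bytes of UTF-8."""
    return limit // len(ch.encode("utf-8"))

# One sample character for each UTF-8 encoded length (1 to 4 bytes).
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    size = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X}: {size} byte(s) each, {max_chars(ch)} fit")
```

The same 255-byte budget therefore holds anywhere from 63 to 255
characters, depending only on which characters are used.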

>> [...breaks programs that create files...]
> That's unfortunate, but arguably inevitable.

I don't see it as inevitable.  Nothing compels us to switch the
filesystem API/ABI from octet sequences to character sequences.
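
A rough illustration of what such a switch would cost, assuming the
character-based interface insisted on well-formed UTF-8 (valid_utf8 is
a hypothetical check, not an existing kernel routine): a name that is
a perfectly good octet sequence today would be refused.

```python
def valid_utf8(name: bytes) -> bool:
    """Return True if the octet sequence decodes as well-formed UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# The same four-character name, encoded two different ways.
latin1_name = "caf\u00e9".encode("iso-8859-1")  # octets 63 61 66 e9
utf8_name = "caf\u00e9".encode("utf-8")         # octets 63 61 66 c3 a9

print(valid_utf8(latin1_name))
print(valid_utf8(utf8_name))
```

Under an octet-sequence interface both names are simply distinct
strings of bytes and both can be created.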

> If file-system does assume some internal encoding, there pretty much
> is no chance to escape that and non-conforming file names must be
> refused.

Yes.  But that is no reason to impose a character-centric view of
filenames on filesystems that, like FFS, are fundamentally
charset-agnostic.  (Indeed, FFS as an on-disk format doesn't even
impose the 0x2f and 0x00 restrictions; given a suitable A[BP]I, there's
no reason FFS couldn't permit all 256 possible octet values in filename
components.  Those are restrictions of the Unix A[BP]I, not of FFS.
The only on-disk restriction is the dot and dot-dot special names.)
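
As a sketch of the distinction, assuming the description above is
accurate, the two sets of rules could be written as separate
predicates.  Both function names are hypothetical, and the 255-byte
limit is taken from earlier in this message.

```python
NAME_MAX = 255  # assumed byte limit per component

def unix_api_allows(name: bytes) -> bool:
    """Creatable through the Unix interface: no 0x2f, no 0x00."""
    if not 0 < len(name) <= NAME_MAX:
        return False
    if b"/" in name or b"\x00" in name:
        return False
    return name not in (b".", b"..")

def on_disk_allows(name: bytes) -> bool:
    """Creatable per the on-disk format alone: only . and .. reserved."""
    return 0 < len(name) <= NAME_MAX and name not in (b".", b"..")

for name in (b"\xff\xfe\x80", b"a/b", b"a\x00b", b".."):
    print(name, unix_api_allows(name), on_disk_allows(name))
```

Neither predicate looks at what characters the octets might represent;
the only difference is the two octet values the interface reserves.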

If you're arguing for dragging all filesystems down to the restrictions
imposed by any of them, that means getting rid of case-sensitivity,
multiple dots in names, and a whole passel of other things.

> We only get away this with msdosfs since we pretend all file names
> are in iso-8859-1,

...hm?  How is MS-DOS defined with respect to charset?  I'd always
thought it was entirely encoding-agnostic (while the usual interfaces
to it treat 0x2e as a separator between the "name" and "extension"
portions, with no way to escape 0x2e octets in either part, as far as I
can see that is no necessary part of the on-disk format).
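
For illustration, a sketch of how a short name is usually described as
being stored in a FAT directory entry: an 11-byte field, eight bytes of
name and three of extension, space-padded, with the dot implied rather
than stored.  The helper name is made up, and the sketch ignores
upper-casing and any other rules real implementations apply.

```python
def to_83_field(name: bytes) -> bytes:
    """Pack b'NAME.EXT' into an 11-byte space-padded field, dot omitted."""
    base, _, ext = name.partition(b".")
    if not base or len(base) > 8 or len(ext) > 3 or b"." in ext:
        raise ValueError("name does not fit the 8.3 layout")
    # bytes.ljust pads with ASCII spaces (0x20) by default.
    return base.ljust(8) + ext.ljust(3)

field = to_83_field(b"README.TXT")
print(field, len(field))
```

Since the 0x2e octet never reaches the disk in this layout, treating it
as a separator is a property of the interface doing the packing.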

> Changing msdosfs to use UTF-8 would break msdosfs for people happily
> using ISO-8859-1, however.

So?  It doesn't seem to bother you to break FFS for people using it for
other than UTF-8; why is this any different?

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B