Subject: Re: Unicode support in iso9660.
To: None <tech-kern@NetBSD.org>
From: Valeriy E. Ushakov <uwe@ptc.spbu.ru>
List: tech-kern
Date: 11/18/2004 23:42:48
On Thu, Nov 18, 2004 at 23:37:15 +0900, SODA Noriyuki wrote:

> >>>>> On Thu, 18 Nov 2004 17:23:53 +0300,
> 	Vsevolod Stakhov <cebka@jet.msk.su> said:
> 
> > Yes, this should be the most simple and convenient way. But we can
> > use unicode to represent file names internally. Then if we specify
> > translation flag in mount options we translate names from unicode to
> > local charset for userspace.
> 
> Unfortunately, that may break some existing installations which have
> multiple codesets as pathnames in a single FFS.
> You can say such filesystem is broken (and I may agree ;-)), but
> I think it isn't an option to break existing installations.
> 
> Apparently, we need a converter which converts all codesets to
> unicode (and vice versa) in our kernel (for NFSv4, NTFS, VFAT,
> etc...), but using single codeset as an intermediate codeset like
> above cannot solve the problem.
> 
> But an extended iconv-like interface can...

Agreed.  That's why I've been talking about transcoding, not encoding.
No intermediate representation, just a de/mangling scheme (not even
limited to iconv-like transcoding - rot13, upcasing, whatever, those
latter are just less useful).
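
To make the "de/mangling scheme" idea concrete, here is a minimal
userland sketch in Python (not kernel code).  The names Transcoder,
iconv_like and rot13 are hypothetical, as is the choice of utf-16-be as
the on-disk charset; the only point is that a mount-time transcoder is
a pair of bytes-to-bytes functions, and the file system itself never
needs to know what they do.

```python
import string
from typing import Callable, NamedTuple

# A mangler maps one byte string to another; nothing more is assumed.
Mangler = Callable[[bytes], bytes]


class Transcoder(NamedTuple):
    """Hypothetical pair of name-mangling functions chosen at mount time."""
    to_user: Mangler  # on-disk name -> name shown to userspace
    to_disk: Mangler  # name from userspace -> on-disk name


def iconv_like(disk_charset: str, user_charset: str) -> Transcoder:
    """Charset-to-charset transcoder, in the spirit of iconv."""
    return Transcoder(
        to_user=lambda b: b.decode(disk_charset).encode(user_charset),
        to_disk=lambda b: b.decode(user_charset).encode(disk_charset),
    )


# rot13 over ASCII letters: a perfectly valid, if less useful, mangler.
_LOWER = string.ascii_lowercase
_UPPER = string.ascii_uppercase
_ROT13 = bytes.maketrans(
    (_LOWER + _UPPER).encode("ascii"),
    (_LOWER[13:] + _LOWER[:13] + _UPPER[13:] + _UPPER[:13]).encode("ascii"),
)
rot13 = Transcoder(
    to_user=lambda b: b.translate(_ROT13),
    to_disk=lambda b: b.translate(_ROT13),
)

# Assumed example: names stored as utf-16-be, shown to a koi8-r user.
cd = iconv_like("utf_16_be", "koi8_r")
shown = cd.to_user("файл".encode("utf_16_be"))
```

Either transcoder can be plugged in at the same place; the caller only
ever sees bytes in and bytes out.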

I also think that the kernel should be as agnostic about encodings as
possible.  An ffs filename is a stream of bytes without
slashes and nuls, full stop.  ffs probably should not care if it's
utf-8 (originally called "file system safe utf" for a reason),
latin-1, or koi8-r.
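
As an illustration of how little the "stream of bytes" rule asks for,
here is a small Python sketch.  The function name is_valid_name is
made up for this example, the non-empty check is my own addition, and
real file systems also impose a length limit that is not modelled
here.

```python
def is_valid_name(name: bytes) -> bool:
    """True if `name` could be a single pathname component.

    Only the rule from the text is checked: no slashes, no nuls.
    The non-empty check is an extra assumption of this sketch.
    """
    return len(name) > 0 and b"/" not in name and b"\x00" not in name


# The same check passes whatever charset produced the bytes.
word = "файл"  # illustrative Russian file name
results = {enc: is_valid_name(word.encode(enc))
           for enc in ("utf_8", "koi8_r", "cp1251")}
latin1_ok = is_valid_name("café".encode("latin_1"))
```

Nothing in the check looks at which charset the bytes came from, which
is exactly the agnosticism argued for above.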

When there's a file in the same ffs filesystem with a name in latin1 -
ffs doesn't really interpret the sequence of bytes in any way.  I
still can access the file.  Even if I use koi8-r locale and the file
name is in cp1251, I still can do useful things with it (ls | iconv;
or whatever).
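
What "ls | iconv" buys you can be sketched in a few lines of Python.
The file name and the helper reinterpret are invented for the example;
the point is only that the bytes stay intact on disk, so a name that
looks like garbage under the wrong charset can always be recovered in
userland.

```python
def reinterpret(name: bytes, actual: str, wanted: str) -> bytes:
    """Re-encode `name` from its actual charset into the wanted one.

    Characters missing from the target charset become '?'.
    """
    return name.decode(actual).encode(wanted, errors="replace")


# A name written by someone using cp1251 (hypothetical example).
on_disk = "отчёт.txt".encode("cp1251")

# What a koi8-r user sees if the bytes are taken at face value:
# still a usable name, just not a readable one.
garbled = on_disk.decode("koi8_r")

# What the same user sees after piping the name through a converter.
fixed = reinterpret(on_disk, "cp1251", "koi8_r")
```

The stored bytes are never touched; only the interpretation changes,
and that happens entirely outside the file system.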

But currently I cannot do anything like this with a CD with Russian
file names.  I haven't tried this in a while, so I specifically checked
this with my more or less -current laptop:

    $ ls /cdrom
    ??????   ????????????  ????     ??????

Yeah, that's like very useful... :(

Specifying transcoding for ISO9660 to let the above ls produce
something meaningful might seem to mean that one charset (target of
the transcoding) will be "more equal" than another, but

1) we already can have the same situation in ffs

    E.g. I mount the CD, copy files to my ffs HOME (using my charset).
    For a hypothetical person that uses another charset on the same
    machine there's *no* practical difference between

        /cdrom/some-koi8-r-name

    and

        /home/uwe/some-koi8-r-name

    even though one of them is a "native" stream of koi8-r bytes and
    the other is the result of transcoding during mount.


2) on most machines all the users use the same encoding anyway.

    If you administer a big machine with lots of users that use
    different languages - force them to use utf-8 as that one encoding :)


My point is - providing transcoding option for ISO9660 mount doesn't
worsen the charset situation in any way, as compared to what we have
with ffs now.  Thus I don't think that providing this transcoding
service for ISO9660 should be presented as being dependent on finding
a silver bullet that would solve the general problem of peaceful
coexistence of various encodings on one system.

SY, Uwe
-- 
uwe@ptc.spbu.ru                         |       Zu Grunde kommen
http://www.ptc.spbu.ru/~uwe/            |       Ist zu Grunde gehen