tech-kern: fs transcoding, was Re: Unicode support in iso9660.

Subject: fs transcoding, was Re: Unicode support in iso9660.
To: None <tech-kern@NetBSD.org>
From: Chapman Flack <flack@cerias.purdue.edu>
List: tech-kern
Date: 11/20/2004 16:30:04
I've been sort of half-following this discussion....

> My feeling is that the convergence point should be "UTF-8 at the system 
> call layer", i.e. userland gives UTF-8 names to the kernel, the kernel 
> gives UTF-8 names to userland.  It would then be the responsibility of 
> the individual applications/system libraries/kernel subsystems to do 
> whatever translation to/from UTF-8 is required.

My first question would be, does the Single Unix Specification have anything
to say on this point?  It seems like a pretty fundamental, bold step to at
once declare that the byte strings passed across the system call boundary
are in a particular character encoding, and I'd hate to see that become a
fundamental split between Unix-derived systems.

A couple of other thoughts occur to me.  There are always *two* things you need
to know before you can do transcoding: the encoding you've *got*, and the
encoding you *want*.  Some filesystems give you the first part: they specify
an encoding, so you always know what's there, and you can transcode it into
a user's desired encoding as appropriate.  Other filesystems are agnostic
and you need some out-of-band information on the encoding(s) actually
used when they were populated.  You can make that issue go away for locally-
created/populated filesystems by imposing a standard encoding, but that still
doesn't help when mounting a filesystem you didn't populate.
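
As a minimal sketch of the "two things you need to know" point, the Python
below re-encodes one name given both the encoding it is in and the encoding
wanted.  The function name transcode_name and the ISO-8859-1/UTF-8 pair are
illustrative assumptions, not anything proposed in this thread.

```python
def transcode_name(raw: bytes, have: str, want: str) -> bytes:
    """Re-encode a file name from the encoding we've got to the one we want.

    Hypothetical helper for illustration only.
    """
    # Strict decode: bytes that are not valid in `have` raise UnicodeDecodeError.
    text = raw.decode(have, errors="strict")
    # Strict encode: characters `want` cannot represent raise UnicodeEncodeError.
    return text.encode(want, errors="strict")


latin1_name = "caf\u00e9".encode("iso-8859-1")   # the name as stored: b'caf\xe9'
utf8_name = transcode_name(latin1_name, "iso-8859-1", "utf-8")
```

Without the `have` argument the call cannot be made at all, which is the
problem an agnostic filesystem poses.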

The encoding(s) *used in* a filesystem (I'll say "filesystem encoding" from
now on) can be specified by the filesystem itself or by some specification
given to mount, invariant with respect to who looks at the filesystem.  The
encoding *presented to a user* (I'll say "presentation encoding") can be
different for every user and presumably would come from the per-process locale
selection or a dedicated knob that works the same way.

So a filesystem might:

1.  Have a known filesystem encoding by virtue of being one of the fs types
    that specify an encoding
2.  Have a known filesystem encoding, even though it is one of the agnostic
    fs types, because that information has been given to mount
3.  Have no known filesystem encoding because it's agnostic and nothing was
    specified to mount.

In cases 1 and 2, file names get transcoded between presentation and filesystem
encoding and back.  It's an error to try to create a name that can't be
represented in the filesystem encoding, or to discover a name existing in the
filesystem that isn't a valid sequence in that encoding (which could happen
if you mount an agnostic filesystem and claim the wrong filesystem encoding
for it).  In case 3, you have traditional behavior where names are byte
sequences, no transcoding is done (regardless of any process locale setting),
and anything goes.  Something like statfs might tell you which case you are in.
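
The three cases can be sketched roughly as follows.  This is userland Python
standing in for kernel logic; the Mount record and the to_fs/from_fs names
are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Mount:
    # Hypothetical per-mount record.  fs_encoding is set in cases 1 and 2
    # (known from the fs type or given to mount) and is None in case 3.
    fs_encoding: Optional[str] = None


def to_fs(mount: Mount, name: bytes, presentation: str) -> bytes:
    """Turn a name supplied by a process into the bytes stored on disk."""
    if mount.fs_encoding is None:
        return name  # case 3: opaque bytes, process locale ignored
    # Cases 1 and 2: strict on both sides, so an unrepresentable name
    # surfaces as an error instead of being silently mangled.
    return name.decode(presentation).encode(mount.fs_encoding)


def from_fs(mount: Mount, stored: bytes, presentation: str) -> bytes:
    """Turn a stored name into the bytes presented to a process."""
    if mount.fs_encoding is None:
        return stored  # case 3: anything goes
    # An invalid stored sequence raises UnicodeDecodeError here.
    return stored.decode(mount.fs_encoding).encode(presentation)


utf8_mount = Mount("utf-8")   # case 1 or 2
agnostic = Mount()            # case 3
on_disk = to_fs(utf8_mount, b"caf\xe9", "iso-8859-1")
untouched = to_fs(agnostic, b"\xff\xfe", "utf-8")
```

Here a Latin-1 user creating "café" on the UTF-8 mount stores the UTF-8
bytes, while the agnostic mount stores whatever bytes it is handed.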

When you claim a filesystem encoding for an agnostic filesystem, what you're
really doing is making an assertion.  Mounting an existing filesystem and
saying it's UTF-8 doesn't make it so; it just establishes that new names will
be encoded that way and existing names will be treated as encoded that way
unless invalid sequences are encountered, which would mean the assertion was
wrong.
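
One way to picture testing such an assertion is to try decoding every
existing name in the asserted encoding and collect the failures.  The sketch
below assumes a hypothetical check_assertion helper and a made-up list of
stored names.

```python
from typing import Iterable, List


def check_assertion(names: Iterable[bytes], asserted: str) -> List[bytes]:
    """Return the stored names that are not valid sequences in `asserted`.

    Hypothetical helper for illustration only.
    """
    bad = []
    for raw in names:
        try:
            raw.decode(asserted)  # strict by default
        except UnicodeDecodeError:
            bad.append(raw)       # evidence that the assertion was wrong
    return bad


# Made-up directory contents: one ASCII, one UTF-8, one Latin-1 name.
existing = [b"README", b"caf\xc3\xa9", b"caf\xe9"]
not_utf8 = check_assertion(existing, "utf-8")
not_latin1 = check_assertion(existing, "iso-8859-1")
```

Asserting UTF-8 catches the Latin-1 name, but asserting ISO-8859-1 catches
nothing, because every byte sequence decodes in that encoding.  An empty
result therefore does not prove the assertion right.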

It occurs to me that maybe the way to make such an assertion is not with a
mount option when mounting the fs, but as a kind of overlay mount: you mount
/dev/foo as some agnostic FFS filesystem, and then you mount a UTF-8 assertion
over it.  What that buys is a way to handle a legacy filesystem where different
encodings have been used in different parts of the tree: just mount the right
assertions in the right places.  You'd probably rarely want to create new
filesystems that way (just make it UTF-8 all over and be done with it), but
if you have such a multi-coded filesystem to deal with, you can.  With a
filesystem like NTFS, which specifies its own encoding, you just don't need
to mount any assertion over it.
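
To illustrate the overlay idea, here is a rough sketch in which the deepest
assertion mounted at or above a path decides its filesystem encoding.  The
encoding_for function, the table of assertions, and the paths in it are all
hypothetical.

```python
import posixpath
from typing import Dict, Optional


def encoding_for(path: str, assertions: Dict[str, str]) -> Optional[str]:
    """Return the encoding asserted closest above `path`, or None (case 3).

    Hypothetical helper for illustration only.
    """
    current = posixpath.normpath(path)
    while True:
        if current in assertions:
            return assertions[current]
        parent = posixpath.dirname(current)
        if parent == current or not parent:
            return None  # reached the top without finding any assertion
        current = parent


# Made-up legacy tree: Latin-1 in general, EUC-JP under one subtree.
assertions = {"/legacy": "iso-8859-1", "/legacy/jp": "euc-jp"}
deep = encoding_for("/legacy/jp/docs/file", assertions)
shallow = encoding_for("/legacy/notes", assertions)
outside = encoding_for("/home/chap", assertions)
```

Paths under /legacy/jp pick up the inner assertion, the rest of /legacy
picks up the outer one, and anything outside both stays agnostic.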

Any thoughts?

-Chap