Subject: Re: fs transcoding, was Re: Unicode support in iso9660.
To: None <tech-kern@NetBSD.org>
From: Chapman Flack <flack@cerias.purdue.edu>
List: tech-kern
Date: 11/22/2004 13:56:00
> Thus Jason Thorpe followed by Pavel Cahyna:
> 
> > Honestly, I think we need to reserve a chunk of space in the FFS
> > superblock to specify the encoding that the file system is using.
> 
> What would be this good for?

Not enough, I'm afraid; there will still be legacy filesystems to be mounted
that won't have this superblock information, and some of them may not be using
just a single encoding throughout.  The danger of a partial solution is that,
once it is done, people move on to something else in the belief that "enough"
has been done to get by.

Having a filesystem that doesn't specify a character encoding doesn't mean
that no character encoding has been used on that file system; it only means
that some human knows what encodings have or haven't been used on it, and the
file system doesn't.  For those file systems, we don't have the "what have we
got" side of the equation until there is a way for that human to provide that
assertion to the system.  My idea the other day about a sort of overlay mount
is only one possible way to do that.  The feature I hope any other solution
would share with it is that the admin is free to assert, at any point in the
tree where it matters, what encoding is or isn't in use.
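
To make "assert at any point in the tree" concrete, here is a minimal sketch
in Java of how such assertions might be looked up.  Nothing in it is an
existing NetBSD interface: the class name, the method names and the example
paths are all made up for illustration, and a real mechanism would live in
the kernel or in mount options rather than in a table like this.  The only
point is the lookup rule: the deepest enclosing subtree with an assertion
wins, and a path with no assertion stays "unknown" instead of being guessed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not an existing NetBSD or Java API.
class EncodingAssertions {
    // Subtree root -> encoding name the admin has asserted for that subtree.
    private final Map<String, String> asserted = new HashMap<>();

    // Record the admin's assertion for everything at or below 'subtree'.
    void assertEncoding(String subtree, String encoding) {
        asserted.put(normalize(subtree), encoding);
    }

    // Walk from the path up toward "/" one component at a time and return
    // the first assertion found, or null if nothing has been asserted.
    String encodingFor(String path) {
        String p = normalize(path);
        while (true) {
            String enc = asserted.get(p);
            if (enc != null) {
                return enc;
            }
            if (p.equals("/")) {
                return null;            // no assertion anywhere above
            }
            int slash = p.lastIndexOf('/');
            p = (slash <= 0) ? "/" : p.substring(0, slash);
        }
    }

    // Drop trailing slashes so "/mnt/old/" and "/mnt/old" are the same key.
    private static String normalize(String path) {
        while (path.length() > 1 && path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        return path;
    }

    public static void main(String[] args) {
        EncodingAssertions t = new EncodingAssertions();
        t.assertEncoding("/mnt/old", "ISO-8859-1");
        t.assertEncoding("/mnt/old/home/pavel", "ISO-8859-2");

        System.out.println(t.encodingFor("/mnt/old/home/pavel/notes.txt")); // ISO-8859-2
        System.out.println(t.encodingFor("/mnt/old/etc/passwd"));           // ISO-8859-1
        System.out.println(t.encodingFor("/usr/share"));                    // null
    }
}
```

Returning null rather than a default is deliberate in this sketch: where
nobody has asserted anything, the system still does not know, and the caller
has to decide what to do about that.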

There's nothing /wrong/ with extending FFS to record that information; once
assertions are supported, that's just a convenience so you can send someone
an FFS filesystem to mount without also sending instructions for the right
commands to use when mounting it.

I'd still be really interested in seeing what direction the Single UNIX Spec
is taking on these issues; I haven't had time to really look yet.

Partly related: parts of this discussion are verging on something I
wrote a few years ago concerning Java; I had been noticing the kinds of
errors Java programmers fall into when they blur or misunderstand the
distinction between bytes, chars/Strings, Streams and Readers/Writers in
that language.  I introduced a third category in addition to encoding-aware
code and genuinely agnostic code: "provincial" code, which is full of
implicit assumptions that entail a character encoding but isn't aware of
them, and acts as if the whole world shares its assumptions just because it
doesn't know better.  I then tried to motivate a distinct programming style,
provincial-safe, to be used by deliberate choice in programs that must never
break even when surrounded by provincial code, and to explain how to decide
when to use that style rather than another, such as agnostic or
encoding-aware.  I never expanded it into a paper, but what I did write on
the idea is here:

http://www.gjt.org/javadoc/org/gjt/cuspy/ByteString.html

Maybe it will provide some useful language for thinking the problems through.
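
For anyone who doesn't want to follow the link, here is a rough Java sketch
of the three categories applied to a file name.  It is not the ByteString
class from that page; the class and method names below are invented for this
message, and the file name is an assumed example: "caf" followed by the
single byte 0xE9, which is how an ISO-8859-1 system would store the last
letter.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration; not the ByteString class referenced above.
class EncodingStyles {
    // Assumed example name as raw bytes: 'c' 'a' 'f' 0xE9 (ISO-8859-1 e-acute).
    static final byte[] NAME = { 'c', 'a', 'f', (byte) 0xE9 };

    // Provincial: decodes with whatever the platform default charset is,
    // without knowing or saying that it has made that choice.
    static String provincial(byte[] raw) {
        return new String(raw);
    }

    // Encoding-aware: the caller names the charset, and input that is not
    // valid in it is reported instead of being silently replaced.
    static String aware(byte[] raw, Charset cs) throws CharacterCodingException {
        return cs.newDecoder()
                 .onMalformedInput(CodingErrorAction.REPORT)
                 .onUnmappableCharacter(CodingErrorAction.REPORT)
                 .decode(ByteBuffer.wrap(raw))
                 .toString();
    }

    // Agnostic: names are opaque byte strings and are only compared as bytes.
    static boolean sameName(byte[] a, byte[] b) {
        return Arrays.equals(a, b);
    }

    // Does a lenient decode followed by an encode give back the same bytes?
    static boolean survivesRoundTrip(byte[] raw, Charset cs) {
        return Arrays.equals(raw, new String(raw, cs).getBytes(cs));
    }

    public static void main(String[] args) {
        // Decoding under the encoding actually used loses nothing.
        System.out.println(survivesRoundTrip(NAME, StandardCharsets.ISO_8859_1)); // true
        // Decoding under a wrong assumption turns 0xE9 into U+FFFD, so the
        // name that comes back out is not the name that went in.
        System.out.println(survivesRoundTrip(NAME, StandardCharsets.UTF_8));      // false
        try {
            aware(NAME, StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            System.out.println("strict decode refused the name");
        }
    }
}
```

The second case is the one that matters for this thread: code that pushes
file names through a String under an assumed encoding can quietly hand back
a different name than it was given, which is exactly the breakage a file
system must not introduce.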

-Chap