tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface



On Fri, 5 Apr 2013 15:44:03 -0400 (EDT)
Mouse <mouse%Rodents-Montreal.ORG@localhost> wrote:

> > I would prefer that every filesystem be mounted with an encoding and
> > that the kernel enforce that encoding for all metadata operations.
> 
> Provided "opaque octet string" is an available encoding (basically,
> what we have now: effectively considering each octet as a single
> character and each distinct octet value as a different character),
> provided compiling out the other encodings removes their overhead, and
> provided the userland tools continue to work correctly with that
> encoding, I as an admin would be fine with this.

I think most of those requirements can be met.  It's hard to add an
optional feature that leaves no residual overhead when not used, even
if compiled out.  But I don't see why Unicode support couldn't take the
form of a user-installed callback or kernel module with a NULL
default.  
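
As a rough illustration of the callback idea, here is a sketch in Python
rather than kernel C.  Every name in it (`utf8_check`, `create_entry`, the
`check` parameter) is invented for this example and is not an existing
kernel interface; the point is only that an absent hook costs one
comparison and otherwise leaves the opaque-octet behaviour untouched.

```python
# Hypothetical sketch: an optional name-checking hook with a None default.
# None of these names correspond to a real kernel interface.


def utf8_check(name: bytes) -> bool:
    """Accept only names that are well-formed UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def create_entry(directory: dict, name: bytes, check=None) -> None:
    """Add a name to a directory, consulting the hook only if one is installed."""
    if check is not None and not check(name):
        raise ValueError("name rejected by the filesystem's encoding check")
    directory[name] = object()


d = {}
# No hook installed: any octet sequence is accepted, as today.
create_entry(d, b"\xe5rets_fotos")
# Hook installed: well-formed UTF-8 passes, anything else raises ValueError.
create_entry(d, "årets_fotos".encode("utf-8"), utf8_check)
```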

> > and they have no way to communicate that choice.
> 
> Oh, nonsense.  There are lots and lots of possible ways to communicate
> that choice, starting with hardwiring it into the code (yes, that's
> occasionally appropriate) to pushing it off to the user (environment
> variables and the like) to tagging that information onto the beginning
> of the (type-2) filename to storing a MIME charset parameter along
> with wherever the filename is stored or transmitted....

I find it useful to think of filesystems like web pages.  They're
created by one process and interpreted by another in a shared-nothing
scenario, where the producer and the consumer don't know each other.  That
is why HTTP carries a charset parameter in its Content-Type header.  

By "no way to communicate" I didn't mean nothing could be invented.  I
meant "have" in the present tense.  

> > Because the encoding is a property of the string, and because that
> > property is nowhere recorded, we're forced to make some assumption
> > about it when it is read and presented to the user.
> 
> Of course.  But you seem to consider it a catastrophe if that ever
> misfires; I consider the occasional need for human help in getting
> that right - or tolerance when it's gotten wrong - to be a totally
> acceptable price for preserving the flexibility that the
> opaque-octet-string model provides.  

You're right about how we see the trade-offs.  I don't see any value in
an "occasional need for human help", nor any benefit to the
informationless interpretation you so prize.  

Command line utilities should Just Work; they need definitive
information about the encoding of the strings they're handling, so that
as each string moves from one environment to another, it can be
*correctly* re-encoded to match. The OS needs definitive information
about encoding, so that filename globbing and string matching (with
non-canonical forms) can work. Anything else is just guesswork.  
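
To make that concrete, here is a small Python sketch of both points:
re-encoding a name when the source encoding is known, and matching a name
across canonical and non-canonical forms.  The helpers `reencode` and
`same_name` are made up for this example; only `unicodedata.normalize` and
the codec names come from the standard library.

```python
import unicodedata


def reencode(name: bytes, src: str, dst: str) -> bytes:
    """Re-encode a name when both the source and target encodings are known."""
    return name.decode(src).encode(dst)


def same_name(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


latin1_name = b"\xe5rets_fotos"              # "årets_fotos" in ISO 8859-1
utf8_name = reencode(latin1_name, "iso-8859-1", "utf-8")
print(utf8_name)                             # b'\xc3\xa5rets_fotos'

precomposed = "\u00e5rets_fotos"             # å as a single code point
decomposed = "a\u030arets_fotos"             # a followed by combining ring above
print(precomposed == decomposed)             # False
print(same_name(precomposed, decomposed))    # True
```

Without the `src` argument the first step has nothing to go on, which is
the guesswork described above.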

I don't consider it a "catastrophe" when the guess is wrong.  I
consider it stupid.  The corrective for stupidity is information, and
in this case the missing information is the encoding.  

Your arbitrary-octet-sequence use is a vanishingly small fraction of
all filenames, far smaller than the set of names that would benefit from
a known encoding.  Any such sequence can be encoded as a string. If
that's unacceptable, I could accept an escape mechanism, perhaps
setting the first byte to NUL or 0x80.  
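
One way to see that an arbitrary octet sequence can always be carried as a
string is percent-escaping, sketched below in Python.  It uses
`urllib.parse.quote_from_bytes` and `unquote_to_bytes` purely as a
stand-in; it is not the first-byte escape suggested above, and the sample
bytes are arbitrary.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

raw = b"\xff\xfe\x80abc"                 # not valid UTF-8
text = quote_from_bytes(raw, safe="")    # plain ASCII, valid in any ASCII superset
restored = unquote_to_bytes(text)        # the original octets, bit for bit

print(text)                              # %FF%FE%80abc
print(restored == raw)                   # True
```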

> > The octet sequence passed to open(2) is a filename and is an encoded
> > string.
> 
> Usually but not necessarily an encoded string - and, when it _is_ an
> encoded string, the encoding used is not available, since there's
> nowhere to pass it.

Nowhere to pass it is one problem.  No way to enforce it is another.
No way to learn what encoding the system asserts for a given filesystem
is a third.  

> To you, this appears to be a reason to mandate the use of one
> particular encoding.  

Reason to support the use and enforcement of an encoding on a
per-filesystem basis.  I suspect UTF-8 would be almost universally
adopted because of its technical advantages and the Network Effect.  

> I instead see it as a reason not to, because mandating an encoding
> that does not permit all octet sequences breaks existing practice

Existing practice is to assume.  Existing practice was born in a day
when filesystems weren't shared, when one admin group controlled all
mounts and all terminals, when one locale applied to all users and all
systems.  When, in short, it was safe to assume.  

You elevate that assumption to a sacred principle, and venerate
peculiar ways to exploit it in narrow circumstances.  I'm saying
something very simple that you agree with: that it's impossible to
interpret a string correctly without knowing its encoding.  You're just
unwilling to tolerate any change to make that possible.  

Every day, thousands of people type "ls" and pass a UTF-8 string to an
8859-1 terminal, with results only a mother could love.  Every day,
commands like "ls ?rets_fotos" fail to find "årets_fotos" because 
glob(3) doesn't know the difference between a byte and a character, and
can't be sure of the encoding of the candidate names.  Don't you think
it's about time we provided the technical facilities needed to make
these things work correctly?  
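
Both failures are easy to reproduce.  The Python sketch below uses
`fnmatch` as a stand-in for glob(3); that substitution is an assumption
made for illustration, not the C library's implementation, but the
byte-versus-character behaviour is the same in kind.

```python
import fnmatch

name = "årets_fotos"
stored = name.encode("utf-8")       # the octets as they sit in the directory

# A UTF-8 name displayed by a terminal that assumes ISO 8859-1:
print(stored.decode("iso-8859-1"))  # Ã¥rets_fotos

# Byte-wise matching: "?" consumes one byte, but "å" is two bytes in UTF-8.
print(fnmatch.fnmatchcase(stored, b"?rets_fotos"))                 # False

# Character-wise matching, possible only once the encoding is known:
print(fnmatch.fnmatchcase(stored.decode("utf-8"), "?rets_fotos"))  # True
```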

> (It _would_ be nice if what we had were closer to a truly
> encoding-blind opaque octet string model....)

I am curious what you mean by that.  You've alluded to it a couple of
times.  

--jkl

