Re: A draft for a multibyte and multi-codepoint C string interface

To: tech-userlevel%netbsd.org@localhost
Subject: Re: A draft for a multibyte and multi-codepoint C string interface
From: "James K. Lowden" <jklowden%schemamania.org@localhost>
Date: Fri, 5 Apr 2013 14:37:10 -0400

On Tue, 2 Apr 2013 20:27:31 -0400 (EDT)
Mouse <mouse%Rodents-Montreal.ORG@localhost> wrote:

> > Once you accept the idea the filenames are strings and strings are
> > encoded, you no longer need defend the airless proposition that
> > they're uninterpretable octet sequences.
> 
> You are still being sloppy with "filename".  There are at least two
> things that can reasonably be called filenames.  One of them (let's
> call it type 1) is a conceptual thing, occurring in human minds, and
> usually either is a character string or is implemented as a character
> string, depending on which way you prefer to think of it.  The other
> (type 2) is an octet sequence such as is passed to open(2) and related
> calls.  Type 2 filenames are used to implement type 1 filenames, but
> can also be used for other things.

The octet sequence passed to open(2) is a filename and is an encoded
string.  It may be convenient from a kernel-programming perspective to
think of it as a simple opaque array, but it was provided from
userspace as a string.  It doesn't stop being a string -- or cease
being encoded -- merely because it transits the syscall boundary.  

> The most common encoding I use for filenames is ASCII (as I suspect is
> true for most people who live and work primarily in English) - those
> filenames can, of course, also be viewed as being in any superset of
> ASCII: 8859-*, UTF-8, KOI-8, what have you.  When I have
> non-ASCII-encodable type-1 filenames, I most often use 8859-1 for
> them. But there's no reason I couldn't use KOI-8, or 8859-7, or
> 8859-15, or, yes, even UTF-8, if I found it convenient.  

Unfortunately, though, you'll never find it convenient.  Because the
encoding scheme is not recorded with the filename or filesystem, it's
impossible to know how the string is encoded.  

This is the crucial point, Mouse: the name was encoded at creation
time.  To be interpreted correctly later, the same encoding must be
applied.  Decoding cannot be done correctly per the user's locale,
except by happy accident.  
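
A minimal sketch of that point, in Python for brevity. The sample name and the list of codecs are assumptions chosen for illustration; the bytes are what a creator working in ISO 8859-1 would have handed to open(2).

```python
# Illustration only: one stored octet sequence, read back under several
# assumed encodings.  The sample name ("cafe" with e-acute) is hypothetical.
stored = "caf\u00e9".encode("iso-8859-1")   # octets 63 61 66 e9


def interpret(raw: bytes, encoding: str) -> str:
    """Decode raw under encoding, or report that it cannot be decoded."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return "<not valid " + encoding + ">"


if __name__ == "__main__":
    # Each reader applies the encoding of its own locale to the same bytes.
    for enc in ("iso-8859-1", "iso-8859-7", "koi8-r", "utf-8"):
        print(enc.ljust(10), interpret(stored, enc))
```

Only the first reader recovers the name the creator typed.  The 8859-7 and KOI8-R readers each see a different final letter, and the UTF-8 reader cannot decode the name at all.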

Because the encoding is a property of the string, and because that
property is nowhere recorded, we're forced to make some assumption
about it when it is read and presented to the user.  

> it requires that each directory's names must all use the same
> encoding, at least unless you extend all available filesystems to
> carry encoding tags with the directory entries' names

I think by "directory's names" you mean the name of the directory, not
the names in the directory?  I certainly don't make that distinction:
we need to know the encoding of every filename.  

It is not absolutely necessary that every filename use the same
encoding.  It does greatly simplify things, though, for the very reason
you allude to: existing filesystems do not record the encoding.  

> In my opinion, giving upper layers the freedom to use whatever
> encoding they find convenient wins.  

There is no freedom in chaos.  

I think you'd agree that the kernel's job is to arbitrate resource use
and facilitate interprocess communication.  In the case of filenames,
you suggest processes should "use whatever encoding they find
convenient", but the convenient thing is to share the choice of
encoding (writer to reader), and they have no way to communicate that
choice. That is why the kernel must be involved. They also have no way
to persist that choice in the very filesystem where the choice is
manifested. That is why the filesystem must be involved, unless we are
forever to rely on administrative fiat.  

I would prefer that every filesystem be mounted with an encoding and
that the kernel enforce that encoding for all metadata operations.
Better -- and more convenient -- would be an extension to the
filesystem denoting the encoding in use.  I'm not sure where that
information should be kept or what utility would affect it.  I would
guess disklabel(8).  
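
As a rough illustration of what "enforce that encoding" could mean, here is a userspace sketch in Python.  Nothing in it is an existing interface: the per-mount table, the function name check_name, and the choice of EILSEQ as the error are all assumptions made for this sketch.

```python
import errno

# Hypothetical per-mount setting; no existing NetBSD interface records this.
MOUNT_ENCODING = {"/": "utf-8", "/legacy": "iso-8859-1"}


def check_name(mount_point: str, name: bytes) -> None:
    """Reject a new name that is not valid in the mount's declared encoding."""
    encoding = MOUNT_ENCODING[mount_point]
    try:
        name.decode(encoding)
    except UnicodeDecodeError:
        # EILSEQ is an assumed choice of errno for this sketch.
        raise OSError(errno.EILSEQ, "name not valid " + encoding, name)
```

Note that a single-byte encoding such as 8859-1 accepts every octet sequence, so a check like this only has teeth for an encoding such as UTF-8.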

Tagging each filename with an encoding would be
counterproductive.  If we didn't have Unicode, or if Unicode had 
gross defects, then that might be a reasonable choice.  But we have
Unicode and, as in so many things, 1 is simpler than N.  

If you still think such a regime is overly restrictive, please describe
any situation in which a filename saved by user 1 with encoding A and
read by user 2 with encoding B serves either party's freedom or
convenience.  

> If you want to bloat your kernel with Unicode hair and cripple your
> users' and programs' ability to use whatever encodings they find
> appropriate for their tasks

I interpret "bloat" and "cripple" as content-free pejoratives.  "Hair"
I generally favor.  ;-)

You might be right that the kernel need not be involved.  In any case I
doubt it's necessary to embed an entire Unicode library in the kernel
to support and enforce normalized UTF-8 filenames.  

Normalization could be done in libc, e.g. in open(2), before invoking the
kernel.  The kernel might be able to detect extant invalid names and
prevent the creation of new ones just by looking for combining
characters.  Perhaps rename(2) would *not* normalize the "from" name,
to permit corrections. There's lots of room for good engineering
choices once the fundamental need is recognized.  
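
The two halves of that idea can be sketched in Python using the standard unicodedata module.  The function names are hypothetical, and NFC is an assumption: the text above says only "normalized".  The combining-character test is a heuristic, not a full normalization check, because some sequences have no precomposed form and keep their combining marks even under NFC.

```python
import unicodedata


def normalize_name(name: bytes) -> bytes:
    """Return the NFC form of a UTF-8 name, as a libc wrapper might."""
    return unicodedata.normalize("NFC", name.decode("utf-8")).encode("utf-8")


def has_combining(name: bytes) -> bool:
    """Cheap check: does the UTF-8 name contain any combining character?"""
    return any(unicodedata.combining(ch) for ch in name.decode("utf-8"))
```

For example, "e" followed by U+0301 (combining acute) normalizes to the single code point U+00E9, after which has_combining reports nothing.  A "q" followed by U+0301 is unchanged by NFC and would still be flagged, which is where the simple check and true normalization part ways.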

What we cannot do is continue to march around with our fingers in our
ears shouting that filenames aren't names, that they have no encoding
because they bear no encoding.  

--jkl

