tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface



> The octet sequence passed to open(2) is a filename and is an encoded
> string.

Usually but not necessarily an encoded string - and, when it _is_ an
encoded string, the encoding used is not available, since there's
nowhere to pass it.

To you, this appears to be a reason to mandate the use of one
particular encoding.  I instead see it as a reason not to, because
mandating an encoding that does not permit all octet sequences breaks
existing practice, and decreeing that a certain arcane and complicated
set of equivalence rules applies breaks existing practice further and
removes userland's ability to use whatever encoding it finds
appropriate to its task.  It amounts to decreeing that your preferred
encoding is suitable for all tasks - a manifestly ludicrous statement
when put that baldly, but I can't really put any other construction on
it.

>> The most common encoding I use for filenames is ASCII [...].  When I
>> have non-ASCII-encodable type-1 filenames, I most often use 8859-1
>> for them.  But there's no reason I couldn't use KOI-8, or 8859-7, or
>> 8859-15, or, yes, even UTF-8, if I found it convenient.
> Unfortunately, though, you'll never find it convenient.

That..."turns out not to be the case".  I have found other encodings
and/or no encoding (i.e., type-2 filenames which are _not_,
fundamentally, encoded character strings) convenient in the past, and I
see no reason to think I never will again.  Indeed, there is one case I
have in live use right now which inverts your mindset: rather than the
filename being, conceptually, binary data which encodes a meaningful
character string, the filenames are octet strings which encode binary
data.  (The "encoding" is close to trivial, but is necessitated by one
of the ways in which current filesystem names aren't quite opaque octet
strings.)  When those names _are_ treated as character strings, they
come out meaningless, because the meaningful view of their information
content is the binary data on the other end of that chain of
encodings.
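
To give the flavour - this is a sketch, not the code I actually use,
and the choice of '%' as the escape byte is arbitrary - the "encoding"
need be nothing more than escaping the two octets a filesystem name
can't contain:

    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch only.  Escape the octets a filesystem name can't contain
       (NUL and '/'), plus the escape byte itself. */
    char *encode_name(const unsigned char *data, size_t len)
    {
        char *out = malloc(3 * len + 1);    /* worst case: all escaped */
        char *p = out;
        size_t i;

        if (out == NULL)
            return NULL;
        for (i = 0; i < len; i++) {
            unsigned char c = data[i];
            if (c == '\0' || c == '/' || c == '%') {
                sprintf(p, "%%%02x", c);
                p += 3;
            } else {
                *p++ = c;
            }
        }
        *p = '\0';
        return out;
    }

Decoding is the obvious inverse; nothing about it involves
character-set semantics at all.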

> Because the encoding scheme is not recorded with the filename or
> filesystem, it's impossible to know how the string is encoded.

No.  It's impossible to know _in an automated way_ how the string is
encoded.  That is not always a problem.

Unless, of course, you've managed to convince yourself that choice is
bad, that all filenames (type 2) must be encoded character strings, and
that the same encoding must be used for all of them.

> [...]: the name was encoded at creation time.  To be interpreted
> correctly later, the same encoding must be applied.

True as far as it goes.  But most processing does not require
interpreting it at all.

> Because the encoding is a property of the string, and because that
> property is nowhere recorded, we're forced to make some assumption
> about it when it is read and presented to the user.

Of course.  But you seem to consider it a catastrophe if that ever
misfires; I consider the occasional need for human help in getting that
right - or tolerance when it's gotten wrong - to be a totally
acceptable price for preserving the flexibility that the
opaque-octet-string model provides.  (It _would_ be nice if what we had
were closer to a truly encoding-blind opaque octet string model....)

>> it requires that each directory's names must all use the same
>> encoding, at least unless you extend all available filesystems to
>> carry encoding tags with the directory entries' names
> I think by "directory's names" you mean the name of the directory,
> not the names in the directory?

No, I meant "the names appearing in the directory".

>> In my opinion, giving upper layers the freedom to use whatever
>> encoding they find convenient wins.
> There is no freedom in chaos.

That's actually not true; chaos is little but freedom.  But I also see
no reason why it has to produce chaos; after all, that's what we have
now, and have had since at least the '80s, and if this is chaos it's
the best behaved - and most useful - chaos I've ever seen.

> I think you'd agree that the kernel's job is to arbitrate resource
> use and facilitate interprocess communication.  In the case of
> filenames, you suggest processes should "use whatever encoding they
> find convenient",

Indeed.  As they do now.

> but the convenient thing is to share encoding choice (writer to
> reader),

In some cases.  Even more convenient is not to have occasion to care
about encoding, which is true of a lot of processing...unless you break
that model with the notion that some octet sequences are equivalent to
other octet sequences because one particular encoding considers them
that way, even though that might not be the encoding in use for those
names.  Case-folding is probably the commonest such botch at present,
and the troubles it brings are exactly the kind of troubles Unicode
normalization will bring - only not as bad, because case folding
doesn't involve sequences of different lengths being equivalent.
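
To make the length point concrete (a toy example, nothing more): the
two common UTF-8 spellings of e-acute are distinct octet sequences of
different lengths, and only a normalization-aware layer would call
them the same name:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* U+00E9 precomposed (NFC): two octets */
        const char *nfc = "\xc3\xa9";
        /* 'e' followed by U+0301 combining acute (NFD): three octets */
        const char *nfd = "e\xcc\x81";

        /* As octet strings these are simply two different names. */
        printf("lengths %zu vs %zu, strcmp %d\n",
            strlen(nfc), strlen(nfd), strcmp(nfc, nfd));
        return 0;
    }

This prints lengths 2 and 3 and a nonzero strcmp; traditional case
folding at least never changes the length of anything.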

> and they have no way to communicate that choice.

Oh, nonsense.  There are lots and lots of possible ways to communicate
that choice, ranging from hardwiring it into the code (yes, that's
occasionally appropriate), to pushing it off to the user (environment
variables and the like), to tagging that information onto the beginning
of the (type-2) filename, to storing a MIME charset parameter alongside
the filename wherever it is stored or transmitted....
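
The environment-variable route, for one, already exists in perfectly
standard form.  A sketch (MYAPP_FILENAME_ENCODING is of course invented
for the example):

    #include <langinfo.h>
    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *enc;

        /* Let the user's environment (LC_CTYPE, LANG, ...) say what
           encoding to assume for the names we display or create. */
        setlocale(LC_CTYPE, "");
        enc = nl_langinfo(CODESET);

        /* Or an application-specific override, if that's preferred. */
        if (getenv("MYAPP_FILENAME_ENCODING") != NULL)
            enc = getenv("MYAPP_FILENAME_ENCODING");

        printf("assuming filenames are %s\n", enc);
        return 0;
    }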

> That is why the kernel must be involved.  They also have no way to
> persist that choice in the very filesystem where the choice is
> manifested.  That is why the filesystem must be involved, unless we
> are forever to rely on administrative fiat.

So, you're proposing to solve this by fiat, only your fiat is imposed
by NetBSD rather than by individual sysadmins, and it imposes the same
encoding on all sites.  I have trouble seeing how this is better than
whatever you had in mind when you wrote of "administrative fiat".

> I would prefer that every filesystem be mounted with an encoding and
> that the kernel enforce that encoding for all metadata operations.

Provided "opaque octet string" is an available encoding (basically,
what we have now: effectively considering each octet as a single
character and each distinct octet value as a different character),
provided compiling out the other encodings removes their overhead, and
provided the userland tools continue to work correctly with that
encoding, I as an admin would be fine with this.

> Tagging each filename with an encoding would be counterproductive.

...for some applications.  For others, it's exactly what you want.

Back to being Procrustean, imposing a one-encoding-fits-all-tasks
attitude on everybody.

> If we didn't have Unicode, or if Unicode had gross defects,

It has defects.  For some purposes, some of them are gross defects.
You might wish such purposes didn't exist....

> Normalization could be done in libc e.g. in open(2) before invoking
> the kernel.

I'm having trouble imagining doing that.  Finding the space in which to
generate the normalized sequence strikes me as difficult.  Probably
solvable, though.
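
If I squint, I can imagine something along these lines, with
normalize_nfc() standing in for whatever normalization routine such a
libc would have to grow (no such routine exists today); the space
question is the malloc, or a PATH_MAX stack buffer if you prefer that
tradeoff:

    #include <fcntl.h>
    #include <stdarg.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Hypothetical: returns a freshly malloc()ed NFC form of path,
       or NULL on failure. */
    extern char *normalize_nfc(const char *path);

    int open(const char *path, int flags, ...)
    {
        char *npath;
        mode_t mode = 0;
        int fd;

        if (flags & O_CREAT) {
            va_list ap;
            va_start(ap, flags);
            mode = (mode_t)va_arg(ap, int);
            va_end(ap);
        }

        npath = normalize_nfc(path);    /* where does this space live? */
        if (npath == NULL)              /* punt: pass the name through */
            return syscall(SYS_open, path, flags, mode);

        fd = syscall(SYS_open, npath, flags, mode);
        free(npath);
        return fd;
    }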

> What we cannot do is continue to march around with our fingers in our
> ears shouting that filenames aren't names, that they have no encoding
> because they bear no encoding.

It's almost as bad as decreeing that there is One True Encoding, that
it's suitable for every application, that all filenames shall be
encoded strings, and that that encoding shall be mandated for all uses.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse%rodents-montreal.org@localhost
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

