Re: A draft for a multibyte and multi-codepoint C string interface

To: tech-userlevel%NetBSD.org@localhost
Subject: Re: A draft for a multibyte and multi-codepoint C string interface
From: Mouse <mouse%Rodents-Montreal.ORG@localhost>
Date: Tue, 2 Apr 2013 20:27:31 -0400 (EDT)

>       "namei" is called (5770) with a second parameter of zero to
> locate the named file.  ("u.u_arg[0]" contains the address in the user
> space of a character string which defines a file path name.)
>       -- Lions Commentary on UNIX 6th Edition, page 18-3

> A "character string", he says.

Indeed.  I get sloppy with language sometimes too.

All this means is that he was thinking of it as a character string.
That doesn't mean it actually _is_ one.  I regularly think of the
things stored in `char's as characters too.  Doesn't mean they are;
this is a kind of mental shorthand, and, like most mental shorthands,
losing track of the fact that that's what it is leads to confusion.

> Once you accept the idea the filenames are strings and strings are
> encoded, you no longer need defend the airless proposition that
> they're uninterpretable octet sequences.

I'm not sure what "airless" is supposed to mean here.  The context
makes it appear to be a content-free denigratory adjective.  The
question-begging implicit in such a thing aside, it's equally true to
say that once you realize that filenames are just octet sequences, you
no longer need to defend the idea that they have to be encoded
character strings, or that when they are there has to be only a single
encoding in use.

You are still being sloppy with "filename".  There are at least two
things that can reasonably be called filenames.  One of them (let's
call it type 1) is a conceptual thing, occurring in human minds, and
usually either is a character string or is implemented as a character
string, depending on which way you prefer to think of it.  The other
(type 2) is an octet sequence such as is passed to open(2) and related
calls.  Type 2 filenames are used to implement type 1 filenames, but
can also be used for other things.
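
To make the distinction concrete, here is a minimal Python sketch.  The
variable and function names are illustrative only; nothing here is an
API anyone in this thread has proposed.  It shows one type-1 filename
producing two different type-2 filenames depending on the encoding
chosen, and a type-2 filename that is not the UTF-8 encoding of any
character string at all.

```python
# Illustrative sketch; every name below is hypothetical.
type1 = "caf\u00e9.txt"                 # a character string: c a f e-acute . t x t

as_latin1 = type1.encode("iso8859_1")   # one possible type-2 name
as_utf8 = type1.encode("utf-8")         # another, for the same type-1 name


def is_utf8(name: bytes) -> bool:
    """Report whether an octet sequence happens to be well-formed UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Acceptable as a single path component on a filesystem that treats
# names as opaque octets: it contains no '/' and no NUL.
opaque = b"\xff\xfe\x01"
```

Here as_latin1 and as_utf8 differ as octet sequences even though they
implement the same type-1 filename, and opaque implements no type-1
filename under UTF-8.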

>>> Unicode combining characters, by contrast, allow distinct byte
>>> sequences to represent identical character sequences.
>> This is a good reason to avoid using Unicode for encoding filenames
>> (that is to say, for converting abstract filenames to encoded
>> strings for use with syscalls).
> I don't see any remotely practical alternative.

The only things I can attribute this to are lack of imagination
(unlikely in your case) or having your mind stuck on the idea that all
type-2 filenames which are encoded type-1 filenames must be encoded in
the same encoding - that is, that your "alternative" must be one single
encoding.

The most common encoding I use for filenames is ASCII (as I suspect is
true for most people who live and work primarily in English) - those
filenames can, of course, also be viewed as being in any superset of
ASCII: 8859-*, UTF-8, KOI-8, what have you.  When I have
non-ASCII-encodable type-1 filenames, I most often use 8859-1 for them.
But there's no reason I couldn't use KOI-8, or 8859-7, or 8859-15, or,
yes, even UTF-8, if I found it convenient.  _That_ is the real
advantage of filesystems that treat (type 2) filenames as uninterpreted
octet strings: they let upper layers use whatever encoding they please,
whatever they find appropriate to the task at hand.  (In my example
this freedom ends up showing through all the way to the human layer; I
have trouble seeing that as a bad thing.  Giving humans multiple
choices, what a concept!)
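
The point about ASCII names can be checked with a short Python sketch.
The filename is made up for illustration, and I am assuming KOI8-R as
the concrete KOI-8 variant, since Python's codec list names that one.

```python
# Illustrative sketch: an ASCII-only name reads the same under every
# ASCII superset listed above.  "koi8_r" is an assumed stand-in for KOI-8.
name = b"Makefile.inc"
encodings = ["ascii", "iso8859_1", "iso8859_7", "iso8859_15", "koi8_r", "utf-8"]

decoded = {enc: name.decode(enc) for enc in encodings}

# True when every encoding yields the same character string.
all_same = len(set(decoded.values())) == 1
```

Only once a name contains octets outside the ASCII range does the
choice of encoding start to matter.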

Yes, this means that it's possible for, say, a type-1 filename that
wants Cyrillic to collide with a type-1 filename that wants Greek
because the KOI-8 encoding of the one happens to be identical to the
8859-7 encoding of the other.  I don't see this as any worse - in fact,
as significantly less bad - than the problems with mandating
UTF-8-encoded normalized Unicode; Unix does not prevent you from doing
stupid things because that also prevents you from doing clever things.
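
A minimal Python sketch of such a collision follows.  The two-letter
names are invented for illustration, and KOI8-R is again assumed as the
KOI-8 variant.

```python
# Illustrative sketch: two different type-1 names, one type-2 name.
cyrillic = "\u0430\u0431"   # Cyrillic small letters a, be
greek = "\u0391\u0392"      # Greek capital letters Alpha, Beta

as_koi8 = cyrillic.encode("koi8_r")      # octets 0xC1 0xC2
as_8859_7 = greek.encode("iso8859_7")    # also octets 0xC1 0xC2

collide = as_koi8 == as_8859_7
```

The filesystem sees a single name, 0xC1 0xC2; which type-1 filename it
implements is entirely up to the layer that chose the encoding.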

> Unicode has been adopted everywhere: by Microsoft, Apple, and most
> flavors of Linux.

Two large companies and numerous dialects of one open-source OS is
"everywhere"?!

That aside, a popular mistake is still a mistake.  (Whether it _is_ a
mistake is, in part, what we're discussing here; I'm just pointing out
that popularity alone doesn't mean all that much.)

> By "unique", I meant that no two names in the same directory should
> represent the same character string, regardless of their byte
> sequence.

This is impossible to do unless you somehow forbid type-2 filenames
which aren't encoded type-1 filenames.  It is also Procrustean in that
it requires all of a directory's names to use the same encoding, at
least unless you extend all available filesystems to carry encoding
tags with the directory entries' names, which will necessarily render
them on-disk-incompatible with others' filesystems.
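
To see why, consider what such a uniqueness check would have to look
like.  The Python below is a hypothetical sketch, not anything proposed
in this thread; the function name, its encoding parameter, and the
choice of NFC as the normalization form are all my assumptions.

```python
import unicodedata
from typing import Optional


def same_character_string(a: bytes, b: bytes, encoding: str) -> Optional[bool]:
    """Hypothetical check: do two type-2 names denote the same characters?

    Returns None when either name is not valid in the assumed encoding,
    because the question then has no answer.
    """
    try:
        sa, sb = a.decode(encoding), b.decode(encoding)
    except UnicodeDecodeError:
        return None
    # NFC is an assumption; any single normalization form would do.
    return unicodedata.normalize("NFC", sa) == unicodedata.normalize("NFC", sb)
```

The check cannot even be stated without the encoding argument, and it
has no answer for names that are not valid in that encoding: for
example, same_character_string(b"\xe5", b"\xc3\xa5", "utf-8") returns
None, while the same call with "iso8859_1" returns False.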

In my opinion, giving upper layers the freedom to use whatever encoding
they find convenient wins.  Or, since what we currently have is very
close to type-2 filenames being opaque octet sequences, "giving" should
really be more like "not removing"; not breaking that freedom strikes
me as by far the bigger win than...I'm actually not certain what the
win of enforcing UTF-8-encoded normalized Unicode is.  All the
properties of such a scheme that come to mind amount to "_I_ think this
is a Bad Thing so I'm not going to let _you_ do it", which I have
trouble seeing as a win for anyone.  (Consider those årets_fotos files
mentioned upthread.  Why should _I_ be unable to use those distinct
(type-2) filenames because _your_ favourite encoding happens to
consider them to be the same character sequence?  Just to handhold you
so that you don't have to actually live with all the properties of the
encoding you choose to use?)
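
For concreteness, here is a short Python sketch of the two names in
question, assuming they differ in the usual way (precomposed versus
combining-character form of the first letter).  The variable names are
illustrative.

```python
import unicodedata

# Illustrative sketch: one character string, two UTF-8 octet sequences.
name = "\u00e5rets_fotos"

nfc = unicodedata.normalize("NFC", name).encode("utf-8")
nfd = unicodedata.normalize("NFD", name).encode("utf-8")

# nfc is b"\xc3\xa5rets_fotos"  (precomposed a-with-ring, 12 octets)
# nfd is b"a\xcc\x8arets_fotos" ('a' plus combining ring, 13 octets)
distinct_octets = nfc != nfd
same_characters = nfc.decode("utf-8") == unicodedata.normalize(
    "NFC", nfd.decode("utf-8")
)
```

A filesystem that treats names as opaque octets stores these as two
distinct directory entries; only a layer that both assumes UTF-8 and
normalizes would call them the same name.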

If you want to bloat your kernel with Unicode hair and cripple your
users' and programs' ability to use whatever encodings they find
appropriate for their tasks, fine; it's not for me to tell you how to
run your systems.  But imposing even one of those costs on NetBSD is
something I have trouble calling anything weaker than "broken".

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse%rodents-montreal.org@localhost
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
