Re: A draft for a multibyte and multi-codepoint C string interface

To: tech-userlevel%NetBSD.org@localhost
Subject: Re: A draft for a multibyte and multi-codepoint C string interface
From: Mouse <mouse%Rodents-Montreal.ORG@localhost>
Date: Thu, 18 Apr 2013 14:28:45 -0400 (EDT)

>>> The user has no way to know nor reason to care whether "året" uses
>>> four Unicode code points or five.
>> Or no Unicode anything, if the user doesn't happen to find Unicode
>> appropriate for the task at hand.
> It cannot be up to each user of the filesystem.

I don't see why not.  That's what we have today, and it works well
enough for many purposes.  Sure, it's got problems - but so does your
plan, and so does every other; it's just a question of which sets of
problems are least problematic for which users and tasks.

> [I]f the filesystem has no known encoding, no user can know how to
> decode any filename.

That is simply false.  Today, for example, I have a number of (type-2)
filenames which result from encoding (type-1) filenames in 8859-1.  (I
might have a few in other encodings; I can't think of any offhand, but
I have enough filenames that there are certain to be a lot that have been
paged out of my wetware.)  I know how to decode every last one of them.
And the filesystem has no encoding, known or unknown - those file names
do, but the filesystem doesn't.
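
To illustrate the distinction (a sketch only; Python is used purely for
brevity and the variable names are invented for this example): a
character-sequence name turns into an octet-sequence name only once an
encoding is chosen, and anyone who knows that choice can reverse it
without any help from the filesystem.

```python
# Sketch: the encoding is a property of the name, not of the filesystem.
char_name = "\u00e5ret"                          # "året" as a character sequence

octets_8859_1 = char_name.encode("iso-8859-1")   # one possible on-disk name
octets_utf8 = char_name.encode("utf-8")          # a different octet sequence

print(octets_8859_1)   # b'\xe5ret'
print(octets_utf8)     # b'\xc3\xa5ret'

# Knowing which encoding produced each name is enough to decode it.
assert octets_8859_1.decode("iso-8859-1") == char_name
assert octets_utf8.decode("utf-8") == char_name
```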

> While it's technically possible to record the filename's encoding
> per-directory or even per-file, I assert it's both infeasible and
> undesirable.

You've been asserting a lot of things, mostly without any backing and
in the face of substantial disagreement from others on the list.  Is
there any particular reason this assertion should be given any more
credence?

>>> "vi året" [...] character strings match regardless of the
>>> byte-sequences, but today the odds are 1:4 against.
>> Where'd you get 1:4?  
> Either the user's provided string or the filename could use the
> single codepoint U+00E5 or two, the combination of 'a' and the
> COMBINING RING character.[1]  That's 4 combinations.

Yes, but (a) you're assuming only one of those four works, when
actually, two of them do (as long as the provided string and the name
in the filesystem match, it'll work), and (b) "4 combinations" and
"odds are 1:4 against" are equivalent only when each combination is
equiprobable, which is unlikely in practice (especially since the four
possibilities are the cross product of two far-from-independent makings
of a single two-alternative choice).
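
Both points can be checked mechanically.  The following is a sketch
under stated assumptions, not a model of any real system: it assumes
UTF-8 for both sides, takes the precomposed form to be U+00E5 and the
decomposed form to be 'a' followed by U+030A, and the helper
match_probability is hypothetical.  It also assumes, for the sake of
argument only, that the two choices are made independently with the
same probability p; even then the chance of a match never drops below
one half.

```python
import itertools
import unicodedata

word = "\u00e5ret"
forms = {
    "precomposed": unicodedata.normalize("NFC", word).encode("utf-8"),
    "decomposed": unicodedata.normalize("NFD", word).encode("utf-8"),
}

# Cross product of (what the user typed) x (what the filesystem holds).
matches = 0
for (typed, t_octets), (stored, s_octets) in itertools.product(forms.items(), repeat=2):
    same = t_octets == s_octets
    matches += same
    print(f"typed {typed:11}  stored {stored:11}  match: {same}")

print(matches, "of 4 combinations match")


def match_probability(p: float) -> float:
    """Chance of a match if each side independently picks 'precomposed' with probability p."""
    return p * p + (1 - p) * (1 - p)


print(match_probability(0.5))   # the worst case under this assumption
```

If the two choices are correlated, as they usually would be when the
same tools produce both names, the chance of a match is higher still.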

> If system A uses one scheme and the filesystem is also (or later)
> mounted on system B that uses the other, manipulating those files on
> system B will be difficult, to say the least.

Only if (a) ls (or equivalent) conceals the actual octet sequence from
the user, presenting only the character sequence, and/or (b) the shell
(or equivalent) makes it difficult to generate that octet sequence.
See below.

> It need not be two systems, either.  NetBSD offers no filename
> canonicalization.  One user can create the file by pasting the name
> from xpdf; that same user may not be able to open it from the command
> line because he cannot recreate the byte-sequence.

Only if the shell makes it difficult-to-impossible to create the
relevant octet sequence.

I note that both this and the previous quote's difficulties are cured
(or at the very least workarounds made easy) by recognizing that
(type-2) filenames are octet sequences, octet sequences which may be
generated from character sequences and which may be used to generate
character sequences, but still fundamentally octet sequences, and
making userland tools behave accordingly - or, since that's
kinda-mostly what we have now, not breaking the userland tools.
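
As a sketch of what "behave accordingly" could mean for a listing tool
and a shell: one side renders the octets unambiguously, the other
accepts that rendering and reproduces the exact octets.  The two
helpers below, show_octets and parse_octets, are hypothetical and exist
only for this example; they do not describe the behaviour of any
existing ls or shell.

```python
def show_octets(name: bytes) -> str:
    """Render a filename unambiguously: printable ASCII as-is, the rest as \\xNN."""
    out = []
    for b in name:
        if 0x20 <= b < 0x7F and b != 0x5C:   # printable ASCII except backslash
            out.append(chr(b))
        else:
            out.append(f"\\x{b:02x}")
    return "".join(out)


def parse_octets(text: str) -> bytes:
    """Inverse of show_octets: turn the escaped form back into raw octets."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text.startswith("\\x", i):
            out.append(int(text[i + 2:i + 4], 16))
            i += 4
        else:
            out.append(ord(text[i]))
            i += 1
    return bytes(out)


stored = "a\u030aret".encode("utf-8")   # decomposed form, as it might sit on disk
shown = show_octets(stored)
print(shown)                            # a\xcc\x8aret
assert parse_octets(shown) == stored    # the user can recreate the exact name
```

With something along these lines, the user on system B never needs to
guess which form or encoding system A used; copying the displayed form
is enough.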

>> I don't want "vi året" to match a filename in the filesystem whose
>> octet sequence is different from the one generated when I typed.
> I do not understand that point of view.  I fail to see any advantage
> in distinguishing between two identical strings differently encoded.

Because the system has no way of telling whether they _are_ two
identical strings differently encoded, instead of two different strings
(perhaps encoded with the same encoding, perhaps not).
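
A small sketch of the ambiguity (the two octet sequences and the choice
of encodings are picked for illustration; nothing here is stored
anywhere in a real filesystem):

```python
name_a = b"\xe5ret"
name_b = b"\xc3\xa5ret"

# Reading 1: the same string, differently encoded (8859-1 vs UTF-8).
print(name_a.decode("iso-8859-1") == name_b.decode("utf-8"))        # True

# Reading 2: two different strings, both encoded in 8859-1.
print(name_a.decode("iso-8859-1") == name_b.decode("iso-8859-1"))   # False

# Reading 3: name_a was never 8859-1 at all, e.g. it was 8859-5.
print(name_a.decode("iso-8859-1") == name_a.decode("iso-8859-5"))   # False
```

All three readings are self-consistent; the octets alone do not say
which one the creator of the files intended.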

You would, near as I can tell, "solve" this by mandating the same
encoding everywhere, or at least everywhere on a single filesystem.
This introduces one of the problems you sketched above, and, if the
admin has any choice of encoding, the other one as well if the two
accesses have an admin-chosen encoding change between them.  (That's in
addition to all the other problems it introduces, of course.)

>> an encoding which you think should have some kind of special status.
> I'm not granting UTF-8 special status.  Like democracy, it's the
> worst except for all others tried previously.

Maybe for _your_ purposes.  I find other encodings more useful for most
purposes, for a variety of reasons.

>>> That's just a small matter of programming.
>> I think it's less small than you seem to.  But until/unless someone
>> tries to do it, we can't really know.
> True.  But if the denizens of tech-userlevel do not agree -- in the
> sense of rough consensus and running code -- that it's a worthwhile
> goal, it's unlikely anyone will try,

Oh, nonsense.  You can try it anytime you like, no matter what
tech-userlevel thinks.  I've got lots of things in my source tree
NetBSD doesn't want, some of which could reasonably count as
experimental extensions (eg, AF_TIMER sockets); you could do likewise
with this if you felt like it.  tech-userlevel's opinion becomes
relevant only if you want it to be checked into the main NetBSD tree.

> and near certain no one will succeed in getting it adopted.

No one will succeed in getting it adopted until it's implemented.  If
you think this is a good idea, especially if you think it's a _small_
matter of programming, I'd suggest actually trying it.  Coming back
here with a report of the form "I tried this, and here's where it
succeeds and here's where it breaks for me; here's where you can fetch
the patches to try it yourself" would be an _excellent_ idea.  We can
then debate with something at least vaguely approaching real data.
I've seen lots of opinions in various directions, even contributing a
few myself :), but all the opinions in the world aren't worth anything
compared to real data.

I'd try it myself except that (a) I don't have a lot of time for
personal software writing at all these days and (b) I prefer to put my
own time into writing software I expect to enjoy using, or at least not
find actively distasteful to use.  I could be wrong in my expectation,
of course, but I still prefer to triage my projects-to-try based on my own
estimation of how much I'll like them.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse%rodents-montreal.org@localhost
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
