Re: A draft for a multibyte and multi-codepoint C string interface

To: tech-userlevel%netbsd.org@localhost
Subject: Re: A draft for a multibyte and multi-codepoint C string interface
From: "James K. Lowden" <jklowden%schemamania.org@localhost>
Date: Wed, 17 Apr 2013 22:53:24 -0400

On Mon, 15 Apr 2013 23:36:15 -0400 (EDT)
Mouse <mouse%Rodents-Montreal.ORG@localhost> wrote:

> > We permit two filenames in one directory whose letter sequence is
> > identical if the byte sequence differs.
> 
> Sure.  Like "HeLLo" and "hello".  Some filesystems consider those to be
> the same name.

The issue of case has been raised several times in this thread,
never by me.  I don't know why it's such a hobbyhorse.  Users can choose
to use filesystems that are case-sensitive or not.  As far as I know,
it's impossible to create two files in one directory in HFS+ or NTFS
whose names differ only by case.  The filesystem itself will allow only
one of them, because to it they are the same.   You may sneer because
naïve users lose so much of their namespace, but at least the system is
consistent.  

The UTF-8 filename situation on NetBSD today is much worse.  The user
has no reasonable way to know how a filename is composed, no way to
enforce canonicalization, no way to search for the various possible
permutations of the name.  

> > The sort order is arbitrary: "coeur" and "cœur" don't sort next to
> > each other, although they should.

> (b) they can be made to by using encoding-aware sorting code in
> whatever program is doing the sorting. (Which actually has to be
> language-, or at least locale-, aware too; consider the ae-vs-æ
> example, [..]

Yes, true.  I still maintain that if the kernel will not enforce
canonicalized uniqueness, libc must.  
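
To make the sorting point concrete, here is a minimal sketch in Python.
The LIGATURES table and sort_key helper are hypothetical, invented for
illustration; this is a toy expansion, not a real collation algorithm,
and it ignores the locale dependence noted above.

```python
# Toy illustration: code-point order versus a ligature-aware sort key.
# LIGATURES is a hypothetical, deliberately tiny table, not a real collation.
LIGATURES = {"\u0153": "oe", "\u00e6": "ae"}  # oe-ligature, ae-ligature

def sort_key(name):
    """Expand known ligatures, then break ties on the original spelling."""
    expanded = "".join(LIGATURES.get(ch, ch) for ch in name)
    return (expanded, name)

names = ["cow", "c\u0153ur", "coeur", "czar"]

by_codepoint = sorted(names)           # plain code-point comparison
by_key = sorted(names, key=sort_key)   # ligature-aware comparison

print(by_codepoint)
print(by_key)
```

Under plain code-point order the ligature spelling lands after "czar";
with the toy key the two spellings end up adjacent.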

> > The user has no way to know nor reason to care whether "året" uses
> > four Unicode code points or five.
> 
> Or no Unicode anything, if the user doesn't happen to find Unicode
> appropriate for the task at hand.

It cannot be up to each user of the filesystem.  That is where we
began: if the filesystem has no known encoding, no user can know how to
decode any filename.  While it's technically possible to record the
filename's encoding per-directory or even per-file, I assert it's both
infeasible and undesirable.  

> I think this is one of the most fundamental disagreements between us:

(I think we just disagree about the nature of freedom and chaos.)  

> > If he types "vi året", I think the file should open if the character
> > strings match regardless of the byte-sequences, but today the odds
> > are 1:4 against.
> 
> Where'd you get 1:4?  

Either the user's provided string or the filename could use the single
codepoint U+00E5 or two, the combination of 'a' and U+030A, the
COMBINING RING ABOVE character.[1]  That's 4 combinations.  
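
The four pairings can be enumerated with a short Python sketch.  It
assumes the two spellings correspond to Unicode normalization forms NFC
(composed) and NFD (decomposed), and that names are compared as raw
UTF-8 bytes; the variable names are illustrative only.

```python
import unicodedata

# The same visible name, spelled both ways.
composed = unicodedata.normalize("NFC", "\u00e5ret")    # single code point for the first letter
decomposed = unicodedata.normalize("NFD", "\u00e5ret")  # 'a' followed by COMBINING RING ABOVE

print(len(composed), composed.encode("utf-8"))
print(len(decomposed), decomposed.encode("utf-8"))

# Every (typed, stored) pairing, compared byte for byte.
forms = {"NFC": composed, "NFD": decomposed}
matches = {
    (typed, stored): forms[typed].encode("utf-8") == forms[stored].encode("utf-8")
    for typed in forms
    for stored in forms
}
for pair, same in matches.items():
    print(pair, "match" if same else "no match")
```

A byte-wise comparison succeeds only when the typed and stored names
happen to use the same form.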

Does that mean a 25% success rate?  No, that was a small joke.  This we
know for sure, though: If system A uses one scheme and the filesystem
is also (or later) mounted on system B that uses the other,
manipulating those files on system B will be difficult, to say the
least.  The odds of getting it right will be 0% (with some degree of
rounding).   

It need not be two systems, either.  NetBSD offers no
filename canonicalization.  One user can create the file by pasting the
name from xpdf; that same user may not be able to open it from the
command line because he cannot recreate the byte-sequence.  

[1] http://www.unicode.org/faq/char_combmark.html#2

> I don't want "vi året" to match a filename in the filesystem whose
> octet sequence is different from the one generated when I typed.  Not
> even if the octet sequences in question represent the same character

I do not understand that point of view.  I fail to see any advantage in
distinguishing between two identical strings differently encoded.  

> in an encoding which you think should have some kind of special
> status.

I'm not granting UTF-8 special status.  Like democracy, it's
the worst except for all others tried previously.  

I'm saying only that if you mount your filesystem with UTF-8 names, the
system would be easier to use if name-matching is based on characters,
not bytes.  

I see no better choice than to store only canonicalized names and to
convert input strings to canonical form before comparing.  Provided, of
course, that the filesystem has a declared encoding for filenames.  
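
As a rough sketch of that idea, the following Python models a directory
as an in-memory table.  The Directory class and canonical() helper are
hypothetical, and the choice of NFC as the canonical form is an
assumption; nothing here reflects an existing NetBSD interface.

```python
import unicodedata

def canonical(name: str) -> bytes:
    """Return the stored form of a name: NFC-normalized, UTF-8 encoded.
    NFC is an assumption; any single agreed form would serve."""
    return unicodedata.normalize("NFC", name).encode("utf-8")

class Directory:
    """Toy in-memory directory that stores only canonicalized names."""

    def __init__(self):
        self._entries = {}

    def create(self, name: str, data: bytes) -> None:
        key = canonical(name)
        if key in self._entries:
            # Same characters, whatever the input byte sequence was.
            raise FileExistsError(name)
        self._entries[key] = data

    def open(self, name: str) -> bytes:
        # Canonicalize the input before comparing against stored names.
        return self._entries[canonical(name)]

d = Directory()
d.create("a\u030aret", b"contents")   # created with the decomposed spelling
print(d.open("\u00e5ret"))            # opened with the composed spelling
```

Because both create and open pass through canonical(), a second create
with the other spelling is rejected as a duplicate rather than silently
producing a look-alike entry.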

> > I'm confident that glob(3) could be adapted to Unicode, that open(2)
> > could canonicalize, that ffs could be changed to reflect the
> > encoding, and mount(2) to enforce it.
> 
> Could be, yes.
> 
> > That's just a small matter of programming.
> 
> I think it's less small than you seem to.  But until/unless someone
> tries to do it, we can't really know.

True.  But if the denizens of tech-userlevel do not agree -- in the
sense of rough consensus and running code -- that it's a worthwhile
goal, it's unlikely anyone will try, and near certain no one will
succeed in getting it adopted.  

--jkl
