tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface



> I'm interested in useability.  We already *permit* filenames to be
> encoded with UTF-8, but we don't *support* them.

That's what "support" _means_, in a lot of cases.  "Do you support
filenames longer than 14 characters?"

> We permit two filenames in one directory whose letter sequence is
> identical if the byte sequence differs.

Sure.  Like "HeLLo" and "hello".  Some filesystems consider those to be
the same name.

> The sort order is arbitrary: "coeur" and "cœur" don't sort next to
> each other, although they should.

(a) that's encoding-dependent (whatever octet sequence it is that you
think of as the oe ligature may mean something completely different to
whoever created the file); (b) they can be made to by using
encoding-aware sorting code in whatever program is doing the sorting.
(Which actually has to be language-, or at least locale-, aware too;
consider the ae-vs-æ example, where the linguistically-appropriate sort
order for æ differs between English and Norwegian (and maybe others).)
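
To make point (b) concrete, here is a minimal sketch of a language-aware
sort key done entirely in the sorting program.  Everything in it is an
assumption for illustration: the names are assumed to be already-decoded
text, the two tiny tables stand in for real collation data, and
`sort_key` and the language tags "en" and "nb" are hypothetical names,
not any existing API.

```python
# Hypothetical, deliberately tiny collation tables; real ones are far larger.
ENGLISH_EXPANSIONS = {"\u00e6": "ae", "\u0153": "oe"}   # ae and oe ligatures
NORWEGIAN_TAIL = "\u00e6\u00f8\u00e5"                   # letters that follow z

def sort_key(name, language):
    """Return a per-language sort key for an already-decoded name."""
    name = name.lower()
    if language == "en":
        # English: treat each ligature as its two-letter spelling.
        expanded = "".join(ENGLISH_EXPANSIONS.get(c, c) for c in name)
        return [(0, c) for c in expanded]
    if language == "nb":
        # Norwegian: the three tail letters are distinct letters after z.
        return [(1, NORWEGIAN_TAIL.index(c)) if c in NORWEGIAN_TAIL else (0, c)
                for c in name]
    raise ValueError("no collation table for " + language)

names = ["zebra", "\u00e6re", "bok"]
print(sorted(names, key=lambda n: sort_key(n, "en")))   # ligature word first
print(sorted(names, key=lambda n: sort_key(n, "nb")))   # ligature word last
```

With the English table the two spellings of "coeur" get identical keys
and so sort next to each other; nothing in the filesystem or the kernel
has to know about it.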

> The user has no way to know nor reason to care whether "året" uses
> four Unicode code points or five.

Or no Unicode anything, if the user doesn't happen to find Unicode
appropriate for the task at hand.
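
For anyone who wants to see the four-versus-five difference rather than
take it on faith, a small sketch follows.  It assumes Unicode with UTF-8
as the encoding, which is exactly the assumption under dispute here; the
variable names are mine.

```python
import unicodedata

# The same visible word, built two ways.
composed = unicodedata.normalize("NFC", "a\u030aret")   # single code point U+00E5
decomposed = unicodedata.normalize("NFD", composed)     # 'a' + combining ring U+030A

print(len(composed), len(decomposed))        # 4 5
print(composed.encode("utf-8").hex(" "))     # c3 a5 72 65 74
print(decomposed.encode("utf-8").hex(" "))   # 61 cc 8a 72 65 74
print(composed == decomposed)                # False: different code point sequences
```

The two octet sequences are what the filesystem actually stores and
compares; the rendered glyphs are the same only to the eye.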

I think this is one of the most fundamental disagreements between us:
you seem to want user interfaces to hide such details, while I want
full visibility into what's really there (see below).  And, you push
the hiding line so far that it actually crosses into the kernel; I
think that is user interface stuff and belongs in userland, in, well,
user interface code.

> If he types "vi året", I think the file should open if the character
> strings match regardless of the byte-sequences, but today the odds
> are 1:4 against.

Where'd you get 1:4?  That seems to me to presume probabilities for a
number of things which I doubt you have even moderately precise numbers
for, such as the chance that the file was created using a different
encoding from the one used on the vi command line.

vi and/or the shell could actually do pretty much this now, if they
felt like it, by using a glob(3)-alike that considers any character
with more than one representation - or, more precisely, any octet
sequence which represents a character which has more than one
representation - to be a globbing wildcard matching any of its
representations.
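
A rough sketch of that userland approach, under loudly stated
assumptions: it takes UTF-8 as the encoding and Unicode canonical
equivalence as the definition of "same character", and instead of
expanding each character into a wildcard it compares normalized forms,
which accepts the same set of names.  `variants_match` and
`loose_lookup` are hypothetical names, not part of glob(3) or any
existing library.

```python
import os
import unicodedata

def variants_match(typed, candidate, encoding="utf-8"):
    """True if two octet sequences decode to canonically equivalent text."""
    try:
        a = typed.decode(encoding)
        b = candidate.decode(encoding)
    except UnicodeDecodeError:
        # Not valid in the assumed encoding: only an exact octet match counts.
        return typed == candidate
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def loose_lookup(directory, typed, encoding="utf-8"):
    """List the names in `directory` (a bytes path) equivalent to `typed`."""
    # Passing a bytes path makes os.listdir return the raw octet names.
    return sorted(name for name in os.listdir(directory)
                  if variants_match(typed, name, encoding))

print(variants_match("\u00e5ret".encode("utf-8"), "a\u030aret".encode("utf-8")))
print(variants_match(b"hello", b"HeLLo"))
```

Because the lookup returns the stored octet sequences unchanged, a
caller can still show exactly what is on disk, and can refuse to guess
when more than one name comes back.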

> Who considers that state of affairs good?

Me, for one.  I want vi and shells, and command-line tools in general,
to give me visibility into what's really there.  Not some
equivalence-class mangling of it.  Nor do I want them, and even less
stuff on the kernel side of the privilege divide, to inflict any
particular encoding, especially not one as broken (for many purposes)
as UTF-8, on me.

I don't want "vi året" to match a filename in the filesystem whose
octet sequence is different from the one generated when I typed.  Not
even if the octet sequences in question represent the same character in
an encoding which you think should have some kind of special status.

> I'm confident that glob(3) could be adapted to Unicode, that open(2)
> could canonicalize, that ffs could be changed to reflect the
> encoding, and mount(2) to enforce it.

Could be, yes.

> That's just a small matter of programming.

I think it's less small than you seem to.  But until/unless someone
tries to do it, we can't really know.

> For it to happen, though, we need consensus that it's good and
> necessary.  A consensus that seems surprisingly hard to establish.

I would say, _reassuringly_ hard to establish.  But that difference
probably reflects nothing but which sides of the issue we're on.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse%rodents-montreal.org@localhost
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

