
Re: A draft for a multibyte and multi-codepoint C string interface



>> For example, in filenames, I would consider <combining-acute><e> to
>> be different from <e-acute>, just as today I consider
>> <underscore><backspace><a> in a filename different from
>> <a><backspace><underscore>.
> Those situations are not analogous.

I disagree.  I do not want the filesystem interface to mangle the
provided byte (that is, "octet") sequence by assuming it knows
something about its semantics; I believe that the
filesystem interfaces - at the syscall level, at the on-disk level -
should be completely encoding-agnostic.  (Historically, they haven't
quite been.  I would support efforts to fix that.)
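
To make the distinction concrete, here's a minimal C sketch (C being
the language this thread is about anyway), assuming UTF-8.  The two
spellings render as the same glyph, but as byte strings they simply
differ:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* U+00E9, precomposed e-acute */
        const char *precomposed = "\xc3\xa9";
        /* U+0065 U+0301, e followed by combining acute */
        const char *decomposed = "e\xcc\x81";

        printf("same byte string? %s\n",
            strcmp(precomposed, decomposed) == 0 ? "yes" : "no");
        return 0;
    }

Knowing that those two are "the same" is a property of Unicode, not
of the byte strings; nothing below the presentation layer has any
business knowing it.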

> For reasons of security and convenience, the filesystem must
> canonicalize Unicode filenames.

The filesystem has no business knowing whether the filename is Unicode
or not - or, more precisely, the name at the filesystem interface
cannot be Unicode because it isn't a character string at all...except
in human minds.  It is and should be a byte string, which for human
presentation may be interpreted as a character string...or which may
never be presented to a human at all, or anything in between.  (This is
part of the reason I dislike case-folding for filenames even in its
historical manifestations.)
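
As a sketch of what "encoding-agnostic" means at the syscall level:
both of the byte strings above are perfectly good names.  On a
filesystem that stays out of the encoding business (ffs, ext4) the
following creates two distinct files; a normalizing filesystem (HFS+,
say) will conflate them.  Assuming UTF-8 again:

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        /* two byte strings a human may read as the same word */
        const char *a = "caf\xc3\xa9";   /* precomposed e-acute */
        const char *b = "cafe\xcc\x81";  /* e + combining acute */
        int fd;

        if ((fd = open(a, O_CREAT | O_WRONLY, 0644)) >= 0)
            close(fd);
        if ((fd = open(b, O_CREAT | O_WRONLY, 0644)) >= 0)
            close(fd);
        return 0;
    }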

The issue of "security and convenience" is spurious.  When there is
ambiguity, it is the job of user-interface tools to either conceal it
or display it unambiguously, as appropriate.  For example, today,
"x y" will often display the same as "x y" (two names whose
whitespace differs in ways this display flattens), but tools like ls
massage filenames before displaying them so that that potential
ambiguity does not become actual.  Similarly, the difference between
<combining-acute><e> and <e-acute> needs to be rendered with something
analogous when it matters.
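
A sketch of the sort of massaging I mean, in the same spirit as ls's
historical treatment of control characters (not ls's actual
algorithm, just the idea): render anything outside printable ASCII as
an escape, so visually confusable byte strings come out visibly
distinct.

    #include <stdio.h>

    static void show_name(const unsigned char *name)
    {
        for (; *name != '\0'; name++) {
            if (*name >= 0x20 && *name < 0x7f && *name != '\\')
                putchar(*name);
            else
                printf("\\x%02x", *name);
        }
        putchar('\n');
    }

    int main(void)
    {
        show_name((const unsigned char *)"e\xcc\x81");
        show_name((const unsigned char *)"\xc3\xa9");
        return 0;
    }

The first prints e\xcc\x81 and the second \xc3\xa9: the ambiguity is
displayed unambiguously, and the filesystem never had to care.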

> Would you be happy with three files in one directory named
> "årets_fotos" ("the year's pictures"), simply because each happened
> to be represented in the dirent with different codepoint sequences?

Happy with in what sense?  Do I think the filesystem should permit it?
Absolutely.  Do I think it has potential to confuse users?  Of course.
Are those inconsistent?  No; it is not the filesystem's job to keep
users from being confused.  Similarly, "x<tab>y" and "x       y" may
look similar or identical, as may, as I mentioned above, "x y" and
"x y".  I see nothing wrong with any of this.  For that matter,
consider three files named IlI, lll and lII, which in some fonts render
identically and in lots more almost so.  Even the font I'm using to
type this has only one pixel of difference between the relevant glyphs.
Similarly, there's no reason <A-ring> and <angstrom> have to look
identical, even if some fonts do make them so.  O vs 0.  m vs rn.  This
is very far from a new issue.
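
For the curious, here is one plausible trio, assuming UTF-8: the å
can be spelled as precomposed a-ring, as a plus a combining ring
above, or as the angstrom sign, giving three distinct byte strings:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *name[3] = {
            "\xc3\xa5" "rets_fotos",     /* U+00E5, a-ring       */
            "a\xcc\x8a" "rets_fotos",    /* U+0061 U+030A        */
            "\xe2\x84\xab" "rets_fotos", /* U+212B, angstrom     */
        };
        int i;

        for (i = 0; i < 3; i++)
            printf("%u bytes\n", (unsigned)strlen(name[i]));
        return 0;
    }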

> In what sense would that be a Good Thing?

In the sense that it keeps a messy human-interface issue - whether
byte sequences represent character strings and, if so, what encoding
is used - out of fundamentally software-to-software interfaces
(syscalls and on-disk data).

In the sense that it means that programs using filenames for
non-human-presentation purposes are not forced to know all the ins and
outs of Unicode (itself a moving target) just in order to handle
filenames without tripping over cases such as two different byte
sequences being considered identical.
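
Such a program can be as simple as this: read names out of a
directory and hand the same bytes back to the kernel, with no
decoding, no normalization, and no Unicode tables anywhere in sight.
There is nothing for it to trip over, because at this layer no two
distinct byte sequences are "really" the same name:

    #include <sys/types.h>
    #include <dirent.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        DIR *d;
        struct dirent *de;
        int fd;

        if ((d = opendir(".")) == NULL)
            return 1;
        while ((de = readdir(d)) != NULL) {
            /* the name is an opaque byte string; pass it
               back exactly as it came */
            if ((fd = open(de->d_name, O_RDONLY)) >= 0)
                close(fd);
        }
        closedir(d);
        return 0;
    }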

> Certainly the Worse is Better school would push the problem out to
> userland and absolve the filesystem.  ISTM filenames are there for
> the user's sake, and filename uniqueness is judged at the semantic
> level of linguistic perception.

That isn't what Unix has been historically.  It would probably be
possible to twist Unix into such a thing, but it would be difficult
and ugly, and would lose the clean design that makes Unix so
powerful.  "Unix does not prevent you from doing stupid things,
because that would also prevent you from doing clever things."  For
that matter, as Saint-Exupéry put it, "It seems that perfection is
attained not when there is nothing more to add, but when there is
nothing more to take away."  Don't add more special cases.  Get rid
of the ones already there instead.

> Leaving him to fend for himself against Unicode's unfortunate
> complexity is a disservice.

This is true.  But pushing that unfortunate complexity into a
software-to-software interface isn't a right answer.

One right answer is to use something saner than Unicode.  Given all
the issues that trying to use it raises, I'm somewhat surprised
nobody else is pointing out that if you have to convolute software
within an inch of its life to work with Unicode, perhaps that
indicates that Unicode is not a right character set for the job.

Another is to put that complexity in the user-interface layers.
Pushing user-interface issues into software-to-software interfaces just
leads to complex, messy software interfaces full of ugly special cases,
difficult to use in any way the original designers didn't anticipate.
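
To sketch where such complexity could live instead: a display or
matching layer can fold equivalent spellings before comparing, while
everything beneath it keeps operating on raw bytes.  The
fold_one_case() helper below is purely hypothetical - a stand-in for
a real normalizer, handling only the single e-plus-combining-acute
case from earlier - but it shows the shape of the thing:

    #include <stdio.h>
    #include <string.h>

    /* hypothetical stand-in for a real normalizer: rewrites
       e U+0301 as U+00E9 and leaves everything else alone */
    static void fold_one_case(const char *in, char *out, size_t len)
    {
        size_t o = 0;

        while (*in != '\0' && o + 3 < len) {
            if (in[0] == 'e' &&
                (unsigned char)in[1] == 0xcc &&
                (unsigned char)in[2] == 0x81) {
                out[o++] = '\xc3';  /* U+00E9, precomposed */
                out[o++] = '\xa9';
                in += 3;
            } else
                out[o++] = *in++;
        }
        out[o] = '\0';
    }

    int main(void)
    {
        char a[64], b[64];

        fold_one_case("cafe\xcc\x81", a, sizeof(a));
        fold_one_case("caf\xc3\xa9", b, sizeof(b));
        printf("UI-level match: %s\n",
            strcmp(a, b) == 0 ? "yes" : "no");
        return 0;
    }

The kernel never sees any of this; the messiness stays where the
humans are.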

Unix shines because it provides clean, orthogonal facilities which turn
out to be useful in many unanticipated ways.  I am in no position to
prevent NetBSD from deviating from that design, nor does it matter to
me in a pragmatic sense (I already have lots of reasons I won't be
using "modern" NetBSD).  But I would be saddened to see NetBSD go that
far astray from the principles that made Unix so great.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

