tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface



James K. Lowden wrote:
> The sort order is arbitrary: "coeur" and "cœur" don't sort next
> to each other, although they should.

At kernel level, I do not understand how you can say
- that they do not sort next to each other (since AFAIK there is no
sorting involved)
- that they should, or should not, sort next to each other (since this
obviously depends on the linguistic preferences of some "user")

Now if you consider the ls(1) level, I believe POSIX's localedef(1)
provides more than enough material to play with here, and you certainly
can force it to behave the way you want; some have actually achieved it.
I also believe the years have taught us this is not as paramount as some
of its inventors apparently thought it was, and the current base of
POSIX users does not seem to agree that this path should be developed
further.
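
To illustrate that "next to each other" depends on the collation rule
chosen and not on the bytes stored, here is a minimal Python sketch. The
fold_ligature() key is a hypothetical stand-in for a linguistic
preference; it is not what localedef(1) does, only the same idea in
miniature.

```python
words = ["cube", "c\u0153ur", "cot", "coeur"]   # "c\u0153ur" is "cœur"

def fold_ligature(word):
    # Hypothetical collation key: treat the ligature U+0153 as "oe".
    return word.replace("\u0153", "oe")

# Plain code point order: U+0153 sorts after every ASCII letter.
by_code_point = sorted(words)

# Same list, ordered by the folded key instead.
by_folded_key = sorted(words, key=fold_ligature)

print(ascii(by_code_point))
print(ascii(by_folded_key))
```

With code point order the two spellings end up at opposite ends of the
list; with the folded key they become neighbours.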


> The user has no way to know nor reason to care whether "året" uses
> four Unicode code points or five. If he types "vi året", I think the file
> should open if the character strings match regardless of the
> byte-sequences, but today the odds are 1:4 against.  

First, let me note that problems of this class only happen in
heterogeneous environments, such as people working on different
computers with remotely mounted file systems; it is very unlikely that,
on a given machine, unless done deliberately for test purposes, someone
creates a file name encoded with four code points (precomposed å,
U+00E5), then tries to access it through another encoding using five
code points (a followed by combining ring above, U+030A); and obviously
such a case would rather indicate a bug (or inconsistency?) in the way
Unicode is handled within the system, such as NFC being forced in some
paths and NFD in others.
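
For reference, the two encodings of "året" can be compared with a short
Python sketch using the standard unicodedata module. The variable names
are mine, chosen for illustration only.

```python
import unicodedata

nfc = "\u00e5ret"    # precomposed U+00E5, then "ret"
nfd = "a\u030aret"   # "a", combining ring above U+030A, then "ret"

print(len(nfc), len(nfd))    # number of code points in each form
print(nfc.encode("utf-8"))   # bytes a file system would store for NFC
print(nfd.encode("utf-8"))   # bytes a file system would store for NFD
print(nfc == nfd)            # code point comparison, no normalization

# Normalizing either form converts it into the other.
print(unicodedata.normalize("NFC", nfd) == nfc)
print(unicodedata.normalize("NFD", nfc) == nfd)
```

The two strings render identically but differ both in length and in
their UTF-8 bytes, so a byte-wise comparison cannot match them.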

Then, while the case you describe here seems thorny, this behaviour is
probably the correct one to use as the default: unless special
instructions are given, it seems adequate to me to reject the request
and tell the user that there is no such file as "a\u030Aret", as she
asked: this avoids an entire class of confusions which happen when
glob() performs as the specifications say but against the user's intent.
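
A minimal Python sketch of that default, with the "special
instructions" as an explicit opt-in. The lookup() helper and its
normalize flag are hypothetical, and the directory listing is a plain
list standing in for what os.listdir() might return.

```python
import unicodedata

def lookup(names, wanted, normalize=False):
    """Return the entries of `names` that match `wanted`.

    By default the comparison is exact; with normalize=True both
    sides are compared after conversion to NFC.
    """
    if not normalize:
        return [n for n in names if n == wanted]
    key = unicodedata.normalize("NFC", wanted)
    return [n for n in names if unicodedata.normalize("NFC", n) == key]

directory = ["\u00e5ret", "notes.txt"]   # name stored in precomposed form
typed = "a\u030aret"                     # name typed in decomposed form

print(ascii(lookup(directory, typed)))                  # exact match only
print(ascii(lookup(directory, typed, normalize=True)))  # opt-in matching
```

The exact lookup finds nothing, which is the "no such file" answer; the
match only appears when the caller asks for normalization.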

Furthermore, if such a case happens often, I am certain the interested
user will learn to work around it, perhaps using "vi [[.å.]]ret" or
similar already-available features, perhaps using tab-completion if it
deals correctly with the issue, or "a better shell" (for his purposes).


Antoine

