Re: A draft for a multibyte and multi-codepoint C string interface
On Mon, 15 Apr 2013 11:05:50 +0200
> If there are user level tools to filter the "ls" output to match the
> variations (accented, not accented; capitalized, not capitalized;
> ligatures, no ligatures), fine. But user level.
If I understand you correctly, the most important point in this
discussion is that the kernel must make no interpretation of the
bytes it stores as a name.
> what do you get for writing the ligature 'oe' in naming a resource
> 'oeuvres' instead of the plain letters?
What do you mean by "plain" letters? ASCII? Perhaps my example was
poorly chosen, because the "oe" ligature is only a custom. Too many
languages cannot be represented, even crudely, with ASCII.
What you get is the user's ability to name things in his native
language.
> What has this to do with a computer resource?
"There are only two hard problems in computer science: cache
invalidation and naming things."
I'm interested in usability. We already *permit* filenames to be
encoded with UTF-8, but we don't *support* them. We permit two
filenames in one directory whose letter sequence is identical if the
byte sequence differs. The sort order is arbitrary: "coeur" and
"cœur" don't sort next to each other, although they should. The user
has no way to know nor reason to care whether "året" uses four
Unicode code points or five. If he types "vi året", I think the file
should open if the character strings match regardless of the
byte-sequences, but today the odds are 1:4 against.
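The "four or five code points" ambiguity is Unicode normalization: NFC composes the "å" in "året" into a single code point, NFD decomposes it into "a" plus a combining ring above. A minimal Python sketch of the problem (illustrative only; nothing here is part of any proposed kernel interface):

```python
import unicodedata

nfc = "\u00e5ret"    # "året" precomposed: 4 code points
nfd = "a\u030aret"   # "året" decomposed (a + combining ring): 5 code points

# The two spellings render identically but are distinct strings,
# so a filesystem that compares bytes treats them as two names.
assert nfc != nfd
assert len(nfc) == 4 and len(nfd) == 5

# Normalizing both to a single form makes them compare equal.
assert unicodedata.normalize("NFC", nfd) == nfc
assert unicodedata.normalize("NFD", nfc) == nfd
```

This is exactly why "vi året" fails three times out of four today: the shell hands the kernel whichever byte sequence the input method produced, and the kernel compares bytes, not characters.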
Who considers that state of affairs good?
I'm confident that glob(3) could be adapted to Unicode, that open(2)
could canonicalize, that ffs could be changed to reflect the encoding,
and mount(2) to enforce it. That's just a small matter of
programming. For it to happen, though, we need consensus that it's
good and necessary. A consensus that seems surprisingly hard to reach.
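As a rough user-space illustration of what a canonicalizing open(2) would do, here is a sketch in Python. The helper name and the choice of NFC as the canonical form are my own assumptions, not anything proposed on this list:

```python
import unicodedata

def open_canonical(path, *args, **kwargs):
    # Hypothetical stand-in for a canonicalizing open(2): normalize the
    # name to NFC before passing it on, so the decomposed spelling
    # "a\u030aret" and the precomposed "\u00e5ret" reach the filesystem
    # as the same byte sequence.
    return open(unicodedata.normalize("NFC", path), *args, **kwargs)
```

With this, opening "året" succeeds no matter which of the equivalent code-point sequences the user typed, which is the behavior argued for above. Doing it in the kernel (with ffs recording the encoding and mount(2) enforcing it) is the same idea, minus the race conditions and coverage gaps of a libc-level shim.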