Re: A draft for a multibyte and multi-codepoint C string interface

To: tech-userlevel%netbsd.org@localhost
Subject: Re: A draft for a multibyte and multi-codepoint C string interface
From: tlaronde%polynum.com@localhost
Date: Tue, 16 Apr 2013 01:04:47 +0200

On Mon, Apr 15, 2013 at 05:51:33PM -0400, James K. Lowden wrote:
> On Mon, 15 Apr 2013 11:05:50 +0200
> tlaronde%polynum.com@localhost wrote:
> 
> > If there are user level tools to filter the "ls" output to match the
> > variations (accented, not accented; capitalized, not capitalized;
> > ligatures, no ligatures), fine. But user level.
> 
> If I understand you correctly, the most important point in this
> discussion is that the kernel must make no interpretation of the
> filename.  

Yes: as far as the kernel is concerned, the system calls deal only with
a NUL-terminated string of octets (this happens to be the legacy); UTF-8
was designed precisely for this compatibility (and, furthermore, to be
usable in filenames: the bytes of the ASCII range never appear in the
encoding of a character outside the ASCII range).
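
To illustrate that last property, here is a minimal sketch in C. The
function name last_component is mine, purely for illustration. It looks
for '/' by scanning raw bytes, knows nothing about UTF-8, and still cuts
a UTF-8 pathname at the right place, because every byte of a multibyte
sequence has its high bit set and so can never be mistaken for '/'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return a pointer to the last component of path.
 * Hypothetical helper: it scans bytes only and has no notion of
 * characters, which is enough because '/' (0x2F) cannot occur inside
 * the UTF-8 encoding of a non-ASCII character.
 */
const char *last_component(const char *path)
{
    assert(path != NULL);
    const char *slash = strrchr(path, '/');
    return slash != NULL ? slash + 1 : path;
}
```

Given the bytes of "répertoire/cœur" in UTF-8, the function returns a
pointer to the bytes of "cœur", without ever decoding anything.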

As far as the kernel is concerned, two identifiers name the same
resource exactly when strcmp(a, b) == 0. And that's all.
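
A small sketch of what that rule implies, assuming UTF-8 names (the
function name same_resource is only illustrative). Two names that render
identically, such as "café" written with a precomposed é (U+00E9) and
with e followed by a combining acute accent (U+0301), are different byte
strings and therefore different identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: identity of two resource identifiers as the
 * kernel would see it, i.e. plain byte-for-byte equality.
 */
int same_resource(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}

/* "café" with precomposed U+00E9 (bytes C3 A9). */
static const char precomposed[] = "caf\xC3\xA9";
/* "café" as 'e' followed by combining U+0301 (bytes CC 81). */
static const char decomposed[] = "cafe\xCC\x81";
```

Here same_resource(precomposed, decomposed) is false, and so is
same_resource("coeur", "Coeur"). Deciding that such names "should" match
is a policy, and it can be layered on top in user space.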
> 
> > what do you get for writing the ligature 'oe' in naming a resource
> > 'oeuvres' instead of the plain letters? 
> 
> What do you mean by "plain" letters?  ASCII?  Perhaps my example was
> poorly chosen, because the "oe" ligature is only a custom.  Too many
> languages cannot be represented, even crudely, with ASCII.  
> 
> What you get is the user's ability to name things in his native
> tongue.  

I mean that in French, there are accented letters that are significant.
And there is typographical sugar. A computer does not speak a native
tongue. Allowing a wider range of numbers, translated to a wider
range of glyphs, is OK. But for the OS, they are just numbers, encoded
in a NUL-terminated octet string that has one and only one meaning:
this resource.

> 
> > What has this to do with a computer resource? 
> 
> I'm interested in usability. 

Dropping codepages and mandating that, at the user level, UTF-8 is the
rule (bye-bye localization and so on) allows this. But this has been
done for Plan 9, and that should be considered a reference. Specifically,
what has been done, and what has not been done.

> We permit two filenames in one directory whose letter sequence is 
> identical if the byte sequence differs.  (...)

An OS provides rope. We allow case sensitivity and I (and others) use
it: for me, a leading capital marks a proper noun or a class noun,
while a leading minuscule marks an instance, for example.

The only thing that is mandatory is that if I use one identifier
for my resource, then when I reuse the same identifier I get that
resource. Windows is an example where filenames are not stored
"as is", or perhaps are stored "as is" but listed with capitals
displayed, or not, depending on the version of the OS, the version
of the filesystem, the position of the Moon, etc. The result is
that if you enter something definite in the filesystem and then try
to access this very same identifier from a case-sensitive Unix, you
get no match: "creativity" has decided that your string has to be
rendered differently, and over the network you have only indirect
access to the resource, so you get what the other side tells you. Do
you mean that a fileserver will have to serve the very same filesystem
listing differently, rendering the names differently depending on the
localization of the _client_? Or that case sensitivity has to be dropped
because coeur, Coeur and COEUR have the same "meaning"?

If a user enters what he wants to be the very same thing differently
each time, the problem is not the OS, it is the poet who sits in the
chair...

The same goes for passwords, quotas and so on: in a heterogeneous
network, with several users, the administrator sets the rules. This goes
for filenames too.

If someone wants to put in a factotum that intercepts the creation of
filenames, or the search for filenames, and implements some policy, this
is fine. But this has no place in the kernel, nor in the base system by
default: if I enter two distinct strings, I mean it. I can enter "coeur"
for an ASCII text discussing the organ; I can enter in the same directory
the typographical TeX version as "c\oe ur". Who will decide that I do not
have the right? Because this is exactly the case: the two files, for _me_,
the user, speak about the very same thing, but the rendering is
meaningfully distinct. And I mean it.

What would be a great step is the ability of the text utilities to
deal with UTF-8. But this is in user space, because that is where it
belongs.
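
As a rough idea of what such a user-level tool could do, here is a
sketch of a filter step that expands the "oe" ligature before comparing
names. Everything in it is an assumption for illustration: the name
fold_oe is invented, and it handles only U+0153 and U+0152 in UTF-8, not
accents, case, or any general Unicode folding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical user-space helper: copy the UTF-8 string src into dst
 * (capacity dstlen bytes), expanding U+0153 (bytes C5 93) to "oe" and
 * U+0152 (bytes C5 92) to "OE". All other bytes are copied unchanged.
 * Returns 0 on success, -1 if dst is too small.
 */
int fold_oe(const char *src, char *dst, size_t dstlen)
{
    assert(src != NULL && dst != NULL);
    const unsigned char *s = (const unsigned char *)src;
    size_t o = 0;

    while (*s != '\0') {
        if (s[0] == 0xC5 && (s[1] == 0x93 || s[1] == 0x92)) {
            int upper = (s[1] == 0x92);
            if (o + 2 >= dstlen)      /* two letters plus the NUL */
                return -1;
            dst[o++] = upper ? 'O' : 'o';
            dst[o++] = upper ? 'E' : 'e';
            s += 2;
        } else {
            if (o + 1 >= dstlen)      /* one byte plus the NUL */
                return -1;
            dst[o++] = (char)*s++;
        }
    }
    if (o >= dstlen)                  /* no room for the NUL */
        return -1;
    dst[o] = '\0';
    return 0;
}
```

A tool filtering "ls" output could fold both the pattern and each name
this way and then compare the results with strcmp(). The names on disk
stay exactly as they were entered; only the matching policy lives in the
tool.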

Furthermore, Unicode is kind of a mess. For example, instead of leaving
the initial noise of the legacy encodings alone while doing new things
cleanly, allocating clear slots of numbers and repeating, for example,
the mathematical symbols present in the first 256 code points in a
dedicated range, they have only added supplementary symbols in another
range: minus-or-plus is U+2213 while plus-or-minus is still only U+00B1.
Unicode is not something fixed, perfect, set once and forever. It
is not something to base an OS upon. The UTF-8 scheme is something
that could be adapted to whatever encoding appears later. Keep the
OS clean with NUL-terminated octet strings. Enhance utilities to
deal with UTF-8 (and not runes for exchange; runes are only internal).
And keep it agnostic about the meaning of the glyphs.
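
To make "runes are only internal" concrete, here is a minimal sketch of
the decoding step a utility could use: bytes come in as UTF-8, are
turned into a number (a rune) for processing, and would be written back
out as UTF-8. The name utf8_decode and its return convention are my own
choices for this example, not an existing interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode one code point from the NUL-terminated
 * UTF-8 string str into *rune. Returns the number of bytes consumed
 * (1 to 4), or 0 if the sequence is malformed, truncated, overlong,
 * a surrogate, or above U+10FFFF. A NUL byte decodes as rune 0 with
 * length 1, so the caller decides when to stop.
 */
size_t utf8_decode(const char *str, uint32_t *rune)
{
    assert(str != NULL && rune != NULL);
    const unsigned char *s = (const unsigned char *)str;
    uint32_t r, min;
    size_t n;

    if (s[0] < 0x80) {
        *rune = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        r = s[0] & 0x1F; n = 2; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        r = s[0] & 0x0F; n = 3; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        r = s[0] & 0x07; n = 4; min = 0x10000;
    } else {
        return 0;                     /* stray continuation or bad lead */
    }

    for (size_t i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)    /* also stops at the NUL */
            return 0;
        r = (r << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (r < min || r > 0x10FFFF || (r >= 0xD800 && r <= 0xDFFF))
        return 0;
    *rune = r;
    return n;
}
```

With this, the two-byte sequence C2 B1 yields U+00B1 (plus-or-minus) and
the three-byte sequence E2 88 93 yields U+2213 (minus-or-plus). What
those numbers mean is left to the utility, and nothing of this reaches
the kernel.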

This scheme does _not_ prevent what you want to do, but it does allow
not doing it by default, or doing it differently, or adapting to whatever
new! enhanced! definitive! version of the code will be "forever" valid
for a limited slot of time.

Whether on the engineering side or on the user ergonomics side, I fail
to see where the problem lies...

-- 
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C
