tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface

On Sun, Apr 14, 2013 at 07:56:55PM -0400, James K. Lowden wrote:
> French schoolchildren, as you well know, are taught to overlap the "oe"
> in coeur when they write it.  

It is typically something that nobody, in a school, will count as an
error, simply because there are enough variants in handwriting that
no one will try to enforce _calligraphy_ to that level; calligraphy
can be seen as fine-point typography in handwriting.

This special ligature is a fine-point rendering of etymology (because
the Latin or Greek had a leading 'o') and of phonetics, because the 'o'
is not fully heard (and there are variations in pronunciation between,
say, "oeuf" and "oesophage", some people pronouncing the latter with 'o'
'é'...). But the problem is that "this depends" and that the sound is
special. So we have two letters, 'o' and 'e', that both have to be
retained (for etymology), and neither can be completely dropped for the
sound. And in an alphabetical language, this is a problem, because one
has to deal with letters, and in this case the link is between two
distinct letters that have to fade a little, to combine a little, to
disappear individually a little: hence the ligature.  But this is a
ligature, a fine point, between two _distinct_ letters.

French orthography imposes accents. But accents are on _letters_, not
on a group of letters (French is alphabetical).

> $ echo 'c\(oeur \(finance' | groff -Tps >
> $ pstopdf -o oe.pdf

The problem is here. What do you use for this? A _typographic_ program.
And we are speaking about an operating system.

The system has nothing to do with _typography_ at the core. It does not
speak "English": it speaks "C" or "Unix" or something else.

The grammar (the rules) is neither that of the English tongue nor that
of any other natural tongue. The only superficial link is that tokens
may be formed from an extended set of "characters". As
explained in the "Hello World" paper, it is quite comfortable to
be able to use the Greek letter 'pi' when one wants, in whatever
"language" one is writing; but where did they use these "special"
letters in the system? Typically for naming devices, that is, objects
an average user will not deal with or will not have to deal with (the
leading '#' is also there to oblige one to really, really
mean it).

But these are just tokens; they are not English, or French, or
Greek or whatever: the grammar is still that of the OS. The character
set is just an extended range of signs with which to nickname objects.

So the "semantics" of a tongue do not have to go there, since the
"semantics" of a filename, as far as the OS is concerned, are the
computer resources designated by this nickname.

Dictionaries and typography must not plague the _system_, since
French, Greek, Hebrew and even English(!) are alien languages for the OS:
it speaks C, it speaks Unix, it speaks Plan9, it speaks Windows etc.

If someone wants to impose, on his network, a special policy for naming
shared resources, that is an administrator task and a user-level task.
It does not belong at the system level. The system takes and gives
filenames "as is", and the best representation is a byte string; UTF-8
strings are byte strings, which allows them to be read as Unicode; but
the system does not interpret them and does not care.
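
To make the "as is" point concrete, here is a minimal sketch in C of
what a byte-string view of filenames implies. The names NAME_LIGATURE,
NAME_PLAIN and same_name() are hypothetical, invented for this
illustration; they are not taken from any system source. The assumption
is simply that a name matches another only when the bytes match.

```c
#include <assert.h>
#include <string.h>

/* Two spellings a human reader may consider "the same" name. */
static const char NAME_LIGATURE[] = "c\xC5\x93ur"; /* U+0153 as UTF-8 */
static const char NAME_PLAIN[]    = "coeur";       /* plain 'o' + 'e' */

/*
 * Hypothetical helper showing the byte-string view: two names
 * designate the same entry only if every byte is identical.
 * No decoding, no normalization, no dictionary.
 */
static int same_name(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}
```

With these definitions, same_name(NAME_LIGATURE, NAME_PLAIN) is false:
for a system that only compares bytes, the two spellings are two
different nicknames, whatever they mean to a French reader.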

If someone wants to treat all the flavors and try to guess "what the
user meant", good for him. But if someone wants to booby-trap his
system, I fail to see why others should have the obligation to wear an
explosive belt also!

The "meaning" of a filename is the file. Allowing the user to nickname
the resource close to his tongue is handy. The nickname means
something to him in his tongue, but it means only one thing in the
system tongue: this is _this_ resource, and not another.

A policy should be enforced by administrators on their networks. But no
policy shall be enforced on the system except the system ones: correct
C, Unix, Plan9, Windows etc. "language".

If there are user-level tools to filter the "ls" output to match the
variations (accented, not accented; capitalized, not capitalized;
ligatures, no ligatures), fine. But at user level.
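
As an illustration of what such a user-level filter might do, here is a
rough sketch in C. Everything in it is an assumption made for the
example: the fold table, the struct and the function fold_name() are
hypothetical, the table covers only a handful of French cases, and the
input is assumed to be precomposed UTF-8. It folds a name to a plain
lowercase ASCII key so that two variants can be compared by a tool,
while the system keeps storing the original bytes.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fold table: UTF-8 sequence -> plain ASCII replacement. */
struct fold { const char *from; const char *to; };

static const struct fold folds[] = {
    { "\xC5\x93", "oe" },  /* U+0153, ligature oe, small   */
    { "\xC5\x92", "oe" },  /* U+0152, ligature OE, capital */
    { "\xC3\xA9", "e"  },  /* U+00E9, e acute              */
    { "\xC3\xA8", "e"  },  /* U+00E8, e grave              */
    { "\xC3\x89", "e"  },  /* U+00C9, E acute              */
};

/*
 * Write a folded copy of src into dst (capacity dstsize, NUL included).
 * Returns 0 on success, -1 if dst is too small.
 */
static int fold_name(const char *src, char *dst, size_t dstsize)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);
    if (dstsize == 0)
        return -1;

    while (*src != '\0') {
        const char *rep = NULL;
        size_t replen = 1, adv = 1, i;
        char single;

        /* Look for a table entry matching the bytes at src. */
        for (i = 0; i < sizeof folds / sizeof folds[0]; i++) {
            size_t n = strlen(folds[i].from);
            if (strncmp(src, folds[i].from, n) == 0) {
                rep = folds[i].to;
                replen = strlen(rep);
                adv = n;
                break;
            }
        }
        if (rep == NULL) {
            /* Lowercase ASCII only; any other byte passes untouched. */
            unsigned char c = (unsigned char)*src;
            single = (c < 0x80) ? (char)tolower(c) : *src;
            rep = &single;
        }
        if (out + replen >= dstsize)
            return -1;          /* keep room for the final NUL */
        memcpy(dst + out, rep, replen);
        out += replen;
        src += adv;
    }
    dst[out] = '\0';
    return 0;
}
```

A filter could then compare the folded keys of two names with strcmp()
and report them as variants of one another. Note how quickly the table
would have to grow to cover real usage, which is the point made just
below about the size of the task.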

And I'm quite confident that once someone has tried, for weeks or
months, to accommodate "creativity", he will realize that this is a
herculean task for strictly no benefit: what do you gain by writing the
ligature 'oe' when naming a resource 'oeuvres' instead of the plain
letters? It is more "typographically" correct? What has this to do with
a computer resource? He will then go back to sensible brute force: 
"Users, the filenames have to match these rules. You deviate: you're 

Remember: KISS! Why did Plan9 get things right in this area, and
rapidly? Because they allowed what was a huge gain, without tons of
complications, and by leaving the fuzzy things, mainly rendering,
outside. This is in the paper, and after thinking about the problems,
one returns to these very same tracks.

        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C
