tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface



On Sun, 7 Apr 2013 22:38:31 +0200
tlaronde%polynum.com@localhost wrote:

> Question: can "oe" be made a ligature, in the font, that is as soon as
> the 'o' 'e' sequence is found, it is replaced automatically by the
> 'oe' glyph in the font? No. Why? Because there are words in french,
> where the "o" "e" sequence does not yield the 'oe' rendering
> ("coexistence" for example; a lot of the "co" prefixed words).
> 
> Question: is 'oe' a letter of the alphabet? Neither. In french, the
> alphabet has only 24 letters, and this is not a part of it. Is it a
> phonetic sign? Neither, because the sound is mainly the "eu" and the
> "o" is neutral...
> 
> Solution: none. Because one can not automatically convert the 'o' 'e'
> sequence and because as far as the alphabet is concerned, this is not
> a letter meaning that the "semantics" should be described as "c o e u
> r" the special ligature being a rendering effect.

$ man groff_char | grep oe
       œ       \[oe]  oe           u0153

(I often wish troff had gained universal acceptance.  The world might
have been very different had DEC created the VT-roff instead of the
VT-100.)  

You make an interesting point: the user's locale language affects (or
should affect) how names are looked up, even though it has no effect on
how filenames are encoded.  

How?  Let us suppose, for the sake of argument, that the filename is
the string 'c U+0153 u r', meaning that the Unicode code point for the
oe ligature appears in the filename (in whatever encoding, say UTF-8).
Now the user types "ls coeur".  Should the system list the filename, or
not?  
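
To make the mismatch concrete, here is a small Python sketch.  It is
only an illustration, not anything ls(1) does; the variable names are
mine.  It shows that the stored and typed names differ as UTF-8 byte
strings, and that Unicode compatibility normalization (NFKD) does not
bridge the gap, because U+0153 has no decomposition.

```python
import unicodedata

# Hypothetical names for illustration only.
stored = "c\u0153ur"   # filename containing U+0153 (the oe ligature)
typed = "coeur"        # what the user types after "ls"

# The two names are different byte strings in UTF-8.
stored_bytes = stored.encode("utf-8")   # b'c\xc5\x93ur'
typed_bytes = typed.encode("utf-8")     # b'coeur'
literal_match = stored_bytes == typed_bytes

# NFKD leaves U+0153 untouched, so normalization alone does not
# turn the ligature into the two letters 'o' 'e'.
normalized = unicodedata.normalize("NFKD", stored)
normalized_match = normalized == typed

print(literal_match, normalized_match)
```

So a literal comparison says "no", and so does a normalized one; any
"yes" has to come from language-specific knowledge.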

To make the question more interesting, let us further suppose the
filesystem is marked and mounted as having UTF-8 filenames, but with no
locale information.  

If the user's locale is fr_FR, it seems to me the system should consider
"oe" to be the same as "œ" or U+0153 or (IIUC) "o U+034F e", absent
better information to the contrary.  The user should not be forced to
use a wildcard specification to match semantically equivalent strings.

If the user's locale is "C", the argument provided to ls(1) surely must
be interpreted literally, again absent better information.  Even if the
filesystem is marked and mounted as having UTF-8 filenames, userspace
has no way or reason to interpret either string as French, no basis on
which to match the letters "oe" to the codepoint U+0153.  Unless, that
is, the name is somehow marked as French, or there is some way to tell
ls(1) to treat its argument as a French specification.  
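
The fr_FR and C cases can be sketched as a locale-sensitive comparison.
This is a rough sketch under my own assumptions: the FOLDS table and the
fold() and names_match() functions are hypothetical, not part of any
existing library or of ls(1), and a real implementation would need far
more complete per-language data.

```python
# Hypothetical per-language folding table: maps a code point to the
# letter sequence a user of that language would type for it.
FOLDS = {
    "fr": {"\u0153": "oe", "\u0152": "OE"},
}

def fold(name, lang):
    """Fold a name using the (hypothetical) rules for one language."""
    table = FOLDS.get(lang)
    if not table:
        # No language knowledge (e.g. the "C" locale): leave as is.
        return name
    # Treat "o U+034F e" like plain "oe" by dropping the joiner.
    out = name.replace("\u034f", "")
    for code_point, letters in table.items():
        out = out.replace(code_point, letters)
    return out

def names_match(typed, stored, locale_name):
    """Compare a typed name with a stored one under the user's locale."""
    # "fr_FR.UTF-8" -> "fr"; "C" or "C.UTF-8" -> "C" (no table).
    lang = locale_name.split("_")[0].split(".")[0]
    return fold(typed, lang) == fold(stored, lang)

french = names_match("coeur", "c\u0153ur", "fr_FR")   # folded compare
literal = names_match("coeur", "c\u0153ur", "C")      # literal compare
print(french, literal)
```

Under fr_FR the typed "coeur" matches the stored name; under "C" the
comparison stays literal and it does not.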

But those are heuristics, guesses about the filename's linguistic
intent.  

If OTOH the filename is known to be the French word for "heart" -- i.e.
we know the filename's *language* -- then ISTM "oe" from the keyboard
should match U+0153 or any semantic equivalent, regardless of the
user's locale.  

I doubt very many people are going to learn to type "ls c?ur".  It would
be easier to teach them to type "ls c\\(oeur"!  Some encoding
interpretation and mediation must reside between the keyboard and the
filesystem.  

--jkl

