tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface



On Sun, Apr 07, 2013 at 03:55:32PM -0400, Mouse wrote:
> 
> > [...].  And I hate case insensitive filesystems, so I don't want an
> > enforcement of some cryptic policy deciding that two distinct strings
> > passed are, indeed, the same thing: this is a user decision, not a
> > system one.
> 
> You and I are in furious agreement here.  But people - as I said,
> mostly jkl I think - have been arguing that this is indeed something
> that belongs in the kernel.

So, the following example is not for you, since we mostly agree, at
least, about what does not belong at the system call level.

This is one thing I stumbled upon when thinking about extending TeX (for
kerTeX) to be able to digest the Unicode range (this will still be 8
bits, since it will be UTF-8, but that is not the subject).

In French, some words like "coeur" are typographically rendered with
"oe" as a special glyph (a ligature of 'o' and 'e'). 'o' and 'e' are in
the ASCII range; 'oe' (as a single glyph) is not.

Question: can "oe" be made a ligature in the font, that is, as soon as
the 'o' 'e' sequence is found, it is automatically replaced by the 'oe'
glyph? No. Why? Because there are words in French where the "o" "e"
sequence does not yield the 'oe' rendering ("coexistence" for example,
and a lot of the "co" prefixed words).

Question: is 'oe' a letter of the alphabet? No again. In French, the
alphabet has only 26 letters, and this is not one of them. Is it a
phonetic sign? Not that either, because the sound is mainly the "eu" and
the "o" is neutral...

Solution: none. One cannot automatically convert the 'o' 'e' sequence,
and as far as the alphabet is concerned this is not a letter, meaning
that the "semantics" should be described as "c o e u r", the special
ligature being a rendering effect.

And to "convert" automatically, one would have to put a dictionary in
the system, at the kernel level, because no automatic handling of "o"
"e" can be done without looking at the whole word, since "it depends".
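
To make the point concrete, here is a minimal sketch in C of the two
approaches. Everything in it is an assumption made for illustration: the
function names, the tiny word list, and the choice to work on one
lowercase ASCII word at a time. A real list would be much longer and
specific to one language.

```c
#include <assert.h>
#include <string.h>

/* UTF-8 encoding of U+0153 LATIN SMALL LIGATURE OE. */
#define OE_LIGATURE "\xC5\x93"

/* Hypothetical, very incomplete list of French words that take the
 * ligature. This is the "dictionary" the blind rule lacks. */
static const char *const ligature_words[] = {
    "coeur", "soeur", "oeuvre", "oeuf", "boeuf", "noeud", "voeu", "oeil"
};

/* Blind rule: replace every "oe" in src by the ligature.
 * Returns 0 on success, -1 if dst (dstsz bytes) is too small. */
static int ligate_naive(const char *src, char *dst, size_t dstsz)
{
    size_t used = 0;

    assert(src != NULL && dst != NULL);
    if (dstsz == 0)
        return -1;
    while (*src != '\0') {
        if (src[0] == 'o' && src[1] == 'e') {
            if (used + 2 >= dstsz)      /* keep room for the NUL */
                return -1;
            memcpy(dst + used, OE_LIGATURE, 2);
            used += 2;
            src += 2;
        } else {
            if (used + 1 >= dstsz)
                return -1;
            dst[used++] = *src++;
        }
    }
    dst[used] = '\0';
    return 0;
}

/* Word-list rule: apply the ligature only to whole words found in the
 * list; any other word is copied unchanged. */
static int ligate_with_wordlist(const char *word, char *dst, size_t dstsz)
{
    size_t i;

    assert(word != NULL && dst != NULL);
    for (i = 0; i < sizeof ligature_words / sizeof ligature_words[0]; i++)
        if (strcmp(word, ligature_words[i]) == 0)
            return ligate_naive(word, dst, dstsz);
    if (strlen(word) >= dstsz)
        return -1;
    strcpy(dst, word);
    return 0;
}
```

With these definitions, ligate_naive() turns "coexistence" into
"cœxistence", which is wrong, while ligate_with_wordlist() leaves
"coexistence" alone and still turns "coeur" into "cœur". The price is
the word list itself, which is exactly the kind of data that has no
business at the system call level.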

I think that Unicode also has phonetic signs. Does this mean that the
kernel should have all the dictionaries to convert a phonetic sequence
to "something" trying to match a correctly spelled word in... what
dictionary? After all, the same sounds can be different words in
different languages. Does this mean that the use of Unicode, which
should allow us to get rid of localization, has to be associated with
localizations, and not one localization but a stack of localizations in
decreasing order of precedence, to allow (this is what Unicode is for)
mixing different languages in the same stream of "characters"?

This is a user policy, and even at user level it is a headache. When
there is no single good solution but several equally good ones, it means
the choice is not important: someone has to make one, and that's it.
When "it depends" on the context, it is a user policy and should be left
there.

FWIW
-- 
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C

