tech-userlevel archive

Re: A draft for a multibyte and multi-codepoint C string interface

> Also "cooperation" used to have a 'double dot' over the second 'o' to
> indicate that there are two short 'o' sounds.

Yeah, that's one of the few cases where Unicode's name for it, LATIN
SMALL LETTER O WITH DIAERESIS, is actually accurate.  (In some languages
it's an umlaut, not a diaeresis; in others, it's a separate letter, not
a modified o in any sense except typographically.)

Come to think of it, that's another issue with Unicode, for some
purposes: it not only provides multiple ways to represent some things,
it conflates semantically distinct but typographically identical things
(like `o modified by adding a diaeresis', `o modified by adding an
umlaut', and `distinct letter graphically identical to either of the
foregoing two').  It's a confused mess that sometimes appears to be
designed for typography, drawing typographically significant but
semantically irrelevant distinctions (such as having a separate
codepoint for the fi ligature) and sometimes appears to be designed to
draw semantically important but graphically irrelevant distinctions
(such as having different codepoints for LATIN CAPITAL LETTER A and
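
To make the "multiple ways to represent some things" and the fi ligature
points concrete, here is a minimal sketch using Python's standard
unicodedata module.  The variable names are made up for illustration;
only unicodedata.name and unicodedata.normalize are real library calls.

```python
import unicodedata

# Two encodings of the same visible character.
precomposed = "\u00f6"   # single codepoint
combining = "o\u0308"    # 'o' followed by COMBINING DIAERESIS

# Unicode names the precomposed form after the diaeresis, whatever
# the mark means in the language at hand.
name = unicodedata.name(precomposed)

# The raw strings differ, but canonical normalization (NFC) makes
# them compare equal.
raw_equal = precomposed == combining
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", combining))

# The fi ligature is a separate codepoint.  NFC leaves it alone;
# only compatibility normalization (NFKC) folds it to plain "fi".
ligature = "\ufb01"
lig_nfc = unicodedata.normalize("NFC", ligature)
lig_nfkc = unicodedata.normalize("NFKC", ligature)

print(name, raw_equal, nfc_equal, lig_nfc == ligature, lig_nfkc)
```

Note that nothing here can tell a diaeresis from an umlaut: both map to
the same codepoints, which is the conflation complained about above.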

And then there are cases where it's not possible to know how a glyph
(and/or codepoint, e.g., 0xe6 in 8859-1 or U+00E6 in Unicode) should be
handled without knowing the language in question.  `æ', to continue
that example, is just a typographical frill in English, somewhat akin
to tlaronde's description of oe in French (`encyclopædia' and
`encyclopaedia' are linguistically the same thing) but a distinct
letter, with its own position in the alphabet and everything, in Danish
or Norwegian.
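
A toy sketch of that language dependence, in Python.  The two key
functions and the word list are hypothetical, written only for this
illustration; real collation would go through locale data or a library
such as ICU rather than a hand-written alphabet string.

```python
# Danish/Norwegian alphabet: æ, ø, å are distinct letters after z.
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
DANISH_RANK = {c: i for i, c in enumerate(DANISH_ALPHABET)}

def english_key(word):
    # Assumption: in English, æ is just a typographic variant of "ae".
    return word.lower().replace("æ", "ae")

def danish_key(word):
    # Assumption: rank each letter by its alphabet position; anything
    # outside the alphabet sorts after all letters, by codepoint.
    return [DANISH_RANK.get(c, len(DANISH_RANK) + ord(c))
            for c in word.lower()]

words = ["zebra", "æble", "apple"]

english_order = sorted(words, key=english_key)
danish_order = sorted(words, key=danish_key)

# The two English spellings collapse to the same key.
same_word = english_key("encyclopædia") == english_key("encyclopaedia")

print(english_order)   # ['æble', 'apple', 'zebra']
print(danish_order)    # ['apple', 'zebra', 'æble']
print(same_word)       # True
```

The same codepoint ends up at opposite ends of the list depending on
which language's rules are applied, and nothing in the string itself
says which one is wanted.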

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML      
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
