tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface



tlaronde%polynum.com@localhost wrote:
 |Not all of Unicode's complexity is gratuitous. I found this out when
 |thinking about the next step for kerTeX: adding Unicode/UTF-8 to TeX.
 |
 |Example: in the West, we use Arabic digits. Even if the "individual"
 |digits are the same in Western languages and in Arabic, they are not
 |identical once they are considered part of a particular set. For
 |Western languages the digits are in the ASCII range; for the Arabic
 |language they should not be. Because, based on the code, one can deduce the
 |language, and for example the direction of composition. Hence, TeX---to
 |take this example---could deduce the direction of composition from the
 |Unicode range.

It's even worse, since some languages use different numeral
systems (like base 20), don't know about the value 0, and/or have
special symbols/characters for several important numbers, like
"1000" etc.  Of course a digittoi() cannot handle these cases (and
afaik Unicode didn't put any effort into this; a digit value is only
defined if a direct mapping is possible).
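
To make the limitation concrete, here is a minimal sketch of what such
a digittoi() could look like.  The name, the signature and the tiny
table are assumptions made for illustration only, not the interface
from the draft; a real implementation would take its ranges from the
Unicode character database.  It only knows scripts whose ten digits
map directly onto 0..9 and gives up on everything else.

```c
#include <stddef.h>
#include <stdint.h>

/* First code point of a few Unicode decimal-digit runs; each run holds
 * the ten consecutive digits 0..9.  Deliberately incomplete. */
static const uint32_t digit_zero[] = {
    0x0030, /* DIGIT ZERO (ASCII) */
    0x0660, /* ARABIC-INDIC DIGIT ZERO */
    0x06F0, /* EXTENDED ARABIC-INDIC DIGIT ZERO */
    0x0966, /* DEVANAGARI DIGIT ZERO */
};

/* Hypothetical digittoi(): returns the value 0..9 of a decimal digit
 * code point, or -1 if there is no direct mapping. */
int digittoi(uint32_t cp)
{
    size_t i;

    for (i = 0; i < sizeof digit_zero / sizeof digit_zero[0]; ++i)
        if (cp >= digit_zero[i] && cp - digit_zero[i] < 10)
            return (int)(cp - digit_zero[i]);
    return -1; /* e.g. U+2180, a Roman numeral for 1000 */
}
```

With this, digittoi(0x0663) (ARABIC-INDIC DIGIT THREE) yields 3, just
like digittoi('3'), whereas a single character that stands for "1000"
yields -1, because no value in 0..9 can express it.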

So, for this, some locale-dependent pre/postprocessing parser is or
would be necessary -- neither do i know of any implementation that
really does this, nor does the current POSIX / C environment offer
a way to implement such pre/postprocessors.  But i also wouldn't
really worry about that, since the Inuit and the Indians and the
like have brand-new writing systems that they didn't invent on
their own and which use a LATIN-ish notation, other languages
are dead and buried, and the rest doesn't matter either -- at least
for the computer programs we talk about.

--steffen

