tech-userlevel archive

Re: A draft for a multibyte and multi-codepoint C string interface

On Sun, Mar 31, 2013 at 06:28:47PM -0400, James K. Lowden wrote:
> Certainly the Worse is Better school would push the problem out to
> userland and absolve the filesystem.   ISTM filenames are there for the
> user's sake, and filename uniqueness is judged at the semantic level of
> linguistic perception.  Leaving him to fend for himself against
> Unicode's unfortunate complexity is a disservice.  

Not all of Unicode's complexity is gratuitous. I found this out when
thinking about the next step for kerTeX: adding Unicode/UTF-8 support to TeX.

Example: in the West, we use Arabic digits. Even if the individual digits
are "the same" in Western languages and in Arabic, they belong to
distinct sets and so are not identical. For Western languages the digits
are in the ASCII range; for the Arabic language they should not be,
because from the code one can deduce the language and, for example, the
direction of composition. Hence TeX, to take this example, could deduce
the direction of composition from the Unicode range.
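
As a rough illustration of deducing the digit set from the code point
alone, here is a minimal C sketch. The names classify_digit, digit_value
and the enum are hypothetical, chosen for this example only. The ranges
are the Unicode ones for ASCII digits (U+0030..U+0039), Arabic-Indic
digits (U+0660..U+0669) and Extended Arabic-Indic digits
(U+06F0..U+06F9). It classifies a single, already decoded code point; it
does not decode UTF-8 and it is not a bidirectional layout algorithm.

```c
#include <stdint.h>

/* Hypothetical names, for illustration only. */
enum digit_set {
    DIGIT_NONE,             /* not a digit in any range below */
    DIGIT_ASCII,            /* U+0030..U+0039 */
    DIGIT_ARABIC_INDIC,     /* U+0660..U+0669 */
    DIGIT_EXT_ARABIC_INDIC  /* U+06F0..U+06F9 */
};

/* Tell which digit set a decoded code point belongs to. */
static enum digit_set classify_digit(uint32_t cp)
{
    if (cp >= 0x0030 && cp <= 0x0039)
        return DIGIT_ASCII;
    if (cp >= 0x0660 && cp <= 0x0669)
        return DIGIT_ARABIC_INDIC;
    if (cp >= 0x06F0 && cp <= 0x06F9)
        return DIGIT_EXT_ARABIC_INDIC;
    return DIGIT_NONE;
}

/* Numeric value 0..9 of the digit, or -1 if cp is not a known digit. */
static int digit_value(uint32_t cp)
{
    switch (classify_digit(cp)) {
    case DIGIT_ASCII:
        return (int)(cp - 0x0030);
    case DIGIT_ARABIC_INDIC:
        return (int)(cp - 0x0660);
    case DIGIT_EXT_ARABIC_INDIC:
        return (int)(cp - 0x06F0);
    default:
        return -1;
    }
}
```

With this, classify_digit(0x0033) and classify_digit(0x0663) report
different sets, although digit_value() gives 3 for both: same digit for
the reader, different codes for the program.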

There is a simple solution, the one developed by Ken Thompson et al.
at Bell Labs: UTF-8. As far as the system is concerned,
filenames should be octet strings (UTF-8) and the same filename
is the exact same string. No semantics at all. (I simply hate
filesystems that are case insensitive, and I simply don't want the
disease to spread any further. Two different codepoints are two different
characters. Would you also want to take into account the font they are
rendered with? After all, the same codepoint can look very different in
two different fonts...)
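
A minimal sketch of what "the same filename is the exact same string"
means in C. The helper same_filename and the two sample names are
hypothetical, made up for this example; the byte sequences are the
UTF-8 encodings of U+00E9 (precomposed e-acute) and of "e" followed by
U+0301 (combining acute accent).

```c
#include <string.h>

/*
 * Hypothetical helper: two names designate the same file name only
 * if they are byte-for-byte identical. No case folding, no Unicode
 * normalization, no other semantics.
 */
static int same_filename(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* "caf" + U+00E9, encoded in UTF-8 as C3 A9 (5 octets in total). */
static const char name_precomposed[] = "caf\xC3\xA9";

/* "cafe" + U+0301, encoded in UTF-8 as CC 81 (6 octets in total). */
static const char name_decomposed[] = "cafe\xCC\x81";
```

Under this rule same_filename(name_precomposed, name_decomposed) is
false even though both may be rendered identically, and "Makefile" and
"makefile" are likewise two distinct names. Any notion of equivalence
beyond the octets would be a matter for userland, not for the system.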

        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C
