tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface

On Sat, Apr 13, 2013 at 07:58:19PM -0400, James K. Lowden wrote:
> The user should not be forced to use a wildcard specification to 
> match semantically equivalent strings.

No, the policy shall be that filenames are encoded as strings of the
real graphical atoms of a language ("letters" or ideograms or whatever)
without any _rendering_ effect. "oe" is not a letter, but typographic
sugar adding etymology. The same goes for "ffi" and "ffl", which are
typographic sugar (purely visual) and which, furthermore, do
not exist in every font!  There is no ligature in the Computer
Modern fixed size fonts, so entering the codepoint for the ligature
"ffi" will only do for one font, and not for another; while keeping
'f', 'f', 'i' in the source allows a different rendering depending
on the font without losing information (because the font is organized
by its designer to give the best visual result; the user shall not
enforce something that the font designer did not want).

The problem with Unicode is that it is a numerical, computer-oriented
encoding, which encodes not only the grapheme atoms of a language but
also typographic signs ("oe", "ffi", "ffl" as ligatures), controls, etc.

The policy should be that a filename is a means to humanly match a
software resource _in a heterogeneous network_, in a unique, conventional,
user-independent way, and that pretty printing of a directory
listing is not the problem of the system. (And the default "filter"
should be that when a user enters a name with typographic sugar
in it, a dismissal letter for him is automatically sent to the
printer, explaining that an organization would feel like an egoist
to keep such a creative guy for itself, depriving humanity.)

The system stores or loads strings of octets (bytes). The best
solution is to consider these strings as UTF-8, but (and this is
only a recommendation: the system provides rope) the user shall
enter filenames (for storing) with the canonical graphical
representation of the name (letters, not typographic rendering),
and the system treats two names as equal when strcmp(a, b) == 0. UTF-8 is
to get rid of localization! Not to introduce typographical nightmares
worse than POSIX localizations!
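
To make the byte-level rule concrete, here is a minimal sketch in C. The
helper name same_filename() and the example name "office" are made up
for illustration; the only thing taken from the text above is that names
are compared with strcmp(). The ligature is written as explicit UTF-8
escapes (U+FB03, bytes EF AC 83) so the example does not depend on the
encoding of the source file.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: under the policy described above, two filenames
 * name the same resource only when their octets are identical. */
static bool same_filename(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* "office" entered as plain letters: 'o' 'f' 'f' 'i' 'c' 'e'. */
static const char plain_ffi[] = "office";

/* The same word entered with U+FB03 LATIN SMALL LIGATURE FFI.
 * The literal is split so that "ce" is not parsed as part of the
 * preceding hexadecimal escape. */
static const char ligature_ffi[] = "o\xEF\xAC\x83" "ce";
```

With these definitions, same_filename(plain_ffi, ligature_ffi) is false:
the two strings may look alike in some fonts, but they are different
octet strings and therefore different names. No wildcard or equivalence
table is consulted; whoever types the ligature has simply created
another file.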

Please believe me: I have spent enough time with the TeX system, due to
kerTeX, to have some doubts about how things have to be done... but to
not have doubts anymore about what shall _not_ be done!

The solution is perfect not when there is nothing more to add, but when
there is nothing more to remove: localization: out; typography: out (of
the system calls).

        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C
