tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface

On Wed, Apr 17, 2013 at 01:31:43PM +0200, Steffen Daode Nurpmeso wrote:
> But Unicode collation is somewhat locale-specific, in permanent
> transition and very complicated.

This is one more reason not to deal with this in the kernel.
(There are other, fundamental, reasons.)

> Plan9 simply doesn't know about
> locales at all (?) and gives a s..t about any standards. 
UTF-8 is a standard, and it comes from Plan9 ;) Furthermore, Plan9 speaks
POSIX through the APE environment. The very same holds for Windows:
Interix speaks POSIX as an environment too. I added Interix support to
my RISK framework for kerTeX, and it was just a matter of defining a
pair of variables. The only systems I know that do not speak "standard"
are some systems that claim, recursively, that they are not Unix, and on
which one does not even find ed(1) or sed(1) by default, but, say,
emacs(1) and perl(1). Yet ed(1) is the only mandatory editor for
administration, because it is _line_ oriented and not 2D; curses
is 2D, so vi(1) and emacs(1) are not command line editors...

> who can be creative and invent new things with
> something like POSIX on their back..

This does not mean ignoring all of POSIX. (APE and Interix are examples
of POSIX on non-POSIX systems; and the lesson is that, generally, software
is not POSIX compliant despite being developed and used on POSIX
systems, because that alone does not mean knowing POSIX and sticking to
it.)

But when it comes to i18n, the locales are simply a half-baked
solution, and nobody deals with them correctly; this probably means that
the solution is not the right one. When NetBSD "improved" by adding
support for numeric localization, suddenly my programs were unable to
recover binary files from their backup ASCII representation,
because the numbers were written and read with C conventions (with a dot
as the decimal separator) and, if the user had another locale (say
French), the *printf() and *scanf() routines were now dealing with
localization and were expecting a comma...  even when reading a file,
and not user input! (This is something UTF-8 without localization
cannot address; but I claim it works because nobody really expects that
numbers read and written from C are anything other than C numbers, and
I claim that Unicode should have a different codepoint for a _numeric_
comma, precisely to make the distinction if such support should be
added.  And I think Unicode does not...)
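
To make the failure mode concrete, here is a minimal sketch of one way
to keep a file format in C numbers whatever the user's locale is. The
helper names (numeric_push_c, c_format_double, c_parse_double) are
hypothetical and are not taken from the programs mentioned above; the
sketch assumes single-threaded code, since setlocale() changes
process-wide state.

```c
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: remember the current LC_NUMERIC setting in
 * `saved`, then switch the numeric conventions to "C".
 * Returns 0 on success, -1 on failure.
 */
static int
numeric_push_c(char *saved, size_t len)
{
	const char *cur = setlocale(LC_NUMERIC, NULL);

	if (cur == NULL || strlen(cur) >= len)
		return -1;
	strcpy(saved, cur);	/* the returned string may be overwritten later */
	return setlocale(LC_NUMERIC, "C") != NULL ? 0 : -1;
}

/* Write v into buf with a dot as separator; returns snprintf()'s result. */
static int
c_format_double(char *buf, size_t len, double v)
{
	char saved[256];
	int n;

	if (numeric_push_c(saved, sizeof(saved)) != 0)
		return -1;
	/* 17 significant digits are enough to round-trip an IEEE 754 double */
	n = snprintf(buf, len, "%.17g", v);
	setlocale(LC_NUMERIC, saved);	/* restore the user's conventions */
	return n;
}

/* Parse a number written with a dot; returns 0 on success, -1 otherwise. */
static int
c_parse_double(const char *s, double *out)
{
	char saved[256];
	char *end;

	if (numeric_push_c(saved, sizeof(saved)) != 0)
		return -1;
	*out = strtod(s, &end);
	setlocale(LC_NUMERIC, saved);	/* restore the user's conventions */
	return end != s ? 0 : -1;
}
```

With these, a backup written under one locale can be read back under
another: the file always holds "3.5", never "3,5". Where it is
available, a per-thread locale (POSIX uselocale()) would avoid touching
the global state, at the price of some portability.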

>   |Unicode is not something fixed, perfect, set one and forever. It
> Well, certainly not forever, but certainly for the next decades.

Hum... Plan9 had wyde runes (16 bits). Now there must be 20 or 21
bits... and the inflation goes on. When locales were separate, each
dealing with its own needs, a stabilized encoding could ignore the others
restlessly moving. When everything is put in the same encoding,
there will always be someone crying that this part of the "standard" is
not up to date, and so on.

Hence the mandatory agnosticism of systems, about which we agree. Let
there be a means for almost everything, now and later, compatible with
the system language, but let user level and users deal with it. And let
a firewall protect the kernel from i18n!

        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C
