tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface



> But when it comes to i18n, the locales are simply a half-baked
> solution, and nobody deal with them correctly---this probably means
> that the solution is not the good one.

Locales as they are currently used are not a good solution to much of
anything.  They are a tool, which can be used to build solutions to
those problems, but so far hasn't been.

> When NetBSD "improved" by adding support for numeric localization,
> suddenly my programs were unable to recover from the backup ASCII
> representation of binary files, because the numbers were written and
> read in C (with a dot as a separator) and if the user had another
> locale (say french), *printf() routines were now dealing with
> localization and were expecting a comma...

This is an example of what I mean.  Correct use of the `locale' tool is
to use the user's locale when interacting with the user, but not when
doing software-to-software interfaces (such as saved-data files).  Yes,
this means switching locales depending on what you're doing at the
moment; using locales correctly requires that.  Simply applying the
user's locale to all I/O is right only when the only text-formatted I/O
you do is user interaction.
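
As a rough sketch of that kind of switching, assuming a system with the
POSIX.1-2008 per-thread locale calls (newlocale(), uselocale(),
freelocale()) declared in <locale.h>: the helpers below force the "C"
locale around the formatting and parsing used for data files, leaving
whatever locale the program uses for talking to the user untouched.  The
names format_double_portable and parse_double_portable are made up for
this illustration and are not part of any library.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: format v into buf using the "C" locale, whatever
 * the calling thread's current locale is.  Returns the snprintf() result,
 * or -1 if the "C" locale object could not be created. */
int format_double_portable(char *buf, size_t size, double v)
{
    assert(buf != NULL && size > 0);
    locale_t c_loc = newlocale(LC_ALL_MASK, "C", (locale_t)0);
    if (c_loc == (locale_t)0)
        return -1;
    locale_t old = uselocale(c_loc);      /* affects this thread only */
    int n = snprintf(buf, size, "%.17g", v);
    uselocale(old);                       /* back to the previous locale */
    freelocale(c_loc);
    return n;
}

/* Hypothetical helper: parse a number written by the function above,
 * again in the "C" locale.  Returns 0 on success, -1 on failure. */
int parse_double_portable(const char *s, double *out)
{
    assert(s != NULL && out != NULL);
    locale_t c_loc = newlocale(LC_ALL_MASK, "C", (locale_t)0);
    if (c_loc == (locale_t)0)
        return -1;
    locale_t old = uselocale(c_loc);
    char *end;
    double v = strtod(s, &end);
    uselocale(old);
    freelocale(c_loc);
    if (end == s)
        return -1;                        /* nothing was parsed */
    *out = v;
    return 0;
}
```

A program would call setlocale(LC_ALL, "") once for its user-facing
output and route every saved-data read and write through helpers like
these, so a file written under one user's locale stays readable under
another's.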

This is how locales were done wrong: they were silently added to all
I/O done by locale-unaware programs, leading to trouble such as yours,
rather than being provided as a tool that doesn't affect programs
unless they specifically use it.  While I know of no research on the
question, I suspect that the damage done by imposing I/O mangling on
unsuspecting programs outweighs the benefit to user-interaction-only
programs.  If nothing else, users are - usually! - more capable of
adapting to small errors than software is.

> [...], and I claim that Unicode should have a different codepoint for
> a _numeric_ comma, [...]

I'm not sure.  Not every locale uses commas for that, so either you're
actually inventing a radix grouping codepoint whose rendering depends
on locale (comma, dot, space, whatever), or you're drawing a
distinction between punctuation comma and numeric comma that, while
possibly useful for locales that use commas in numbers, doesn't have
much to do with other locales.  (And "numeric comma" doesn't even have
clear semantics attached to it; in some locales, it's a thousands
separator, while in others, it marks the boundary between integer part
and fractional part.  Yet others don't use it at all; I've seen space
used for thousands separators and dot for radix point sometimes.  And
then there are locales that group digits in chunks of four rather than
three, that is, which want a ten-thousands separator rather than a
thousands separator.)
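
To make the point concrete, here is a small sketch in which digit
grouping is described by two parameters, a separator string and a group
size, rather than by any one special character.  The function
group_digits is invented for this illustration; real locales expose
comparable information through the decimal_point, thousands_sep and
grouping members of the struct returned by localeconv(), and the
grouping member can describe more than a single fixed group size, which
this sketch does not attempt.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the digit string into out, inserting sep
 * between groups of group_size digits counted from the right.
 * Returns 0 on success, -1 if out is too small. */
int group_digits(const char *digits, size_t group_size, const char *sep,
                 char *out, size_t out_size)
{
    assert(digits != NULL && sep != NULL && out != NULL && group_size > 0);
    size_t len = strlen(digits);
    size_t sep_len = strlen(sep);
    size_t pos = 0;

    for (size_t i = 0; i < len; i++) {
        size_t remaining = len - i;   /* digits left, including this one */
        if (i > 0 && remaining % group_size == 0) {
            if (pos + sep_len >= out_size)
                return -1;
            memcpy(out + pos, sep, sep_len);
            pos += sep_len;
        }
        if (pos + 1 >= out_size)      /* keep room for the final NUL */
            return -1;
        out[pos++] = digits[i];
    }
    if (pos >= out_size)
        return -1;
    out[pos] = '\0';
    return 0;
}
```

With this, group_digits("1234567", 3, ",", ...) and
group_digits("1234567", 4, " ", ...) are both "grouped numbers", yet
they share neither the separator nor the group size, which is why a
single "numeric comma" codepoint would not capture the distinction.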

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse%rodents-montreal.org@localhost
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

