Re: A draft for a multibyte and multi-codepoint C string interface
|On Wed, Apr 17, 2013 at 01:31:43PM +0200, Steffen Daode Nurpmeso wrote:
|> But Unicode collation is somewhat locale-specific, in permanent
|> transition and very complicated.
|This is one supplementary reason to not deal with this in the kernel.
|(There are other fundamental reasons.)
|> Plan9 simply doesn't know about
|> locales at all (?) and gives a s..t about any standards.
|UTF-8 is a standard, and comes from Plan9 ;) Furthermore, Plan9 speaks
|POSIX with the APE environment (the very same appears with Windows:
..you like it?
|Interix speaks POSIX as an environment too---I added Interix support in
|my RISK framework for kerTeX, and it was just a matter of defining a
..i lost contact with Windows after 95B.. I have no idea, i only knew
about Cygwin. „Windows Services“ would be the better way 'round.
|pair of variables. The only systems I know that do not speak "standard"
|are some systems that claim, recursively, that they are not Unix, and on
|which one does not even find ed(1) or sed(1) by default, but say:
|emacs(1) and perl(1)---ed(1), the only mandatory editor for
|administration because it is _line_ oriented and not 2D---curses
|is 2D; so vi(1) and emacs(1) are not command line editors...)
too hardcore for me. (I never forget reading in an errno
definition list info page, Linux, must be year ~2000: „EED - the
experienced user knows what is wrong“ -- but at least i do know
what ed(1) is, today.)
You do need a mouse for Acme - that is 2x 2D, then.
(I'm waiting for the mandatory brain transplant, that will also
allow anything-by-thinking as a bonus for the number, then.
Until then i'll type it, but visually. Sorry :)
|> who can be creative and invent new things with
|> something like POSIX on their back..
|This does not mean ignoring all of POSIX (APE and Interix are examples
|of POSIX on non-POSIX systems; and the lesson is that generally, software
|is not POSIX compliant, despite being developed and used on POSIX
|systems, because this does not mean knowing POSIX and sticking to
Well, it seems many of today's Unix people were already around
before POSIX came in sight.
I personally have only one interest in respect to standards, and
that is portable usability, because that *would* allow me to focus
on what i want to do.
|But when it comes to i18n, the locales are simply a half-baked
|solution, and nobody deals with them correctly---this probably means
|that the solution is not the right one. When NetBSD "improved" by adding
I think most people just call „setlocale(LC_ALL..)“ and assume the
rest is automatic. But i'm an unfriendly person.
|support for numeric localization, suddenly my programs were unable to
|recover from the backup ASCII representation of binary files,
|because the numbers were written and read in C (with a dot as a
|separator) and if the user had another locale (say French), the *printf()
|and *scanf() routines were now dealing with localization and were expecting
|a comma... even when reading a file, and not user input! (This is
|something UTF-8 without localization can not address; but I claim
|it works because nobody really expects that numbers read and written
|from C are something else than C numbers, and I claim that Unicode
|should have a different codepoint for a _numeric_ comma, precisely
|to make the distinction if such a support should be added. And I think
|Unicode does not...
I don't know what you mean by that.
Unicode offers a lot of properties and special characters, since
it also targets typesetting, methinks.
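
As for the backup-file problem quoted above: a program can pin the
number format itself instead of relying on whatever LC_NUMERIC the
user has. What follows is only a minimal sketch, assuming a
POSIX.1-2008 system that provides newlocale(3) and uselocale(3) via
<locale.h>; the helper name parse_c_double() is hypothetical and not
from any library.

```c
#define _POSIX_C_SOURCE 200809L   /* for newlocale()/uselocale() */
#include <assert.h>
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: parse a number that was written in C format
 * (dot as separator), whatever LC_NUMERIC the user has selected. */
double parse_c_double(const char *s, char **end)
{
    assert(s != NULL);

    /* Assumption: the "C" locale can always be instantiated. */
    locale_t c_loc = newlocale(LC_NUMERIC_MASK, "C", (locale_t)0);
    if (c_loc == (locale_t)0)
        return strtod(s, end);        /* fallback: current locale */

    locale_t old = uselocale(c_loc);  /* switches this thread only */
    double v = strtod(s, end);
    uselocale(old);                   /* restore the previous locale */
    freelocale(c_loc);
    return v;
}
```

With this, "3.25" read back from a file parses the same way under a
French locale as under "C"; user-facing input can still go through
plain strtod(3) and follow the locale.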
But a few weeks ago there was an interesting thread on a Unicode
list (start: ) about how integers (numbers) should or could be
parsed, since Unicode offers so many combinations via combining:
European digits (U+0030 to U+0039) may, since Unicode 6.1.0, be
used with variation selectors. As their primary purpose is for use
with U+20E3 COMBINING ENCLOSING KEYCAP, is it legitimate to fail
to recognise strings of digits with variation selectors as
This is a can of worms.
I don't think that a generic string interface should address that.
It could maybe offer some „put_any_effort_in_parse_“strtol()
thing, but which must be called explicitly.
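
Such an explicitly called function could look roughly like the
following. This is a sketch only: the name parse_digits_lenient() is
hypothetical, the input is assumed to be already decoded into code
points, and skipping just U+FE0E / U+FE0F after a digit is one
possible policy, not something the Unicode thread settled on.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: parse European digits (U+0030..U+0039), each
 * optionally followed by variation selectors U+FE0E / U+FE0F.
 * Returns 0 and stores the value in *out, or -1 on empty input,
 * on any other code point, or on overflow. */
int parse_digits_lenient(const uint32_t *cp, size_t n, long *out)
{
    assert(out != NULL);

    long v = 0;
    size_t i = 0;

    if (n == 0)
        return -1;

    while (i < n) {
        uint32_t c = cp[i];
        if (c < 0x30 || c > 0x39)
            return -1;               /* not a digit: give up */

        int d = (int)(c - 0x30);
        if (v > (LONG_MAX - d) / 10)
            return -1;               /* v * 10 + d would overflow */
        v = v * 10 + d;
        i++;

        /* Ignore selectors that directly follow the digit. */
        while (i < n && (cp[i] == 0xFE0E || cp[i] == 0xFE0F))
            i++;
    }

    *out = v;
    return 0;
}
```

A sequence like U+0031 U+FE0F U+0032 then yields 12, whereas a keycap
sequence (digit, selector, U+20E3) is rejected, since U+20E3 is not a
digit. The plain strtol(3) path stays untouched and strict.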
But i have learned one thing from real experience, and that is
„what you lose on a low layer can never be fixed in a higher one“.
So an extended utility library would have to completely reinvent
the wheel to implement the mentioned function above unless the
lower library offers comprehensive access. E.g., take iconv(3);
it is rather clear that iconv(3) uses a UCS4 / UTF-32
intermediate layer, but there is no way to access it. You'd have
to use two iconv(3) objects, one for source->UCS4, and one for
UCS4->target, and pray that is possible when one asks for UCS4
explicitly. That is just bad design.
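
To make the two-object workaround concrete, here is a minimal sketch.
It assumes the local iconv(3) knows the encoding name "UCS-4LE"
(glibc and GNU libiconv do; other implementations may want
"UTF-32LE"), and the function names decode_to_ucs4() and
encode_from_ucs4() are hypothetical.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>

#define UCS4_NAME "UCS-4LE"   /* assumption: known to the local iconv */

/* Step 1: source encoding -> code points. Returns the number of code
 * points stored in cp[], or (size_t)-1 on any error. */
size_t decode_to_ucs4(const char *from, const char *src, size_t srclen,
                      uint32_t *cp, size_t cpmax)
{
    assert(src != NULL && cp != NULL);
    iconv_t cd = iconv_open(UCS4_NAME, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *in = (char *)src;   /* iconv() wants char **, input is not changed */
    char *out = (char *)cp;
    size_t inleft = srclen, outleft = cpmax * sizeof *cp;
    size_t rc = iconv(cd, &in, &inleft, &out, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;

    /* Turn the little-endian bytes into native uint32_t values. */
    size_t n = (cpmax * sizeof *cp - outleft) / sizeof *cp;
    for (size_t i = 0; i < n; i++) {
        const unsigned char *b = (const unsigned char *)&cp[i];
        uint32_t v = (uint32_t)b[0] | (uint32_t)b[1] << 8 |
                     (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;
        cp[i] = v;
    }
    return n;
}

/* Step 2: code points -> target encoding. Returns the number of bytes
 * written to dst[], or (size_t)-1 on any error. */
size_t encode_from_ucs4(const char *to, const uint32_t *cp, size_t n,
                        char *dst, size_t dstmax)
{
    assert(cp != NULL && dst != NULL);
    iconv_t cd = iconv_open(to, UCS4_NAME);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *out = dst;
    size_t outleft = dstmax;
    for (size_t i = 0; i < n; i++) {
        unsigned char le[4] = {
            (unsigned char)(cp[i] & 0xFF),
            (unsigned char)(cp[i] >> 8 & 0xFF),
            (unsigned char)(cp[i] >> 16 & 0xFF),
            (unsigned char)(cp[i] >> 24 & 0xFF)
        };
        char *in = (char *)le;
        size_t inleft = sizeof le;
        if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1) {
            iconv_close(cd);
            return (size_t)-1;
        }
    }
    /* Flush any shift state of a stateful target encoding. */
    size_t rc = iconv(cd, NULL, NULL, &out, &outleft);
    iconv_close(cd);
    return rc == (size_t)-1 ? (size_t)-1 : dstmax - outleft;
}
```

Between the two calls the caller finally has the code points in hand
and can inspect or rewrite them, but at the price of two descriptors,
an extra buffer and a guess at the intermediate encoding name, which
is exactly the complaint.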
|>|Unicode is not something fixed, perfect, set one and forever. It
|> Well, certainly not forever, but certainly for the next decades.
|Hum... Plan9 had wyde runes (16 bits). Now there must be 20 or 21
|bits... and the inflation goes on. When locales were separated, each
The 16 bit restriction was a conscious decision, right? ISO 10646
was 31 bit from the very start. (But every carefully crafted
program should „simply“ scale due to the rune-max constant. But
bytes are 8 bit fixed.)