tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface



tlaronde%polynum.com@localhost wrote:
 |On Wed, Apr 17, 2013 at 01:31:43PM +0200, Steffen Daode Nurpmeso wrote:
 |>  
 |> But Unicode collation is somewhat locale-specific, in permanent
 |> transition and very complicated.
 |
 |This is one supplementary reason to not deal with this in the kernel.
 |(There are other fundamental reasons.)
 |
 |> Plan9 simply doesn't know about
 |> locales at all (?) and gives a s..t about any standards. 
 | 
 |UTF-8 is a standard, and comes from Plan9 ;) Furthermore, Plan9 speaks
 |POSIX with the APE environment (the very same appears with Windows:

..you like it?

 |Interix speaks POSIX as an environment too---I added Interix support in
 |my RISK framework for kerTeX, and it was just a matter of defining a

..lost contact with Windows 95B..  I have no idea, i only knew
about Cygwin.  „Windows Services“ would be the better way 'round.

 |pair of variables. The only systems I know that do not speak "standard"
 |are some systems that claim, recursively, that they are not Unix, and on
 |which one does not even find ed(1) or sed(1) by default, but say:
 |emacs(1) and perl(1)---ed(1), the only mandatory editor for
 |administration because it is _line_ oriented and not 2D---curses
 |is 2D; so vi(1) and emacs(1) are not command line editors...)

Too hardcore for me.  (I'll never forget reading, in the errno
definition list of an info page on Linux, it must have been around
2000: „EED - the experienced user knows what is wrong“ -- but at
least i do know what ed(1) is, today.)
You do need a mouse for Acme - that is 2x 2D, then.
(I'm waiting for the mandatory brain transplant, that will also
allow anything-by-thinking as a bonus for the number, then.
Until then i'll type it, but visually.  Sorry :)

 |> who can be creative and invent new things with
 |> something like POSIX on their back..
 |
 |This does not mean ignore all POSIX (APE and Interix are examples
 |about POSIX on not POSIX; and the lesson is that generally, software
 |is not POSIX compliant, despite being developed and used on POSIX
 |systems, because this does not mean knowing POSIX and sticking to
 |POSIX...).

Well, it seems many of today's Unix people were already around
before POSIX came into sight.
I personally have only one interest in respect to standards, and
that is portable usability, because that *would* allow me to focus
on what i want to do.

 |But when it comes to i18n, the locales are simply a half-baked
 |solution, and nobody deals with them correctly---this probably means that
 |the solution is not the good one. When NetBSD "improved" by adding

I think people mostly call „setlocale(LC_ALL..)“ and assume the
rest is automatic.  But i'm an unfriendly person.

 |support for numeric localization, suddenly my programs were unable to
 |recover from the backup ASCII representation of binary files,
 |because the numbers were written and read in C (with a dot as a
 |separator) and if the user had another locale (say french), *printf()
 |routines were now dealing with localization and were expecting a
 |comma...  even when reading a file, and not user input! (This is
 |something UTF-8 without localization can not address; but I claim
 |it works because nobody really expects that numbers read and written
 |from C are something else than C numbers, and I claim that Unicode
 |should have a different codepoint for a _numeric_ comma, precisely
 |to make the distinction if such a support should be added.  And I think
 |Unicode does not...

I don't know what you mean by that.
Unicode offers a lot of properties and special characters, since
it also targets typesetting, methinks.
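
As for the quoted problem with numbers in data files, here is a
minimal sketch of how a program could keep reading and writing "C
numbers" whatever LC_NUMERIC says.  The helper names format_double_c()
and parse_double_c() are made up for illustration, they are not part
of any library; the sketch only relies on localeconv(3), snprintf(3)
and strtod(3), and it swaps the decimal separator by hand.

```c
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: write v with '.' as the decimal separator,
 * whatever the current LC_NUMERIC is.  Returns the length, or -1. */
static int format_double_c(char *buf, size_t cap, double v)
{
    int n = snprintf(buf, cap, "%.17g", v);
    if (n < 0 || (size_t)n >= cap)
        return -1;
    const char *dp = localeconv()->decimal_point;
    size_t dplen = strlen(dp);
    char *p = (dplen > 0) ? strstr(buf, dp) : NULL;
    if (p != NULL && !(dplen == 1 && *dp == '.')) {
        /* Replace the (possibly multibyte) separator with one '.'. */
        *p = '.';
        memmove(p + 1, p + dplen, strlen(p + dplen) + 1);
    }
    return (int)strlen(buf);
}

/* Hypothetical helper: parse a number that was written with '.',
 * by translating '.' to the locale's separator before strtod(3).
 * Returns 0 on success, -1 on failure. */
static int parse_double_c(const char *s, double *out)
{
    char tmp[128];
    const char *dp = localeconv()->decimal_point;
    size_t dplen = strlen(dp), j = 0;

    for (; *s != '\0'; s++) {
        if (*s == '.') {
            if (j + dplen >= sizeof tmp)
                return -1;
            memcpy(tmp + j, dp, dplen);
            j += dplen;
        } else {
            if (j + 1 >= sizeof tmp)
                return -1;
            tmp[j++] = *s;
        }
    }
    tmp[j] = '\0';

    char *end;
    *out = strtod(tmp, &end);
    return (end != tmp && *end == '\0') ? 0 : -1;
}
```

Calling setlocale(LC_NUMERIC, "C") around the file I/O would be the
shorter route, but that changes process-wide state, which is why the
sketch avoids it.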

But a few weeks ago there was an interesting thread on a Unicode
list (start: [1]) about how integers (numbers) should or could be
parsed, since Unicode offers so many combinations via combining:

  European digits (U+0030 to U+0039) may, since Unicode 6.1.0, be
  used with variation selectors. As their primary purpose is for use
  with U+20E3 COMBINING ENCLOSING KEYCAP, is it legitimate to fail
  to recognise strings of digits with variation selectors as
  representing numbers?

This is a can of worms.
I don't think that a generic string interface should address that.
It could maybe offer some „put_any_effort_in_parse_“strtol()
thing, but which must be called explicitly.
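
To make that a bit more concrete, here is a rough sketch of such an
explicitly-called lenient parser.  It works on an array of code
points, accepts only the European digits U+0030 to U+0039, and skips
variation selectors (U+FE00 to U+FE0F and U+E0100 to U+E01EF) that
follow a digit.  The name lenient_parse_ulong() and the whole policy
are assumptions for illustration only; whether ignoring the selectors
is the right answer is exactly the open question of that thread.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static bool is_variation_selector(uint32_t cp)
{
    return (cp >= 0xFE00 && cp <= 0xFE0F) ||
           (cp >= 0xE0100 && cp <= 0xE01EF);
}

/* Hypothetical: parse an unsigned decimal number from n code points.
 * Returns the number of code points consumed, or 0 if there was no
 * leading digit or the value would overflow. */
static size_t lenient_parse_ulong(const uint32_t *cp, size_t n,
                                  unsigned long *out)
{
    unsigned long v = 0;
    size_t i = 0;
    bool seen_digit = false;

    while (i < n) {
        if (cp[i] >= 0x0030 && cp[i] <= 0x0039) {
            unsigned long d = cp[i] - 0x0030;
            if (v > (ULONG_MAX - d) / 10)
                return 0;               /* overflow */
            v = v * 10 + d;
            seen_digit = true;
        } else if (!(seen_digit && is_variation_selector(cp[i]))) {
            break;                      /* neither digit nor selector */
        }
        i++;
    }
    if (!seen_digit)
        return 0;
    *out = v;
    return i;
}
```

A plain strtol(3)-like function would stop at the first selector;
this one has to be asked for by name, so nobody gets the lenient
behaviour by accident.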

But i have learned one thing from real experience, and that is
„what you lose on a low layer can never be fixed in a higher one“.
So an extended utility library would have to completely reinvent
the wheel to implement the mentioned function above unless the
lower library offers comprehensive access.  E.g., take iconv(3);
it is rather clear that iconv(3) uses a UCS4 / UTF-32
intermediate layer, but there is no way to access it.  You'd have
to use two iconv(3) objects, one for source->UCS4, and one for
UCS4->target, and pray that it's possible if one names UCS4
explicitly.  That is just bad design.
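
For illustration, the two-object workaround could look roughly like
the sketch below.  convert_via_ucs4() and ucs4_native() are made-up
names, and the encoding names "UCS-4LE" / "UCS-4BE" are an
assumption: they are what the iconv(3) implementations i know of
accept, but nothing guarantees them everywhere.  The iconv() call
uses the POSIX prototype (char ** for the input buffer).

```c
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>

/* Pick the UCS-4 variant that matches the host byte order, so the
 * pivot buffer can be read as plain uint32_t code points. */
static const char *ucs4_native(void)
{
    const uint32_t probe = 1;
    return (*(const unsigned char *)&probe == 1) ? "UCS-4LE" : "UCS-4BE";
}

/* Hypothetical: convert src (encoding `from`) to dst (encoding `to`)
 * through a caller-visible UCS-4 pivot.  Returns the number of bytes
 * written to dst, or (size_t)-1 on any failure. */
static size_t convert_via_ucs4(const char *from, const char *to,
                               const char *src, size_t srclen,
                               uint32_t *pivot, size_t pivotcap,
                               size_t *npivot,
                               char *dst, size_t dstcap)
{
    size_t result = (size_t)-1;
    iconv_t in = iconv_open(ucs4_native(), from);
    if (in == (iconv_t)-1)
        return result;
    iconv_t out = iconv_open(to, ucs4_native());
    if (out == (iconv_t)-1) {
        iconv_close(in);
        return result;
    }

    /* Step 1: source -> UCS-4. */
    char *ip = (char *)src;
    size_t il = srclen;
    char *op = (char *)pivot;
    size_t ol = pivotcap * sizeof(*pivot);
    if (iconv(in, &ip, &il, &op, &ol) == (size_t)-1)
        goto done;
    *npivot = pivotcap - ol / sizeof(*pivot);

    /* The code points are now accessible in pivot[0 .. *npivot-1]. */

    /* Step 2: UCS-4 -> target. */
    ip = (char *)pivot;
    il = *npivot * sizeof(*pivot);
    op = dst;
    ol = dstcap;
    if (iconv(out, &ip, &il, &op, &ol) == (size_t)-1)
        goto done;
    result = dstcap - ol;

done:
    iconv_close(in);
    iconv_close(out);
    return result;
}
```

It works, but it needs two descriptors and an extra buffer for
something a single conversion presumably does internally anyway.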

 |>|Unicode is not something fixed, perfect, set one and forever. It
 |> 
 |> Well, certainly not forever, but certainly for the next decades.
 |
 |Hum... Plan9 had wyde runes (16 bits). Now there must be 20 or 21
 |bits... and the inflation goes on. When locales were separated, each

The 16 bit restriction was a conscious decision, right?  ISO 10646
was 31 bit from the very start.  (But every carefully crafted
program should „simply“ scale due to the rune-max constant.  But
bytes are 8 bit fixed.)

[1] <http://www.unicode.org/mail-arch/unicode-ml/y2013-m03/0101.html>

--steffen


