Re: A draft for a multibyte and multi-codepoint C string interface

To: tech-userlevel%netbsd.org@localhost
Subject: Re: A draft for a multibyte and multi-codepoint C string interface
From: tlaronde%polynum.com@localhost
Date: Thu, 18 Apr 2013 09:08:33 +0200

On Wed, Apr 17, 2013 at 08:55:04PM +0200, Steffen Daode Nurpmeso wrote:
>  | 
>  |UTF-8 is a standard, and comes from Plan9 ;) Furthermore, Plan9 speaks
>  |POSIX with the APE environment (the very same appears with Windows:
> 
> ..you like it?
> 

I do not develop for Windows myself, but when I had to deal with it
(for others: TeX users on Windows), I liked that it was straightforward
to add it as a POSIX target. The final aim, since TeX is C89 (apart
from some scripts called by MetaPost that depend on the system
interpreter), is to compile for the windows32 subsystem, and not the
POSIX one.

The same cannot be said of others, who are supposed to have Unix-like
systems (even if they claim this is not the case) and who keep bashing
(sic!) others over standards compliance while not being standards
compliant themselves...

> You do need a mouse for Acme - that is 2x 2D, then.

Absolutely. As long as an interface is 2D, there is no reason to have
curses instead of a graphical interface that allows mouse use. If you
have no 2D (this is the case with some servers, CPU or filesystem), you
need a 1D editor. This is typically something new Plan9 users do not
understand: they look for more (when windows already scroll) or for a
curses-like editor...

> 
> I personally have only one interest in respect to standards, and
> that is portable usability, because that *would* allow me to focus
> on what i want to do.
> 

I also have portability in sight, but this means two actions. First,
know what I'm using and split the code between the portable part
(portable for the language: for C this is C89, moving to C99) and the
system-dependent part.

And when I want portability, I focus on the minimal set, not the
extensions and certainly not the fuzzy! When it comes to glyphs,
the big, easy step is to allow the user to nickname software
resources (files) using a wider range of numbers (rendered as glyphs)
without losing sight of what a system is. UTF-8 allows this,
because it is just octets, that is, C strings.

And the system should stop there: it allows a wide range of glyphs. But
it does not care.
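
To make the "octets, C strings" point concrete, here is a minimal
sketch in C89. The file name and the helper utf8_codepoints() are
invented for this illustration, and the helper assumes well-formed
UTF-8. The ordinary str*() functions handle the name as plain octets;
only a caller that cares about code points needs anything more.

```c
#include <stddef.h>
#include <string.h>

/* A file name with two accented letters (e-acute = C3 A9 in UTF-8),
 * spelled out as octets so that this source stays pure ASCII. */
const char utf8_name[] = "r\xC3\xA9sum\xC3\xA9.tex";

/* Hypothetical helper: count code points in a well-formed UTF-8
 * string by skipping continuation octets (bit pattern 10xxxxxx). */
size_t utf8_codepoints(const char *s)
{
	size_t n = 0;

	for (; *s != '\0'; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}

/* Hypothetical helper: find the suffix of a name with plain
 * strrchr(); no knowledge of UTF-8 is needed for this. */
const char *name_suffix(const char *name)
{
	const char *dot = strrchr(name, '.');

	return dot != NULL ? dot : "";
}
```

strlen(utf8_name) counts 12 octets while utf8_codepoints(utf8_name)
counts 10 code points, and name_suffix() still finds ".tex", because no
octet of a multi-octet UTF-8 sequence can be mistaken for an ASCII
character.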

The C *printf() and *scanf() functions should be locale independent,
because they speak C: when piping between programs, it is the
programming language that has to be spoken, not the fine points of
typography.
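
As things stand in standard C, the radix character used by *printf()
follows LC_NUMERIC, so a program that pipes numbers to another program
has to pin that category itself. A minimal sketch, assuming a
single-threaded program; the helper name emit_c_number() and the fixed
"%.2f" format are invented for this illustration, and snprintf() is
C99, not C89.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format v into buf using C syntax.
 * LC_NUMERIC is pinned to "C" for the duration of the call, so the
 * radix character is '.' whatever locale the caller has selected.
 * Not thread safe: setlocale() changes process-wide state. */
int emit_c_number(char *buf, size_t len, double v)
{
	char saved[64];
	const char *cur = setlocale(LC_NUMERIC, NULL);
	int n;

	/* Copy the name: a later setlocale() call may overwrite it. */
	strncpy(saved, cur != NULL ? cur : "C", sizeof saved - 1);
	saved[sizeof saved - 1] = '\0';

	setlocale(LC_NUMERIC, "C");
	n = snprintf(buf, len, "%.2f", v);
	setlocale(LC_NUMERIC, saved);
	return n;
}
```

The reading side would do the same around strtod() or *scanf(), so
that both ends of the pipe agree on "2.18" regardless of the user's
environment.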

When it comes to fine points of typography, semantics and so on, this
is a higher level. For numbers, if a French user wants to render
numbers with a comma instead of a dot (2.18 is normally typeset as 2,18
in French), this affects _text_ files and calls for specialized tools
or a custom interface.
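
Such a specialized tool can be very small. The sketch below keeps the
C-syntax number as the exchanged form and only rewrites the radix
character when rendering for a reader; render_radix_comma() is a
hypothetical name, and real French typography (digit grouping, spacing)
would need more than this.

```c
#include <stddef.h>

/* Hypothetical presentation helper: copy the C-syntax number in src
 * to dst, replacing the radix '.' by ','. Returns 0 on success, or
 * -1 if dst (len octets) cannot hold the result and its final NUL. */
int render_radix_comma(char *dst, size_t len, const char *src)
{
	size_t i;

	for (i = 0; src[i] != '\0'; i++) {
		if (i + 1 >= len)
			return -1;	/* no room left for the NUL */
		dst[i] = (src[i] == '.') ? ',' : src[i];
	}
	if (i >= len)
		return -1;
	dst[i] = '\0';
	return 0;
}
```

The data flowing between programs stays "2.18"; only the last step
before the reader's eyes turns it into "2,18".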

Where is the problem? Unicode has not unified the elementary glyphs
(without ligatures, without controls except for the normal legacy ones
of ASCII, latin1 etc.) but has created a mess. Unicode is not a string
of glyphs, it is a programming language!

> 
> But a few weeks ago there was an interesting thread on a Unicode
> list (start: [1]) about how integers (numbers) should or could be
> parsed, since Unicode offers so many combinations via combining:
> 
>   European digits (U+0030 to U+0039) may, since Unicode 6.1.0, be
>   used with variation selectors. As their primary purpose is for use
>   with u+20E3 COMBINING ENCLOSING KEYCAP, is it legitimate to fail
>   to recognise strings of digits with variation selectors as
>   representing numbers?
> 
> This is a can of worms.

Yes. The engineering axiom is to keep the kernel and the OS far from
this: using code points as controls embeds non-glyphs in the strings
and spoils them for every use, not only in a typographical document.

This is crazy!
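
A small sketch of what this looks like to a byte-level parser. The
string below is digit FOUR followed by U+FE0F VARIATION SELECTOR-16
and U+20E3 COMBINING ENCLOSING KEYCAP, then digit TWO; the names
keycap_digits and parse_leading_int() are invented for this
illustration.

```c
#include <stdlib.h>

/* "4", U+FE0F (EF B8 8F), U+20E3 (E2 83 A3), "2" as UTF-8 octets.
 * The literal is split so that no hex escape swallows a digit. */
const char keycap_digits[] = "4" "\xEF\xB8\x8F" "\xE2\x83\xA3" "2";

/* Hypothetical helper: parse a leading decimal integer with strtol()
 * and report in *used how many octets were consumed. */
long parse_leading_int(const char *s, size_t *used)
{
	char *end;
	long v = strtol(s, &end, 10);

	*used = (size_t)(end - s);
	return v;
}
```

strtol() consumes the single octet "4" and stops at the first octet of
the variation selector: the eight octets are neither the number 42 nor
cleanly "not a number". Deciding what they mean is exactly the kind of
question that does not belong in the kernel or in libc.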

> 
> The 16 bit restriction was a conscious decision, right?  ISO 10646
> was 31 bit from the very start.  (But every carefully crafted
> program should ?simply? scale due to the rune-max constant.  But
> bytes are 8 bit fixed.)
> 

Are there really people who want to switch to "Unicode" and not
UTF-8, that is, to impose a (for the moment) four-octet "character"
(big endian? little endian?) while waiting for the need of an
eight-octet one?
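
The byte-order question can be shown in a few lines. This is a sketch
only: utf8_encode() and put32() are hypothetical helpers, the encoder
follows the usual UTF-8 bit layout up to U+10FFFF, and it does not
reject surrogates.

```c
#include <stddef.h>

/* Encode cp as UTF-8 into out (at least 4 octets). Returns the
 * number of octets written, or 0 if cp is above U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
	if (cp < 0x80UL) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800UL) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000UL) {
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	if (cp < 0x110000UL) {
		out[0] = (unsigned char)(0xF0 | (cp >> 18));
		out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return 4;
	}
	return 0;
}

/* Serialize cp as one 32-bit unit into out (4 octets). The caller
 * has to choose a byte order; UTF-8 never asks that question. */
void put32(unsigned long cp, int big_endian, unsigned char *out)
{
	int i;

	for (i = 0; i < 4; i++) {
		int shift = big_endian ? 8 * (3 - i) : 8 * i;

		out[i] = (unsigned char)((cp >> shift) & 0xFF);
	}
}
```

For U+00E9, utf8_encode() always yields the two octets C3 A9, while
put32() yields 00 00 00 E9 or E9 00 00 00 depending on a choice that
both ends of a pipe or a file format then have to agree on.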

-- 
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C
