Re: A draft for a multibyte and multi-codepoint C string interface

To: tech-userlevel%netbsd.org@localhost
Subject: Re: A draft for a multibyte and multi-codepoint C string interface
From: Steffen "Daode" Nurpmeso <sdaoden%gmail.com@localhost>
Date: Wed, 17 Apr 2013 13:31:43 +0200

tlaronde%polynum.com@localhost wrote:
  |On Mon, Apr 15, 2013 at 05:51:33PM -0400, James K. Lowden wrote:
  |> I'm interested in useability. 
  |
  |To drop codepages, and to mandate that, on the user level, UTF-8
  |is the rule (bye-bye localization and so on) allows this. But this
  |has been done for Plan9 and this should be considered a reference.
  |Specifically, what has been done, and what has not been done.
 
But Unicode collation is somewhat locale-specific, in permanent
transition and very complicated.  Plan9 simply doesn't know about
locales at all (?) and gives a s..t about any standards.  I think
they're right: who can be creative and invent new things with
something like POSIX on their back?

  |What would be a great step, is the ability for the text utilities
  |to deal with UTF-8. But this is in user space, because it is where
  |it belongs.

I totally agree -- using a round-trip with wchar_t for file I/O
just to be able to get access to proper character classification
is a terrible thing to do; at least unless you store the files in
UTF-32 (or, 0xDEADBEEF!, UTF-16), which is just as terrible.
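
To illustrate what handling UTF-8 directly in user space, without
the wchar_t round-trip, could look like, here is a minimal sketch.
The name utf8_decode is hypothetical and not part of the draft
interface; it decodes one code point from a NUL-terminated octet
string and rejects overlong forms, surrogates and truncated
sequences.  Character classification would still need tables on
top of the decoded value.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * utf8_decode (hypothetical helper, for illustration only): decode one
 * code point from the NUL-terminated octet string s into *cp.
 * Returns the number of octets consumed (1..4), 0 at the terminating
 * NUL, or -1 on malformed input (*cp is then left untouched).
 */
static int
utf8_decode(const unsigned char *s, uint32_t *cp)
{
	/* Smallest code point that legitimately needs `len' octets. */
	static const uint32_t min[5] = { 0, 0, 0x80, 0x800, 0x10000 };
	uint32_t c;
	int len, i;

	if (s[0] == 0)
		return 0;
	if (s[0] < 0x80) {			/* plain ASCII */
		*cp = s[0];
		return 1;
	}
	if ((s[0] & 0xE0) == 0xC0) {
		len = 2; c = s[0] & 0x1F;
	} else if ((s[0] & 0xF0) == 0xE0) {
		len = 3; c = s[0] & 0x0F;
	} else if ((s[0] & 0xF8) == 0xF0) {
		len = 4; c = s[0] & 0x07;
	} else
		return -1;	/* stray continuation or invalid lead octet */

	for (i = 1; i < len; ++i) {
		/* A premature NUL fails this test too, so we never overrun. */
		if ((s[i] & 0xC0) != 0x80)
			return -1;
		c = (c << 6) | (s[i] & 0x3F);
	}
	/* Reject overlong encodings, out-of-range values and surrogates. */
	if (c < min[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		return -1;
	*cp = c;
	return len;
}
```

A caller would loop over the string, advancing by the return value
until it gets 0, and treat -1 as an input error.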

  |Furthermore, Unicode is kind of a mess (for example, instead of

It's a pretty closed community, with a lot of noise from political
and economic interests.  Errors happened (they admit that), but
the promise of stability etc. is also a reason for some rough
edges.  But, you know -- I'm *far* from being an expert on all
the linguistic problems etc. that these linguists had to deal
with.  They surely would do things a bit differently if they could
start anew from scratch...

  |Unicode is not something fixed, perfect, set one and forever. It

Well, certainly not forever, but certainly for the next decades.
Where, except for ISO 10646 (which is almost brought into line
with Unicode anyway), is the approved intellectual power to deal
with the world's languages?  And who is willing to spend the money
for something new?  And why?  Certainly for the next decades.

  |that could be adapted to whatever encoding appears later. Keep the
  |OS clean with nul terminated octet strings. Enhance utilities to
  |deal with UTF-8 (and not runes for exchange; runes are only
  |internal).
  |And keep it agnostic about the meaning of the glyphs.

I agree.
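
As a small sketch of how far plain NUL-terminated octet strings
already get you: counting code points needs no runes and no
decoding at all, only a test for continuation octets.  The name
utf8_count is made up for this example, and the function assumes
the input is already valid UTF-8.  Note that it counts code
points, not user-perceived characters, so combining sequences
still count as several.

```c
#include <stddef.h>

/*
 * utf8_count (hypothetical helper, for illustration only): count the
 * code points in a NUL-terminated string that is assumed to be valid
 * UTF-8.  Every octet that is not a continuation octet (10xxxxxx)
 * starts a new code point.
 */
static size_t
utf8_count(const char *s)
{
	size_t n = 0;

	for (; *s != '\0'; ++s)
		if (((unsigned char)*s & 0xC0) != 0x80)
			++n;
	return n;
}
```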

--steffen

