Re: A draft for a multibyte and multi-codepoint C string interface

To: tech-userlevel%NetBSD.org@localhost
Subject: Re: A draft for a multibyte and multi-codepoint C string interface
From: Steffen "Daode" Nurpmeso <sdaoden%gmail.com@localhost>
Date: Sun, 31 Mar 2013 17:44:38 +0200

Hello,

Mouse <mouse%Rodents-Montreal.ORG@localhost> wrote:
 |> It follows that we have exactly two possibilities.  Either we extend
 |> the wide character interface such that it is capable to work on
 |> sequences, i.e., multiple adjacent wchar_t codepoints.  Or we
 |> introduce a new byte-based interface, one that works with
 |> multi-codepoint sequences of (most likely) multibyte characters.
 |
 |There is actually a third possibility: widen `char', so that it can
 |store a _character_, ie, an entire codepoint.
 |
 |I don't expect this to be done.  But it really seems to me like the

mmmh, i think i'll be working on that some more in the future.

 |only truly right answer.  The dissonance between `char's and the things
 |that hold codepoints is the reason we have octet-serialization formats
 |like UTF-8 to begin with.  (Which is acceptable as a serialization
 |format, but trying to work with it as anything else is horrible.)

That is very true, and i really hate to give up the 1:1 relation
of array-index <-> character, but that kind of thing is no longer
possible anyway; you have to work through a string sequentially
from front to back, at most keeping a few intermediate safe jump
points you can rewind to.
If that is true, the actual storage format is rather
uninteresting; and UTF-8 is variable-length and
self-synchronizing, so that by looking at the first byte and
ensuring sufficient length, you can jump over an entire
sequence.  That is just one step more than what is necessary to
act on wchar_t*.
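
To illustrate the "look at the first byte, ensure sufficient
length, jump" step, here is a minimal sketch in C.  The names
utf8_seq_len() and utf8_count() are made up for this mail and are
not part of the draft; the check is simplified and does not reject
overlong encodings or surrogates.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the byte length of the UTF-8 sequence
 * that starts at s, or 0 if the lead byte is invalid, fewer than the
 * required bytes are available, or a continuation byte is malformed.
 * Simplification: overlong forms and surrogates are not rejected.
 */
static size_t
utf8_seq_len(const unsigned char *s, size_t avail)
{
    size_t need, i;

    assert(s != NULL);
    if (avail == 0)
        return 0;

    /* The lead byte alone tells how long the sequence must be. */
    if (s[0] < 0x80)
        need = 1;
    else if ((s[0] & 0xE0) == 0xC0)
        need = 2;
    else if ((s[0] & 0xF0) == 0xE0)
        need = 3;
    else if ((s[0] & 0xF8) == 0xF0)
        need = 4;
    else
        return 0;       /* continuation byte or invalid lead byte */

    /* Ensure sufficient length before jumping. */
    if (need > avail)
        return 0;
    for (i = 1; i < need; i++)
        if ((s[i] & 0xC0) != 0x80)
            return 0;
    return need;
}

/*
 * Hypothetical helper: walk the buffer front to back and count the
 * codepoints; returns (size_t)-1 on a malformed sequence.
 */
static size_t
utf8_count(const char *buf, size_t len)
{
    const unsigned char *p = (const unsigned char *)buf;
    size_t off = 0, count = 0, n;

    assert(buf != NULL);
    while (off < len) {
        n = utf8_seq_len(p + off, len - off);
        if (n == 0)
            return (size_t)-1;
        off += n;       /* jump over the entire sequence */
        count++;
    }
    return count;
}
```

The only extra work compared to stepping through a wchar_t* is the
length lookup and the bounds check before each jump.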

 |
 |Things like combining characters I would be inclined to not worry
 |about; it's basically the same issue we've had forever with things like
 |underscore-backspace-letter sequences.

It is necessary to deal with these sequences for at least
comparison purposes.
E.g., in the terminal on Mac OS X, tab-completion and file-globbing
don't work for files on Apple-managed filesystems, because the
names are normalized, and neither the Apple-supplied bash and ksh
nor the mksh i'm running get that right.
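
A small C sketch of why plain byte comparison fails here, assuming
the filesystem hands back the decomposed form of a name while the
user typed the precomposed one.  The two arrays and
bytewise_equal() are illustrative only.

```c
#include <assert.h>
#include <string.h>

/* "e with acute" as one precomposed codepoint, U+00E9. */
static const char precomposed[] = "\xC3\xA9";

/* The same visible character as U+0065 followed by the combining
 * acute accent U+0301. */
static const char decomposed[] = "e\xCC\x81";

/* What a shell effectively does when it matches names byte by byte. */
static int
bytewise_equal(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}
```

bytewise_equal(precomposed, decomposed) returns 0 although both
strings render as the same character, so a comparison has to look
at whole sequences (or normalize both sides first).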

 |For initial experiments, I would probably use 16 bits and not worry
 |about anything outside the BMP.  That would be enough to learn to deal
 |with things like the hardware's addressing granularity being finer than
 |C's; eventually, 32 or maybe even 24 bits might be suitable.

I think wide-enabled ncurses uses an array of five such codepoints
to represent a single visual cell.
The really good thing about that new interface would be that you
always pass in a pointer to a buffer, so that, unlike what
Plan 9 did and does, you don't have to change or extend the type
and storage requirements should the need arise.  You simply keep on
passing user input, like a line that has been read, through.
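
To make the "pass in a pointer to a buffer" idea concrete, here is
a rough C sketch of what such a call could look like.  seq_len(),
decode_one() and is_combining() are hypothetical names, not taken
from the draft, and treating only U+0300..U+036F as combining is a
simplifying assumption; a real implementation would consult the
Unicode character database.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Decode one UTF-8 codepoint from buf[0..len), store it in *cp and
 * return the number of bytes consumed, or 0 if the sequence is
 * malformed or truncated.  Simplified: overlong forms and surrogates
 * are not rejected.
 */
static size_t
decode_one(const unsigned char *buf, size_t len, unsigned long *cp)
{
    size_t need, i;
    unsigned long c;

    assert(buf != NULL && cp != NULL);
    if (len == 0)
        return 0;
    if (buf[0] < 0x80) {
        c = buf[0];
        need = 1;
    } else if ((buf[0] & 0xE0) == 0xC0) {
        c = buf[0] & 0x1F;
        need = 2;
    } else if ((buf[0] & 0xF0) == 0xE0) {
        c = buf[0] & 0x0F;
        need = 3;
    } else if ((buf[0] & 0xF8) == 0xF0) {
        c = buf[0] & 0x07;
        need = 4;
    } else
        return 0;
    if (need > len)
        return 0;
    for (i = 1; i < need; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (buf[i] & 0x3F);
    }
    *cp = c;
    return need;
}

/* Assumption: only the Combining Diacritical Marks block counts. */
static int
is_combining(unsigned long cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/*
 * Hypothetical interface: return the byte length of the first
 * multi-codepoint sequence in buf[0..len), i.e. one codepoint plus
 * any combining marks that follow it, or 0 on malformed input.
 */
static size_t
seq_len(const char *buf, size_t len)
{
    const unsigned char *p = (const unsigned char *)buf;
    unsigned long cp;
    size_t used, n;

    assert(buf != NULL);
    used = decode_one(p, len, &cp);
    if (used == 0)
        return 0;
    while (used < len) {
        n = decode_one(p + used, len - used, &cp);
        if (n == 0 || !is_combining(cp))
            break;
        used += n;
    }
    return used;
}
```

The caller only ever sees char pointers and byte counts, so a line
read from the user can be handed through unchanged, and the
storage type stays the same no matter how many codepoints end up
forming one sequence.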

 |/~\ The ASCII                           Mouse
 |\ / Ribbon Campaign
 | X  Against HTML              mouse%rodents-montreal.org@localhost
 |/ \ Email!         7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

Thanks, and more Happy Easter to come,

--steffen
