Re: A draft for a multibyte and multi-codepoint C string interface

To: tech-userlevel%NetBSD.org@localhost
Subject: Re: A draft for a multibyte and multi-codepoint C string interface
From: Mouse <mouse%Rodents-Montreal.ORG@localhost>
Date: Sat, 30 Mar 2013 20:57:45 -0400 (EDT)

> It follows that we have exactly two possibilities.  Either we extend
> the wide character interface such that it is capable to work on
> sequences, i.e., multiple adjacent wchar_t codepoints.  Or we
> introduce a new byte-based interface, one that works with
> multi-codepoint sequences of (most likely) multibyte characters.

There is actually a third possibility: widen `char' so that it can
store a _character_, i.e., an entire codepoint.

I don't expect this to be done.  But it really seems to me like the
only truly right answer.  The dissonance between `char's and the things
that hold codepoints is the reason we have octet-serialization formats
like UTF-8 to begin with.  (Which is acceptable as a serialization
format, but trying to work with it as anything else is horrible.)
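
A minimal sketch of that dissonance, assuming well-formed UTF-8 input:
strlen() counts octets, so getting a count of codepoints takes a
separate pass that skips continuation octets.  The name
utf8_codepoints is made up for illustration and is not part of any
proposed interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count codepoints in a NUL-terminated UTF-8 string by counting every
 * octet that is not a continuation octet (10xxxxxx).
 * Assumption: the input is well-formed UTF-8; nothing is validated.
 */
static size_t
utf8_codepoints(const char *s)
{
	size_t n = 0;

	assert(s != NULL);
	for (; *s != '\0'; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}
```

For the two-octet encoding of U+00E9, "\xC3\xA9", strlen() reports 2
while utf8_codepoints() reports 1; with a `char' wide enough to hold a
codepoint the two numbers would be the same thing.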

Things like combining characters I would be inclined to not worry
about; it's basically the same issue we've had forever with things like
underscore-backspace-letter sequences.

For initial experiments, I would probably use 16 bits and not worry
about anything outside the BMP.  That would be enough to learn to deal
with things like the hardware's addressing granularity being finer than
C's; eventually, 32 or maybe even 24 bits might be suitable.
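
A rough sketch of what such an experiment might look like at the I/O
boundary, simulated with uint16_t since no compiler here actually has
a 16-bit `char'.  The type name xchar and the function utf8_to_xchar
are hypothetical, and the decoder deliberately refuses anything
outside the BMP; it also does not reject overlong encodings or
surrogates, which a real implementation would need to do.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 16-bit `char': one element, one codepoint. */
typedef uint16_t xchar;

/*
 * Decode one UTF-8 sequence of at most three octets into a single xchar.
 * Returns the number of octets consumed, or 0 if the input is truncated,
 * malformed, or encodes a codepoint outside the BMP (a four-octet
 * sequence).  Assumption: overlong forms and surrogates are not checked.
 */
static size_t
utf8_to_xchar(const unsigned char *in, size_t len, xchar *out)
{
	if (len >= 1 && in[0] < 0x80) {
		*out = in[0];
		return 1;
	}
	if (len >= 2 && (in[0] & 0xE0) == 0xC0 && (in[1] & 0xC0) == 0x80) {
		*out = (xchar)(((in[0] & 0x1F) << 6) | (in[1] & 0x3F));
		return 2;
	}
	if (len >= 3 && (in[0] & 0xF0) == 0xE0 &&
	    (in[1] & 0xC0) == 0x80 && (in[2] & 0xC0) == 0x80) {
		*out = (xchar)(((in[0] & 0x0F) << 12) |
		    ((in[1] & 0x3F) << 6) | (in[2] & 0x3F));
		return 3;
	}
	return 0;
}
```

Once decoded, indexing an xchar array indexes characters directly, and
UTF-8 is left to do what it is good at, serialization.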

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse%rodents-montreal.org@localhost
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
