tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface



> It follows that we have exactly two possibilities.  Either we extend
> the wide character interface such that it is capable to work on
> sequences, i.e., multiple adjacent wchar_t codepoints.  Or we
> introduce a new byte-based interface, one that works with
> multi-codepoint sequences of (most likely) multibyte characters.

There is actually a third possibility: widen `char', so that it can
store a _character_, i.e., an entire codepoint.

I don't expect this to be done.  But it really seems to me like the
only truly right answer.  The dissonance between `char's and the things
that hold codepoints is the reason we have octet-serialization formats
like UTF-8 to begin with.  (Which is acceptable as a serialization
format, but trying to work with it as anything else is horrible.)
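
To make the serialization point concrete, here is a minimal sketch of
encoding one codepoint as UTF-8 octets.  The name utf8_encode is
hypothetical, not part of any existing or proposed interface; the bit
layout is the standard UTF-8 one.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Serialize one codepoint as UTF-8 into buf, which must have room for
 * 4 octets.  Returns the number of octets written (1 to 4), or 0 if cp
 * is a surrogate or lies beyond U+10FFFF.
 * (utf8_encode is a hypothetical name used only for illustration.)
 */
size_t utf8_encode(uint32_t cp, unsigned char *buf)
{
    if (cp < 0x80) {
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;           /* surrogates are not characters */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                   /* outside the Unicode range */
}
```

A single character such as U+20AC comes out as three `char's, which is
exactly the mismatch: anything that indexes or counts `char's is working
on the serialization, not on characters.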

Things like combining characters I would be inclined to not worry
about; it's basically the same issue we've had forever with things like
underscore-backspace-letter sequences.

For initial experiments, I would probably use 16 bits and not worry
about anything outside the BMP.  That would be enough to learn how to
deal with things like the hardware's addressing granularity being finer
than C's; eventually, 32 or maybe even 24 bits might be suitable
(Unicode codepoints stop at U+10FFFF, so 21 bits cover all of them).
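
As a rough sketch of what the BMP-only experiment looks like from the
program's side, assume a 16-bit type standing in for the widened `char'.
The names xchar, xchar_from_codepoint and xstrlen are hypothetical; a
real experiment would change `char' itself in the compiler, which this
does not attempt.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a `char' widened to 16 bits. */
typedef uint16_t xchar;

/*
 * Store cp in *out if it is a BMP character.  Returns 1 on success,
 * 0 if cp is a surrogate or lies outside the BMP (the case the
 * 16-bit experiment deliberately ignores).
 */
int xchar_from_codepoint(uint32_t cp, xchar *out)
{
    if (cp > 0xFFFF)
        return 0;               /* outside the BMP */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogate, not a character */
    *out = (xchar)cp;
    return 1;
}

/* Length of a 0-terminated xchar string, counted in characters. */
size_t xstrlen(const xchar *s)
{
    size_t n = 0;

    while (s[n] != 0)
        n++;
    return n;
}
```

With one element per codepoint, the length and the codepoint count are
the same number, and indexing lands on a whole codepoint every time.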

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse%rodents-montreal.org@localhost
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

