tech-userlevel archive


Re: Proposal: _ctype_ table bitwidth change



> NO-BREAK SPACE, which is 0xC2 0xA0 in en_US.UTF-8, obviously falls
> into the some-codepoints-that-can't-fit-in-unsigned-char category.

It's certainly not obvious to me.  That is not the codepoint but the
encoding of the codepoint.  The codepoint is the abstract integer 160,
which _does_ fit into unsigned char.

This is why I was drawing a distinction between codepoints and
encodings - serializations - of codepoints upthread.  In traditional
8-bit character sets, every codepoint has a one-octet encoding with the
trivial mapping between the codepoint and the value of that octet, so
the distinction is easy to lose track of.  But it's an important
distinction when dealing with a charset like Unicode and with encodings
(like UTF-8 and UTF-7) that do not have such a trivial mapping between
codepoints and octets.
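
To make the distinction concrete, here is a minimal sketch of a UTF-8
serializer.  The name utf8_encode is hypothetical, not part of libc or
of anything proposed in this thread; it only illustrates that the
codepoint (one integer) and its encoding (one to four octets) are
different objects.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: serialize one Unicode codepoint as UTF-8.
 * Returns the number of octets written to out (1..4), or 0 if cp is
 * not encodable (a surrogate, or above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        /* ASCII range: the trivial one-octet mapping. */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        /* Two octets: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;               /* surrogates are not encodable */
        /* Three octets: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        /* Four octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                       /* beyond the Unicode range */
}
```

Calling utf8_encode(0xA0, buf) writes the two octets 0xC2 0xA0 and
returns 2, while the argument itself, 160, is a single value no larger
than UCHAR_MAX on any platform with 8-bit chars.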

If you believe that is*() must be passed the encoding, rather than the
codepoint, then yes, it doesn't make sense to ask what isspace()
returns for NO-BREAK SPACE in en_US.UTF-8, because you can't pass
NO-BREAK SPACE to it (and what it returns for 0xA0 doesn't matter,
because a lone 0xA0 octet is not a valid encoding of anything in
UTF-8).
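
As a rough illustration of why a lone 0xA0 is not a valid encoding,
the sketch below classifies a single octet by the role it can play in
UTF-8.  The function name utf8_lead_length is made up for this
example; it is an assumption for illustration, not an existing
interface.

```c
/* Hypothetical helper: classify one octet by its role in UTF-8.
 * Returns the total sequence length (1..4) if b can begin a sequence,
 * or 0 if b is a continuation octet (0x80..0xBF) or can never appear
 * in well-formed UTF-8 (0xC0, 0xC1, 0xF5..0xFF). */
int utf8_lead_length(unsigned char b)
{
    if (b < 0x80)
        return 1;   /* ASCII: complete on its own */
    if (b < 0xC2)
        return 0;   /* continuation octet, or overlong lead 0xC0/0xC1 */
    if (b < 0xE0)
        return 2;   /* lead octet of a two-octet sequence */
    if (b < 0xF0)
        return 3;   /* lead octet of a three-octet sequence */
    if (b < 0xF5)
        return 4;   /* lead octet of a four-octet sequence */
    return 0;       /* 0xF5..0xFF never occur */
}
```

Here utf8_lead_length(0xA0) is 0, since 0xA0 can only ever be a
continuation octet, whereas utf8_lead_length(0xC2) is 2: the octet
0xA0 means something only as the second half of a pair such as
0xC2 0xA0.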

I maintain that is*() must be passed the codepoint - the "character".

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse%rodents-montreal.org@localhost
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

