tech-userlevel archive

Re: Proposal: _ctype_ table bitwidth change

> 0xa0 as Unicode Code Point is not representable as unsigned char with
> UTF-8 encoding.

That doesn't even make sense.

UTF-8 takes Unicode codepoints and produces not octets but sequences of
octets.  The Unicode code point 0xa0 is representable, as a codepoint,
as unsigned char; in this it is no different from any other integer in
the range 0..255.  It is representable as an octet sequence via
encodings such as UTF-8.  These two concepts should not be confused.
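
A minimal sketch of that distinction, written in Python purely for
illustration (the thread itself concerns C's ctype table, and the
variable names here are made up for the example):

```python
# Illustrative sketch: a codepoint is an integer; UTF-8 maps it to octets.
codepoint = 0xA0                      # the integer 160

# As a plain integer it fits in one octet, like any value in 0..255.
fits_in_octet = 0 <= codepoint <= 0xFF

# Encoding the codepoint with UTF-8 yields a sequence of octets.
utf8_octets = chr(codepoint).encode("utf-8")

print(fits_in_octet)                  # True
print(list(utf8_octets))              # [194, 160], i.e. 0xc2 0xa0
```

The integer and its encoded form are separate objects here, which is
the point: one is a number, the other a two-octet sequence.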

If you prefer, you can think of an unsigned char holding 0xa0 as an
octet sequence of length 1.  That octet sequence is not a valid UTF-8
encoding sequence (though it can be part of one), but that does not
make the isolated octet any less a perfectly good way of storing the
integer 160, even if that 160 is conceptually a Unicode codepoint.
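
The same point can be checked mechanically. The following is only a
sketch, in Python for convenience, and relies on Python's built-in
strict UTF-8 decoder rather than anything from the C library under
discussion:

```python
# Illustrative sketch: the lone octet 0xa0 versus UTF-8 decoding.
lone = bytes([0xA0])

# The octet stores the integer 160 without any trouble.
stored_value = lone[0]

# On its own it is not a valid UTF-8 sequence: 0xa0 has the bit pattern
# of a continuation octet, and there is no lead octet before it.
try:
    lone.decode("utf-8")
    lone_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_is_valid_utf8 = False

# The same octet can be part of a valid sequence: 0xc2 0xa0 decodes
# back to codepoint 160.
decoded_codepoint = ord(bytes([0xC2, 0xA0]).decode("utf-8"))

print(stored_value, lone_is_valid_utf8, decoded_codepoint)
```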

This is really no different from the way the number 18 can be stored as
the bit sequence 00000000000000000000000000010010 or the bit sequence
different representations for different uses.

> 0xa0 is not a valid space character, since it is not a valid
> character by itself.

Certainly it is (though it is not by itself a valid _UTF-8 encoding of_
a character).  0xa0 is 160.  Unicode codepoint 160 is NO-BREAK SPACE;
I can't see how this could _not_ be a space character.
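
For what it is worth, the Unicode character database can be queried
directly.  This sketch uses Python's unicodedata module; note that
Python's str.isspace() follows Unicode properties and is not the same
thing as C's locale-dependent isspace(), so it illustrates the Unicode
side of the argument only:

```python
import unicodedata

# Illustrative sketch: what Unicode says about codepoint 160.
ch = chr(160)

name = unicodedata.name(ch)           # the character's Unicode name
category = unicodedata.category(ch)   # general category; Zs = space separator
is_space = ch.isspace()               # Python's Unicode-aware whitespace test

print(name, category, is_space)       # NO-BREAK SPACE Zs True
```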

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML      
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B