tech-userlevel archive
Re: Proposal: _ctype_ table bitwidth change
On Mar 23, 2011, at 12:32 PM, der Mouse wrote:
> If you believe that is*() must be passed the encoding, rather than the
> codepoint, then yes, it doesn't make sense to ask what isspace()
> returns for NO-BREAK SPACE in en_US.UTF-8, because you can't pass
> NO-BREAK SPACE to it (and what it returns for 0xa0 doesn't matter,
> because that's not a valid encoding).
>
> I maintain that is*() must be passed the codepoint - the "character".
FWIW, from my admittedly rudimentary knowledge of i18n issues, is*() should be
passed the character as encoded for the current locale. Obviously, this breaks
down for multi-byte encodings such as UTF-8, where a character may not fit in a
single byte, which is why there's isw*(). However, it works fine for the
single-byte encodings, and I think if you consider how things should work in a
single-byte encoding (say, ISO-8859-6), it'll be clearer that is*() should be
passed the encoding rather than the codepoint: if you've got a string encoded
in 8859-6 and want to walk through it calling isspace(), you'd just pass each
octet to isspace(). You wouldn't first run iconv() or the like on it to convert
each octet to its Unicode code point.
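
To make the single-byte case concrete, here is a minimal sketch of walking such
a string. The helper name count_space_octets() is hypothetical, and the sketch
assumes the caller has already selected the matching locale with setlocale();
it is an illustration, not code from libc:

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the whitespace octets in a string held in a
 * single-byte encoding (e.g. ISO-8859-6), as classified by the current
 * locale's LC_CTYPE category.  No conversion to Unicode code points is
 * performed; each octet is handed to isspace() as-is.
 */
size_t
count_space_octets(const char *s)
{
	size_t n = 0;

	for (; *s != '\0'; s++) {
		/*
		 * Cast to unsigned char: passing a negative value other
		 * than EOF to isspace() is undefined behavior, and plain
		 * char may be signed.
		 */
		if (isspace((unsigned char)*s))
			n++;
	}
	return n;
}
```

What a high octet such as 0xa0 classifies as depends entirely on the locale in
effect, which is the point: the argument is a byte of the locale's encoding.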
Also, see the definition of "character" at
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html#tag_03_87
which reads: "A sequence of one or more bytes representing a single graphic
symbol or control code. Note: This term corresponds to the ISO C standard term
multi-byte character, where a single-byte character is a special case of a
multi-byte character."
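
For the multi-byte case, where a character is a sequence of bytes, the usual
route is to decode first and classify with isw*(). A rough sketch, assuming the
caller has set LC_CTYPE to the string's encoding; the helper name
count_space_chars() is made up for illustration:

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: count the whitespace characters in a multi-byte
 * string, decoding it according to the current locale's LC_CTYPE.
 * Returns (size_t)-1 if the string contains an invalid or truncated
 * multi-byte sequence.
 */
size_t
count_space_chars(const char *s)
{
	mbstate_t st;
	size_t left = strlen(s);
	size_t n = 0;

	memset(&st, 0, sizeof(st));	/* start in the initial shift state */
	while (left > 0) {
		wchar_t wc;
		size_t len = mbrtowc(&wc, s, left, &st);

		if (len == (size_t)-1 || len == (size_t)-2)
			return (size_t)-1;	/* invalid or incomplete */
		if (len == 0)
			break;			/* decoded a null character */
		if (iswspace((wint_t)wc))
			n++;
		s += len;
		left -= len;
	}
	return n;
}
```

In a single-byte locale mbrtowc() consumes one byte per call, so this reduces
to the octet-at-a-time loop, consistent with a single-byte character being a
special case of a multi-byte character.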
--
Name: Dave Huang | Mammal, mammal / their names are called /
INet: khym%azeotrope.org@localhost | they raise a paw / the bat, the cat /
FurryMUCK: Dahan | dolphin and dog / koala bear and hog -- TMBG
Dahan: Hani G Y+C 35 Y++ L+++ W- C++ T++ A+ E+ S++ V++ F- Q+++ P+ B+ PA+ PL++