tech-userlevel archive


Re: Proposal: _ctype_ table bitwidth change



On Mar 23, 2011, at 12:32 PM, der Mouse wrote:

> If you believe that is*() must be passed the encoding, rather than the
> codepoint, then yes, it doesn't make sense to ask what isspace()
> returns for NO-BREAK SPACE in en_US.UTF-8, because you can't pass
> NO-BREAK SPACE to it (and what it returns for 0xa0 doesn't matter,
> because that's not a valid encoding).
> 
> I maintain that is*() must be passed the codepoint - the "character".


FWIW, from my admittedly rudimentary knowledge of i18n issues, is*() should be
passed the character encoded appropriately for the current locale. Obviously,
this is less than ideal for multi-byte encodings such as UTF-8, which is why
there's isw*(). However, it works fine for single-byte encodings, and I think
that if you consider how things should work in a single-byte encoding, say
ISO-8859-6, it'll be clearer that is*() should be passed the encoding rather
than the codepoint. If you've got a string encoded in 8859-6 and want to walk
through it calling isspace(), you'd just pass each octet to isspace(). You
wouldn't first run iconv() or the like on it to convert each octet to its
Unicode code point.
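
To make the octet-by-octet walk concrete, here is a minimal sketch. The
helper name count_space_octets() is made up for illustration, and it assumes
the caller has already selected the locale (e.g. with setlocale()) and that
the string is in that locale's single-byte encoding:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: count the octets of a single-byte-encoded string
 * that isspace() classifies as whitespace in the current locale. */
size_t count_space_octets(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* Cast to unsigned char: is*() takes an int whose value must be
         * EOF or representable as unsigned char, so octets >= 0x80 must
         * not be passed as negative plain-char values. */
        if (isspace((unsigned char)*s))
            n++;
    }
    return n;
}
```

Each octet goes to isspace() as-is; there is no conversion to a Unicode code
point anywhere in the loop.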

Also, see the definition of a character at
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html#tag_03_87

"A sequence of one or more bytes representing a single graphic symbol or
control code. Note: This term corresponds to the ISO C standard term multi-byte
character, where a single-byte character is a special case of a multi-byte
character."
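
For the multi-byte case, where a character is a sequence of bytes, a rough
sketch of the isw*() route might look like the following. The helper name
count_space_chars() is made up for illustration; it assumes the string is
encoded for the current locale and decodes it with mbrtowc() before
classifying each character with iswspace():

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: count whitespace characters in a multi-byte string
 * encoded for the current locale. Returns (size_t)-1 if the string
 * contains an invalid or truncated sequence. */
size_t count_space_chars(const char *s)
{
    mbstate_t st;
    size_t left, n = 0;

    assert(s != NULL);
    left = strlen(s);
    memset(&st, 0, sizeof st);      /* start in the initial shift state */
    while (left > 0) {
        wchar_t wc;
        size_t len = mbrtowc(&wc, s, left, &st);

        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;      /* invalid or incomplete sequence */
        if (len == 0)
            break;                  /* decoded a null character */
        if (iswspace((wint_t)wc))
            n++;
        s += len;                   /* advance by one whole character */
        left -= len;
    }
    return n;
}
```

Here the classification function sees one decoded character per call, however
many bytes that character occupied in the string.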
-- 
Name: Dave Huang         |  Mammal, mammal / their names are called /
INet: khym%azeotrope.org@localhost |  they raise a paw / the bat, the cat /
FurryMUCK: Dahan         |  dolphin and dog / koala bear and hog -- TMBG
Dahan: Hani G Y+C 35 Y++ L+++ W- C++ T++ A+ E+ S++ V++ F- Q+++ P+ B+ PA+ PL++


