tech-userlevel archive
Re: Proposal: _ctype_ table bitwidth change
Hi,
> On Tue, Mar 22, 2011 at 04:35:10AM +0900, Takehiko NOZAKI wrote:
> > > As I wrote earlier, IMO the correct approach is to make the rune table
> > > the public interface. Drop the current _CTYPE_* macros for anything but
> > > legacy purposes. Drop them completely after the next major bump.
> > > chrtbl is dead already and I plan to remove the rest of the libc code
> > > soonish, it just complicated this without any real gain.
> > >
> >
> > no, _ctype_(for is*) and rune(for isw*) *must* be separated, example:
> >
> > #include <ctype.h>
> > #include <locale.h>
> > #include <stdio.h>
> > #include <wchar.h>
> > #include <wctype.h>
> >
> > int
> > main(void)
> > {
> >     setlocale(LC_ALL, "en_US.UTF-8");
> >     printf("isspace:%d\n", isspace((unsigned char)0xA0));
> >     printf("iswspace:%d\n", iswspace((wchar_t)0xA0));
> > }
> >
> > this code print:
> >
> > isspace:0
> > iswspace:1
> >
> > apparently ctype table and wctype table *differ*.
>
> Yes, but that doesn't mean they can't use the same format. The problem
> in this case is that 0xa0 is not a valid UTF8 sequence by itself.

ISO C99 says:

    In all cases the argument is an int, the value of which shall be
    representable as an unsigned char or shall equal the value of the
    macro EOF. If the argument has any other value, the behavior is
    undefined.

Here,
- 0xa0 is representable as an unsigned char, and
- 0xa0 is not a space character.

Thus, to conform to the standard, the behavior of isspace(0xa0) must
be defined, and it should return 0 even though 0xa0 is not a valid
character by itself.
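
To make the domain in that quotation concrete, here is a minimal sketch. The helper names count_space_values() and is_space_char() are mine, not anything in libc. It walks every value for which isspace() is defined (0..UCHAR_MAX plus EOF) and shows the usual cast needed before passing a plain char.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdio.h>   /* EOF */

/*
 * Count the values in the defined domain of isspace() for which it
 * returns nonzero in the current locale.  Every call in this loop has
 * defined behavior, including bytes such as 0xa0.
 */
static int
count_space_values(void)
{
    int c, n = 0;

    for (c = 0; c <= UCHAR_MAX; c++)
        if (isspace(c))
            n++;
    /* EOF is also a valid argument; it is not a character of any class. */
    assert(!isspace(EOF));
    return n;
}

/*
 * A plain char may be negative, which is outside the defined domain,
 * so convert to unsigned char before calling isspace().
 */
static int
is_space_char(char ch)
{
    return isspace((unsigned char)ch) != 0;
}
```

In the "C" locale that a program starts in, only ' ', '\t', '\n', '\v', '\f' and '\r' are counted. What other locales report for bytes above 0x7f is exactly the question in this thread.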

To implement such behavior, there are several options:

1. use a separate table only for is*:
     #define isspace(c) (_sb_ctype_[(c)+1] & _S)
2. introduce a "multibyte fragment" flag in the integrated ctype table:
     #define isspace(c) ((_ctype_[(c)+1] & (_F|_S)) == _S)
3. make is* check whether the argument is a complete character:
     #define isspace(c) ((_ctype_[(c)+1] & _S) && ((c)!=EOF) && (btowc(c)!=WEOF))

I prefer #1 because it keeps the ABI simple.
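
For comparison, options 1 and 2 can be sketched as ordinary functions over toy tables. Everything prefixed x_ or X_ is hypothetical: the real values of _S and _F and the real table contents are not given in this thread. I also assume EOF == -1, as the c+1 indexing in the macros does, and that the integrated table marks 0xa0 as a space to match the iswspace() output quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>   /* EOF */

/* Hypothetical flag bits, standing in for _S and _F. */
#define X_S 0x01     /* space */
#define X_F 0x02     /* multibyte fragment (option 2 only) */

/* Slot 0 is for EOF, slot c+1 is for byte value c. */
static unsigned char x_sb_ctype[1 + 256];  /* option 1: used only by is* */
static unsigned char x_ctype[1 + 256];     /* option 2: integrated table */

static void
x_init_tables(void)
{
    static const unsigned char spaces[] = { ' ', '\t', '\n', '\v', '\f', '\r' };
    size_t i;

    assert(EOF == -1);  /* the c+1 indexing relies on this */
    for (i = 0; i < sizeof(spaces); i++) {
        x_sb_ctype[spaces[i] + 1] |= X_S;
        x_ctype[spaces[i] + 1] |= X_S;
    }
    /*
     * Assumption: the integrated table classifies 0xa0 as a space for
     * the wide-character side, but as a lone byte in UTF-8 it is only
     * a fragment, so it also carries X_F.
     */
    x_ctype[0xa0 + 1] |= X_S | X_F;
}

/* Option 1: the single-byte table never marks 0xa0 in the first place. */
static int
x_isspace_opt1(int c)
{
    return (x_sb_ctype[c + 1] & X_S) != 0;
}

/* Option 2: a space only if X_S is set and X_F is clear. */
static int
x_isspace_opt2(int c)
{
    return (x_ctype[c + 1] & (X_F | X_S)) == X_S;
}
```

After x_init_tables(), both functions return 0 for 0xa0 and for EOF, and 1 for ' '. The difference is where the knowledge lives: option 1 keeps it in a second table, option 2 spends a flag bit in the shared one.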
---
Takuya SHIOZAKI