tech-userlevel archive


Re: Proposal: _ctype_ table bitwidth change



> > The most important point is that is* functions accept an octet, not a
> > code point.
> 
> They do?  Where is this defined?
> 
> Historically, it has been false: is*() has been documented to accept
> "characters", which I can't read as anything but codepoints.
> 
> That some charsets have some codepoints that can't fit in unsigned char
> (at least when, as on NetBSD, unsigned char is just one octet) just
> means that is*() aren't useful for more than just 256 of their possible
> codepoints, not that they somehow get retconned to take just one octet
> of a storage encoding of a codepoint.
> 
> At least, that's how I read it.  Is there a spec somewhere which spells
> this out precisely?

As far as I know, there is no explicit description.

However, to begin with, ISO C doesn't define any concept like "codepoint."
It defines only two representations: "(single-byte/multibyte) character"
and "wide character".
I wonder how the is* functions could be affected by a concept that the
standard does not define.

In addition, ISO C contains a passage implying that the is* functions
accept an "octet".

7.25.2.1 Wide character classification functions:

  Each of the following functions (note: isw* functions) returns true
  for each wide character that corresponds (as if by a call to the wctob
  function) to a single-byte character for which the corresponding
  character classification function (note: is* functions) from 7.4.1
  returns true, except that the iswgraph and iswpunct functions may
  differ with respect to wide characters other than L' ' that are both
  printing and white-space wide characters.

  (The 'note:' annotations are mine.)
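
The correspondence this clause describes can be probed with a short C
sketch. It is only an illustration under stated assumptions:
count_alpha_mismatches() is a hypothetical helper name, and isalpha/iswalpha
are picked as one representative pair. For every value that fits in
unsigned char, it maps the byte to a wide character with btowc(), confirms
the round trip through wctob(), and counts the cases where isalpha() is
true but iswalpha() is not.

```c
#include <ctype.h>
#include <limits.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper (not part of any library): count the single-byte
 * characters c for which isalpha(c) is true while iswalpha() of the
 * corresponding wide character is false.  The clause quoted above
 * implies this count is zero. */
static int count_alpha_mismatches(void)
{
    int mismatches = 0;

    for (int c = 0; c <= UCHAR_MAX; c++) {
        /* Map the single byte to its wide character, if it has one. */
        wint_t wc = btowc(c);

        /* Skip bytes with no single-byte <-> wide correspondence. */
        if (wc == WEOF || wctob(wc) != c)
            continue;

        /* is*() true must imply isw*() true for the matching wide char. */
        if (isalpha(c) && !iswalpha(wc))
            mismatches++;
    }
    return mismatches;
}
```

The loop runs over unsigned char values, not over the code points of any
character set, because that is the only domain in which the clause relates
the two function families. The check uses whatever LC_CTYPE locale is in
effect, which is the "C" locale unless the program calls setlocale() first.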

Note that this part was added in the 1995 revision (C95).
ISO C seems to contain some ambiguity about "character,"
especially in the parts that have existed since 1989 (C89).


---
Takuya SHIOZAKI

