Re: using the interfaces in ctype.h

To: Alan Barrett <apb%cequrux.com@localhost>
Subject: Re: using the interfaces in ctype.h
From: "Greg A. Woods; Planix, Inc." <woods%planix.ca@localhost>
Date: Mon, 21 Apr 2008 13:36:33 -0400


On 21-Apr-08, at 12:09 PM, Alan Barrett wrote:

On Mon, 21 Apr 2008, Greg A. Woods; Planix, Inc. wrote:
If the implementation masked the value before using it, then it would be
unable to distinguish EOF from UCHAR_MAX (typically '\377').
Indeed, however the current implementation doesn't even try to "detect" or
"distinguish" EOF, and indeed passing EOF without casting it properly
and/or masking will result in an out-of-bounds array access in the current
implementation.
What are you smoking?  The use of constructs like

                (_ctype_ + 1)[c]

in NetBSD's implementation (both in the macros defined in ctype.h, and
in the C code defined in libc/gen/isctype.c) will access _ctype_[0] when
c == -1, and -1 happens to be the value that NetBSD used for EOF.
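
To make the indexing concrete, here is a minimal sketch of the offset-table idiom. The names demo_ctype, DEMO_DIGIT, demo_init and demo_isdigit are hypothetical stand-ins, not NetBSD's actual table or macros, and the sketch assumes EOF is -1.

```c
#include <assert.h>
#include <limits.h>

#define DEMO_DIGIT 0x04   /* hypothetical class bit, not NetBSD's _N */

/* Hypothetical table: slot 0 is reserved for EOF (-1), followed by one
 * slot for every unsigned char value 0..UCHAR_MAX. */
static unsigned char demo_ctype[1 + UCHAR_MAX + 1];

/* Mark the decimal digits; C guarantees '0'..'9' are contiguous. */
static void demo_init(void)
{
    int ch;

    for (ch = '0'; ch <= '9'; ch++)
        demo_ctype[1 + ch] = DEMO_DIGIT;
}

/* Same shape as (_ctype_ + 1)[c]. The lookup is only defined for
 * c == -1 or 0..UCHAR_MAX; the assert exists so that this demo never
 * reads outside its own table. */
static int demo_isdigit(int c)
{
    assert(c >= -1 && c <= UCHAR_MAX);
    return (demo_ctype + 1)[c] & DEMO_DIGIT;
}
```

After demo_init(), demo_isdigit('7') is nonzero and demo_isdigit(-1) reads demo_ctype[0], which is zero. Any other negative argument would index before the start of the table if the assert were not there.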

Ah, right, OK, sorry, my mistake. However that's really just a pedantic point irrelevant to my main argument. So, -1, which in our case happens to be EOF, is OK.


However that does nothing to help for any other negative values.

Assuming that the caller will only use -1 or a value between 0 and _CTYPE_NUM_CHARS is not safe when the implementation is accessing an array of only _CTYPE_NUM_CHARS+1 and the prototype for the API specifies a parameter of type "int". If the implementation were an inline function that could protect the array from out-of-bounds access then that would be fine, but it's not on NetBSD.

Since masking inside the
implementation would violate the requirement to distinguish EOF from
UCHAR_MAX, it's good that NetBSD doesn't do that.


Huh?  That makes no sense whatsoever.


For example (assuming 8-bit chars), if the implementation did the
equivalent of

        c = c & 0xff;

before it used the value of c, then inputs of -1 (EOF) and 0xff (a
perfectly valid unsigned char, not the same as EOF) would both be
changed to 0xff, making it impossible for the rest of the code to
distinguish between these two inputs.
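
A small sketch of the collision described above, assuming 8-bit chars and a two's complement int. The names demo_mask and demo_show_collision are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* The masking step under discussion, written out as a function. */
static int demo_mask(int c)
{
    return c & 0xff;
}

/* Print what the mask does to EOF (-1) and to the character 0xff. */
static void demo_show_collision(void)
{
    /* Both asserts rely on the two's complement assumption stated above. */
    assert(demo_mask(-1) == 0xff);
    assert(demo_mask(0xff) == 0xff);

    printf("-1   -> 0x%x\n", (unsigned int)demo_mask(-1));
    printf("0xff -> 0x%x\n", (unsigned int)demo_mask(0xff));
}
```

Once both inputs have been reduced to the same value, no code that runs after the mask can tell which one the caller passed.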

Huh? In NetBSD both (_ctype_+1)[-1]==0 and (_ctype_+1)[0xff]==0 so what's to be distinguished?!?!?!?

Furthermore how's that any different than suggesting that the caller cast the parameter with "(unsigned char)" or "(int)(unsigned char)"? The cast still causes the passed value to be effectively masked with 0xFF and so even if the implementation did want to distinguish a character of 0xFF from the value of EOF it could not.

You gain far more by building the cast into the implementation rather than effectively forcing the application to employ it. At least with it built in, the application won't thwart any future or alternate implementation from detecting EOF before doing anything else with the value.

(FreeBSD, OpenBSD, and Darwin all seem to have much better
implementations, though they are all using proper (inline) functions
which makes it easier in some ways to do it right.)
I am mildly curious.  In what way are they "better"?
Well they can't as easily be responsible for causing a program to crash,
for example.
You haven't shown an example, and I don't know what these other
implementations do.

If you'd like, I can point you at HTTP-accessible copies of the other implementations....

Anyway, I don't subscribe to the theory that it's
"better" for the implementation to go out of its way to prevent an
erroneous program from crashing; I think that erroneous programs deserve
to crash.
However, making it crash with a useful error message and an
abort() is more friendly than just pressing on with bad data.

I would agree entirely though I suspect there are many folks who would disagree (witness the outrage when assert() was sprinkled elsewhere about in libc). However the NetBSD implementation doesn't even try -- it just behaves naively and may then access memory outside the defined object's allocated storage. At least with a built-in mask on the array access value nothing untoward can happen.

The more expensive inline function style of implementation would afford both better ways of forcing an application to abort, as well as better ways of safely ignoring values out of range, thus offering the ideal solution to both our desire to force broken applications to crash as well as the desire of others to treat them benignly and allow them to run safely.
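
As a rough sketch of those two behaviours, and not a copy of what FreeBSD, OpenBSD, or Darwin actually do, an inline-function implementation could look something like the following. Every name here (demo_ctype, DEMO_DIGIT, demo_in_range, demo_lookup, demo_isdigit_strict, demo_isdigit_lenient) is hypothetical, and EOF is assumed to be -1.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

#define DEMO_DIGIT 0x04   /* hypothetical class bit */

/* Hypothetical table: slot 0 for EOF (-1), then one slot per unsigned char. */
static const unsigned char demo_ctype[1 + UCHAR_MAX + 1] = {
    [1 + '0'] = DEMO_DIGIT, [1 + '1'] = DEMO_DIGIT, [1 + '2'] = DEMO_DIGIT,
    [1 + '3'] = DEMO_DIGIT, [1 + '4'] = DEMO_DIGIT, [1 + '5'] = DEMO_DIGIT,
    [1 + '6'] = DEMO_DIGIT, [1 + '7'] = DEMO_DIGIT, [1 + '8'] = DEMO_DIGIT,
    [1 + '9'] = DEMO_DIGIT,
};

/* The only arguments the table can serve: EOF or an unsigned char value. */
static inline int demo_in_range(int c)
{
    return c >= -1 && c <= UCHAR_MAX;
}

/* Raw table access; callers must have validated c already. */
static inline int demo_lookup(int c)
{
    assert(demo_in_range(c));
    return (demo_ctype + 1)[c];
}

/* Strict variant: report the bad argument and abort the program. */
static inline int demo_isdigit_strict(int c)
{
    if (!demo_in_range(c)) {
        fprintf(stderr, "demo_isdigit_strict: argument %d out of range\n", c);
        abort();
    }
    return demo_lookup(c) & DEMO_DIGIT;
}

/* Lenient variant: an out-of-range argument belongs to no class. */
static inline int demo_isdigit_lenient(int c)
{
    if (!demo_in_range(c))
        return 0;
    return demo_lookup(c) & DEMO_DIGIT;
}
```

Either variant keeps the table access in bounds and still lets the implementation see EOF as a value distinct from every character, at the cost of one range comparison per call.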

For my own use the built-in mask affords the latter solution transparently to applications, and without having to hack too much of the NetBSD code, so that's the way I'll go for the near term.

I recommend the following slightly more portable technique for ctype.h:
 #define _CTYPE_MASK    ~(UINT_MAX << CHAR_BIT)


I believe that that's identical to UCHAR_MAX, given the way unsigned
arithmetic works, and that UCHAR_MAX+1 is guaranteed to be equal to
1<<CHAR_BIT.

Yes it may be true that UCHAR_MAX has the same value as my mask, at least in NetBSD, but that's not how _ctype_ is defined in NetBSD. _ctype_ is defined in terms of CHAR_BIT, so the definition I chose is more readable and more logical (in my opinion, of course) than using any other unrelated constant or macro referring to an unrelated constant, and thus both the mask used to access _ctype_ and the definition of _ctype_ itself are simultaneously dependent on the same macro and independent of UCHAR_MAX. However perhaps my definition should be:


        #define _CTYPE_MASK     ~(~0UL << CHAR_BIT)

just to be pedantic and portable and to avoid any reference to any other constant.
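
For anyone who wants to check the arithmetic, here is a small sketch comparing both spellings of the mask with UCHAR_MAX. The DEMO_ names and the helper functions are hypothetical, outer parentheses have been added around the macro bodies so that they are safe inside larger expressions, and the sketch assumes CHAR_BIT is smaller than the width of unsigned int (otherwise the shift itself is undefined).

```c
#include <assert.h>
#include <limits.h>

/* Both spellings of the mask from this thread, under demo names. */
#define DEMO_MASK_UINT   (~(UINT_MAX << CHAR_BIT))
#define DEMO_MASK_ULONG  (~(~0UL << CHAR_BIT))

/* Check that each mask has the same value as UCHAR_MAX. */
static void demo_check_masks(void)
{
    assert(DEMO_MASK_UINT == (unsigned int)UCHAR_MAX);
    assert(DEMO_MASK_ULONG == (unsigned long)UCHAR_MAX);
}

/* Reduce any int argument to an index in the range 0..UCHAR_MAX. */
static unsigned int demo_index(int c)
{
    return (unsigned int)c & DEMO_MASK_UINT;
}
```

With 8-bit chars demo_index('A') is 'A' and demo_index(-1) is 0xff, so the masked index always lands inside a table of UCHAR_MAX+1 entries.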

 #define isdigit(c)     ((int)((_ctype_ + 1)[((c) & _CTYPE_MASK)] & _N))


That's just wrong, as I explained before.  Given two distinct inputs
c == EOF (0xffffffff, if int is 32 bits) and c == UCHAR_MAX (0xff, if
char is 8 bits), the results from ((c) & _CTYPE_MASK) will be 0xff in
both cases, so the macro will be unable to distinguish between the two
inputs.  OK, '\xff' doesn't happen to be a digit in any character set
that I know about, so it doesn't matter in this particular case, but
cases in which it does matter are easy to imagine.

In fact with the implementation of the NetBSD "ctype" is*() and to*() APIs, nothing outside of the proper range of ASCII is meaningful and so 0xFF is always outside the range of valid inputs.

Hang on, it's even worse than that.  The C standard allows signed
integers to have a representation other than two's complement.  The
result of (-1 & 0xff) on a one's complement machine will be 0xfe, not
0xff. NetBSD might not run on any one's complement machines, but I try
to consider them when writing code that's intended to be portable.

Your tangent about running on systems not supporting two's complement is interesting, however I think it is well outside and beyond the context of NetBSD, which I would humbly suggest will not run on any such hardware any time soon and without vast effort on both the OS side of things as well as within many applications which use the "ctype" APIs. That's a boat that sailed off and sank quite some time ago. :-)


--
                                        Greg A. Woods; Planix, Inc.
                                        <woods%planix.ca@localhost>
