tech-userlevel archive


Re: using the interfaces in ctype.h




On 21-Apr-08, at 12:09 PM, Alan Barrett wrote:

On Mon, 21 Apr 2008, Greg A. Woods; Planix, Inc. wrote:
If the implementation masked the value before using it, then it would be
unable to distinguish EOF from UCHAR_MAX (typically '\377').

Indeed, however the current implementation doesn't even try to "detect" or
"distinguish" EOF, and indeed passing EOF without casting it properly
and/or masking will result in an out-of-bounds array access in the current
implementation.

What are you smoking?  The use of constructs like

                (_ctype_ + 1)[c]

in NetBSD's implementation (both in the macros defined in ctype.h, and
in the C code defined in libc/gen/isctype.c) will access _ctype_[0] when
c == -1, and -1 happens to be the value that NetBSD uses for EOF.

Ah, right, OK, sorry, my mistake. However that's really just a pedantic point irrelevant to my main argument. So, -1, which in our case happens to be EOF, is OK.
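For readers following along, here is a minimal sketch of the offset-table construct being discussed. The names demo_ctype, DEMO_N and DEMO_NUM_CHARS are hypothetical stand-ins, not NetBSD's actual definitions, and the table only marks the ten decimal digits.

```c
#include <assert.h>
#include <limits.h>

#define DEMO_NUM_CHARS (UCHAR_MAX + 1)  /* hypothetical stand-in for _CTYPE_NUM_CHARS */
#define DEMO_N         0x04             /* hypothetical "digit" flag, standing in for _N */

/* Slot 0 is reserved for EOF (-1); slots 1..DEMO_NUM_CHARS describe chars 0..UCHAR_MAX. */
static const unsigned char demo_ctype[1 + DEMO_NUM_CHARS] = {
    [1 + '0'] = DEMO_N, [1 + '1'] = DEMO_N, [1 + '2'] = DEMO_N, [1 + '3'] = DEMO_N,
    [1 + '4'] = DEMO_N, [1 + '5'] = DEMO_N, [1 + '6'] = DEMO_N, [1 + '7'] = DEMO_N,
    [1 + '8'] = DEMO_N, [1 + '9'] = DEMO_N
};

/* Same shape as the (_ctype_ + 1)[c] construct: no range check, no masking. */
#define demo_isdigit(c) ((int)((demo_ctype + 1)[(c)] & DEMO_N))

/* Self-check: indexing with -1 through the offset pointer lands on element 0. */
static void demo_offset_check(void)
{
    assert(&(demo_ctype + 1)[-1] == &demo_ctype[0]);
    assert(demo_isdigit(-1) == 0);      /* the EOF slot carries no flags */
    assert(demo_isdigit('7') != 0);
    assert(demo_isdigit('x') == 0);
}
```

Calling demo_offset_check() exercises only -1 and values in 0..UCHAR_MAX, which is exactly the domain the table covers.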

However that does nothing to help for any other negative values.

Assuming that the caller will only use -1 or a value between 0 and _CTYPE_NUM_CHARS-1 is not safe when the implementation is accessing an array of only _CTYPE_NUM_CHARS+1 elements and the prototype for the API specifies a parameter of type "int". If the implementation were an inline function that could protect the array from out-of-bounds access then that would be fine, but it's not on NetBSD.
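To make the hazard concrete, the sketch below shows how an ordinary byte with the top bit set becomes a negative "int" other than -1 when it is held in a signed char, and how the caller-side cast brings it back into range. The helper names demo_in_domain and demo_ctype_arg are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical predicate: can a table of 1 + (UCHAR_MAX + 1) slots be indexed with c? */
static int demo_in_domain(int c)
{
    return c >= -1 && c <= UCHAR_MAX;
}

/* Hypothetical helper: the cast a caller applies before handing a char to is*(). */
static int demo_ctype_arg(char ch)
{
    return (int)(unsigned char)ch;
}

static void demo_domain_check(void)
{
    signed char raw = -23;   /* a byte with the top bit set, read as signed */

    /* Promoted to int unchanged, the value falls before the start of the table. */
    assert(!demo_in_domain(raw));

    /* After the cast it is an ordinary value in 0..UCHAR_MAX. */
    assert(demo_in_domain(demo_ctype_arg((char)raw)));
    assert(demo_ctype_arg((char)raw) == UCHAR_MAX + 1 - 23);
}
```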

Since masking inside the
implementation would violate the requirement to distinguish EOF from
UCHAR_MAX, it's good that NetBSD doesn't do that.

Huh?  That makes no sense whatsoever.

For example (assuming 8-bit chars), if the implementation did the
equivalent of

        c = c & 0xff;

before it used the value of c, then inputs of -1 (EOF) and 0xff (a
perfectly valid unsigned char, not the same as EOF) would both be
changed to 0xff, making it impossible for the rest of the code to
distinguish between these two inputs.
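The collision described here can be checked directly. This is a small sketch, assuming a two's complement machine; demo_masked is a hypothetical stand-in for a mask applied inside the implementation.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for masking done inside the implementation. */
static int demo_masked(int c)
{
    return c & UCHAR_MAX;
}

static void demo_collision_check(void)
{
    int eof_like = -1;           /* the value NetBSD uses for EOF */
    int last_char = UCHAR_MAX;   /* a valid unsigned char value */

    assert(eof_like != last_char);

    /* On two's complement both inputs come out as UCHAR_MAX after masking. */
    assert(demo_masked(eof_like) == demo_masked(last_char));

    /* Values already in range pass through unchanged. */
    assert(demo_masked('A') == 'A');
}
```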

Huh? In NetBSD both (_ctype_+1)[-1]==0 and (_ctype_+1)[0xff]==0 so what's to be distinguished?!?!?!?

Furthermore how's that any different than suggesting that the caller cast the parameter with "(unsigned char)" or "(int)(unsigned char)"? The cast still causes the passed value to be effectively masked with 0xFF and so even if the implementation did want to distinguish a character of 0xFF from the value of EOF it could not.

You gain far more by building the cast into the implementation rather than effectively forcing the application to employ it. At least with it built in then the application won't thwart any future or alternate implementation from detecting EOF before doing anything else with the value.

FreeBSD, OpenBSD, and Darwin all seem to have much better
implementations, though they are all using proper (inline) functions
which makes it easier in some ways to do it right.

I am mildly curious.  In what way are they "better"?

Well they can't as easily be responsible for causing a program to crash,
for example.

You haven't shown an example, and I don't know what these other
implementations do.

If you'd like, I can point you at HTTP-accessible copies of the other implementations.

Anyway, I don't subscribe to the theory that it's
"better" for the implementation to go out of its way to prevent an
erroneous program from crashing; I thhik that erroneous programs deserve
to crash.
However, making it crash with a useful error message and an
abort() is more friendly than just pressing on with bad data.

I would agree entirely though I suspect there are many folks who would disagree (witness the outrage when assert() was sprinkled elsewhere about in libc). However the NetBSD implementation doesn't even try -- it just behaves naively and may then access memory outside the defined object's allocated storage. At least with a built-in mask on the array access value nothing untoward can happen.

The more expensive inline function style of implementation would afford both better ways of forcing an application to abort, as well as better ways of safely ignoring values out of range, thus offering the ideal solution to both our desire to force broken applications to crash as well as the desire of others to treat them benignly and allow them to run safely.
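As a rough sketch of what the inline-function style could look like, here are two variants: one that aborts with a message on an out-of-range argument and one that quietly treats it as "no class". All names are hypothetical, and this is not the code of FreeBSD, OpenBSD, Darwin or NetBSD.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

#define DEMO_N 0x04   /* hypothetical "digit" flag */

/* Slot 0 is for EOF (-1); slots 1..UCHAR_MAX+1 describe chars 0..UCHAR_MAX. */
static const unsigned char demo_ctype[1 + UCHAR_MAX + 1] = {
    [1 + '0'] = DEMO_N, [1 + '1'] = DEMO_N, [1 + '2'] = DEMO_N, [1 + '3'] = DEMO_N,
    [1 + '4'] = DEMO_N, [1 + '5'] = DEMO_N, [1 + '6'] = DEMO_N, [1 + '7'] = DEMO_N,
    [1 + '8'] = DEMO_N, [1 + '9'] = DEMO_N
};

/* Strict policy: a broken caller gets a message and an abort(). */
static inline int demo_isdigit_strict(int c)
{
    if (c < -1 || c > UCHAR_MAX) {
        fprintf(stderr, "demo_isdigit_strict: argument %d out of range\n", c);
        abort();
    }
    return (demo_ctype + 1)[c] & DEMO_N;
}

/* Lenient policy: out-of-range values are simply in no character class. */
static inline int demo_isdigit_lenient(int c)
{
    if (c < -1 || c > UCHAR_MAX)
        return 0;
    return (demo_ctype + 1)[c] & DEMO_N;
}

static void demo_policy_check(void)
{
    assert(demo_isdigit_strict('5') != 0);
    assert(demo_isdigit_strict(-1) == 0);
    assert(demo_isdigit_lenient(-23) == 0);   /* no out-of-bounds access */
    assert(demo_isdigit_lenient('5') != 0);
}
```

Either variant pays for a range comparison on every call, which is the extra expense mentioned above.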

For my own use the built-in mask affords the latter solution transparently to applications, and without having to hack too much of the NetBSD code, so that's the way I'll go for the near term.

I recommend the following slightly more portable technique for ctype.h:

 #define _CTYPE_MASK    ~(UINT_MAX << CHAR_BIT)

I believe that that's identical to UCHAR_MAX, given the way unsigned
arithmetic works, and that UCHAR_MAX+1 is guaranteed to be equal to
1<<CHAR_BIT.

Yes it may be true that UCHAR_MAX has the same value as my mask, at least in NetBSD, but that's not how _ctype_ is defined in NetBSD. _ctype_ is defined in terms of CHAR_BIT, so the definition I chose is more readable and more logical (in my opinion, of course) than one using an unrelated constant or macro. That way both the mask used to access _ctype_ and the definition of _ctype_ itself depend on the same macro and are independent of UCHAR_MAX. However perhaps my definition should be:

        #define _CTYPE_MASK     ~(~0UL << CHAR_BIT)

just to be pedantic and portable and to avoid any reference to any other constant.
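Whether the two spellings of the mask agree with UCHAR_MAX on a given platform is easy to check. The sketch below uses hypothetical macro names and assumes CHAR_BIT is smaller than the width of unsigned int, since shifting by the full width would be undefined.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical names for the two mask definitions under discussion. */
#define DEMO_MASK_UINT   (~(UINT_MAX << CHAR_BIT))
#define DEMO_MASK_ULONG  (~(~0UL << CHAR_BIT))

static void demo_mask_check(void)
{
    /* Both expressions leave exactly the low CHAR_BIT bits set. */
    assert(DEMO_MASK_UINT == UCHAR_MAX);
    assert(DEMO_MASK_ULONG == UCHAR_MAX);
}
```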


 #define isdigit(c)     ((int)((_ctype_ + 1)[((c) & _CTYPE_MASK)] & _N))

That's just wrong, as I explained before.  Given two distinct inputs
c == EOF (0xffffffff, if int is 32 bits) and c == UCHAR_MAX (0xff, if
char is 8 bits), the results from ((c) & _CTYPE_MASK) will be 0xff in
both cases, so the macro will be unable to distinguish between the two
inputs.  OK, '\xff' doesn't happen to be a digit in any character set
that I know about, so it doesn't matter in this particular case, but
cases in which it does matter are easy to imagine.

In fact with the implementation of the NetBSD "ctype" is*() and to*() APIs, nothing outside of the proper range of ASCII is meaningful and so 0xFF is always outside the range of valid inputs.

Hang on, it's even worse than that.  The C standard allows signed
integers to have a representation other than two's complement.  The
result of (-1 & 0xff) on a one's complement machine will be 0xfe, not
0xff. NetBSD might not run on any one's complement machines, but I try
to consider them when writing code that's intended to be portable.
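Since one's complement hardware is hard to come by, the arithmetic can only be simulated. The sketch below models the bit patterns with unsigned values on an ordinary machine; the function names are hypothetical, and the one's complement model is valid only for arguments greater than INT_MIN.

```c
#include <assert.h>

/* Bit pattern of v in two's complement: conversion to unsigned is modular. */
static unsigned demo_twos_bits(int v)
{
    return (unsigned)v;
}

/* Simulated one's complement pattern: a negative value is the bitwise
 * inverse of its magnitude.  Valid for v > INT_MIN only. */
static unsigned demo_ones_bits(int v)
{
    return v < 0 ? ~(unsigned)(-v) : (unsigned)v;
}

static void demo_repr_check(void)
{
    assert((demo_twos_bits(-1) & 0xffu) == 0xffu);
    assert((demo_ones_bits(-1) & 0xffu) == 0xfeu);

    /* Non-negative values look the same in both representations. */
    assert((demo_ones_bits(0xff) & 0xffu) == 0xffu);
}
```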

Your tangent about running on systems not supporting two's complement is interesting, however I think it is well outside and beyond the context of NetBSD, which I would humbly suggest will not run on any such hardware any time soon and without vast effort on both the OS side of things as well as within many applications which use the "ctype" APIs. That's a boat that sailed off and sank quite some time ago. :-)

--
                                        Greg A. Woods; Planix, Inc.
                                        <woods%planix.ca@localhost>


