Re: using the interfaces in ctype.h

To: NetBSD Userlevel Technical discussion list <tech-userlevel%NetBSD.org@localhost>
Subject: Re: using the interfaces in ctype.h
From: Alan Barrett <apb%cequrux.com@localhost>
Date: Mon, 21 Apr 2008 18:09:07 +0200

On Mon, 21 Apr 2008, Greg A. Woods; Planix, Inc. wrote:
>> If the implementation masked the value before using it, then it would be
>> unable to distinguish EOF from UCHAR_MAX (typically '\377').
>
> Indeed, however the current implementation doesn't even try to "detect" or 
> "distinguish" EOF, and indeed passing EOF without casting it properly 
> and/or masking will result in an out-of-bounds array access in the current 
> implementation.

What are you smoking?  The use of constructs like

                (_ctype_ + 1)[c] 

in NetBSD's implementation (both in the macros defined in ctype.h, and
in the C code defined in libc/gen/isctype.c) will access _ctype_[0] when
c == -1, and -1 happens to be the value that NetBSD used for EOF.

>> Since masking inside the
>> implementation would violate the requirement to distinguish EOF from
>> UCHAR_MAX, it's good that NetBSD doesn't do that.
>
> Huh?  That makes no sense whatsoever.

For example (assuming 8-bit chars), if the implementation did the
equivalent of

        c = c & 0xff;

before it used the value of c, then inputs of -1 (EOF) and 0xff (a
perfectly valid unsigned char, not the same as EOF) would both be
changed to 0xff, making it impossible for the rest of the code to
distinguish between these two inputs.

> What do you think the current NetBSD implementation does when given
> EOF anyway?  How about when it's given a signed integer variable that
> has been assigned the value of EOF?

In both cases, I think it does the right thing: isupper(-1) returns 0,
isalpha(-1) returns 0, etc.

> How about any other negative number which some user might have thought
> to be a useful way of extending the error reporting possible in such a
> situation?

The implementation would perform an out of bounds array access, which
leads to undefined behaviour.  This is perectly fine; the caller invoked
undefined behaviour by calling isfoo() with in invalid value, so they
deserve to have their program crash, or continue with incorrect results,
or anything.

>>> FreeBSD, OpenBSD, and Darwin all seem to have much better
>>> implementations, though they are all using proper (inline) functions
>>> which makes it easier in some ways to do it right.)
>>
>> I am mildly curious.  In what way are they "better"?
>
> Well they can't as easily be responsible for causing a program to crash, 
> for example.

You haven't shown an example, and I don't know what these other
implementations do.  Anyway, I don't subscribe to the theory that it's
"better" for the implementation to go out of its way to prevent an
erroneous program from crashing; I thhik that erroneous programs deserve
to crash.  However, making it crash with a useful error message and an
abort() is more friendly than just pressing on with bad data.

> I recommend the following slightly more portable technique for ctype.h:
>
>   #define _CTYPE_MASK ~(UINT_MAX << CHAR_BIT)

I believe that that's identical to UCHAR_MAX, given the way unsigned
arithmetic works, and that UCHAR_MAX+1 is guaranteed to be equal to
1<<CHAR_BIT.

>   #define isdigit(c)  ((int)(_ctype_ + 1)[((c) & _CTYPE_MASK)] & _N))

That's just wrong, as I explained before.  Given two distinct inputs
c == EOF (0xffffffff, if int is 32 bits) and c == UCHAR_MAX (0xff, if
char is 8 bits), the results from ((c) & _CTYPE_MASK) will be 0xff in
both cases, so the macro will be unable to distinguish between the two
inputs.  OK, '\xff' doesn't happen to be a digit in any character set
that I know about, so it doesn't matter in this particular case, but
cases in which it does matter are easy to imagine.

Hang on, it's even worse than that.  The C standard allows signed
integers to have a representation other than two's complement.  The
result of (-1 & 0xff) on a one's complement machine will be 0xfe, not
0xff.  NetBSD might not run on any one's complement machines, but I try
to consider them when writing code that's intended to be portable.

--apb (Alan Barrett)

Follow-Ups:
- Re: using the interfaces in ctype.h
  - From: Greg A. Woods; Planix, Inc.
- Re: using the interfaces in ctype.h
  - From: Joerg Sonnenberger

References:
- Re: using the interfaces in ctype.h
  - From: Christos Zoulas
- Re: using the interfaces in ctype.h
  - From: Greg A. Woods; Planix, Inc.
- Re: using the interfaces in ctype.h
  - From: Terry Moore
- Re: using the interfaces in ctype.h
  - From: Greg A. Woods; Planix, Inc.
- Re: using the interfaces in ctype.h
  - From: Alan Barrett
- Re: using the interfaces in ctype.h
  - From: Greg A. Woods; Planix, Inc.

Prev by Date: Re: using the interfaces in ctype.h
Next by Date: Re: using the interfaces in ctype.h
Previous by Thread: Re: using the interfaces in ctype.h
Next by Thread: Re: using the interfaces in ctype.h
Indexes:

Home | Main Index | Thread Index | Old Index