tech-userlevel archive


Re: using the interfaces in ctype.h

On 22-Apr-08, at 3:58 AM, Alan Barrett wrote:

> On Mon, 21 Apr 2008, Greg A. Woods; Planix, Inc. wrote:
>> Besides, the standards don't, so far as I can tell, require
>> implementations to always return zero for all the is*() APIs when EOF
>> is passed to them.  This whole "the mask prevents the implementation
>> from distinguishing between 0xFF and EOF" claim is completely bogus.
>> It just doesn't matter what these functions return when passed EOF --
>> their result in that case is undefined anyway.
>
> Since you are the only person who appears to believe the above claims,
> please justify them with detailed references to sections in the
> C99 standard.

On further reading I'll retract that claim. Sorry, I was a bit out of line in what I said there; I had not yet fully considered the implications of LC_CTYPE support and of ISO-8859.

I do now seem to remember thinking a very long time ago when I first encountered ISO-8859 that there were going to be problems for some implementations due to the unfortunate use of the 0xFF code for a valid character.

Unfortunately the C99 standard's wording on this subject seems particularly obtuse. I can't find anything definitive without quoting a large number of widely scattered passages. Then again, the standard is so careful to avoid saying anything specific about implementation details that it even avoids pinning down EOF: under its definition EOF can be pretty much any negative integer value, if I'm reading things right (e.g. in 7.19.1).

The Single UNIX Specification is similarly vague about EOF, but it does at least imply that the is*() APIs should return zero for anything other than the kind of character they're supposed to match. Since EOF is by definition never a valid character of any type, that implies they must all return zero when passed EOF. Using a mask in the way I suggested works fine for ASCII, of course, as well as for some of the ISO-8859 charsets, but unfortunately not all of them, and not even ISO-8859-1, where 0xFF is a valid character.
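To make the aliasing concrete, here is a minimal sketch. It is not taken from any libc: the table name, the flag value, and the function name are all made up for illustration, and only the one table entry the example needs is filled in. It assumes EOF is -1, which is common but not required by the standard.

```c
#include <stdio.h>

/* Hypothetical flag bit for "lowercase letter". */
#define X_LOWER 0x01

/*
 * Hypothetical 256-entry classification table for ISO-8859-1.
 * Only 0xFF (LATIN SMALL LETTER Y WITH DIAERESIS) is filled in here.
 */
static const unsigned char latin1_tab[256] = {
    [0xFF] = X_LOWER,
};

/*
 * Mask-only lookup.  The mask keeps the index inside the table, but
 * with EOF == -1 the expression (EOF & 0xFF) is 0xFF, so EOF and the
 * character 0xFF land on the same table entry.
 */
static int masked_islower(int c)
{
    return latin1_tab[c & 0xFF] & X_LOWER;
}
```

With this table, masked_islower(0xFF) is non-zero, as it should be, but masked_islower(-1) is non-zero too, so a caller passing EOF is told it has a lowercase letter.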

Now I think I better understand the #if USE_ASCII conditional code in Darwin's ctype.h. I think I like this aspect of the Darwin implementation best. It provides the very best possible performance for the simple ASCII-only case, and then jumps right into using functions (inline where possible) for anything beyond ASCII. There's still a short-circuit in the core function used by some of the API functions, where ASCII-only values are handled by a direct array access, thus avoiding a second function call.
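The general shape of that short-circuit can be sketched as follows. This is not Darwin's source; ascii_tab, slow_istype, my_istype and my_isdigit are hypothetical names, the slow path is a stub, and the table initializer assumes an ASCII execution character set.

```c
#include <stdio.h>

/* Hypothetical flag bit for "decimal digit". */
#define CT_DIGIT 0x01

/* Hypothetical ASCII-only table: 128 entries, only the digits marked. */
static const unsigned char ascii_tab[128] = {
    ['0'] = CT_DIGIT, ['1'] = CT_DIGIT, ['2'] = CT_DIGIT, ['3'] = CT_DIGIT,
    ['4'] = CT_DIGIT, ['5'] = CT_DIGIT, ['6'] = CT_DIGIT, ['7'] = CT_DIGIT,
    ['8'] = CT_DIGIT, ['9'] = CT_DIGIT,
};

/*
 * Stand-in for the locale-aware slow path.  A real implementation
 * would consult the current LC_CTYPE data; this stub matches nothing.
 */
static int slow_istype(int c, unsigned mask)
{
    (void)c;
    (void)mask;
    return 0;
}

/*
 * Core test: values in the ASCII range are answered by a direct array
 * access; everything else (including EOF and 0x80..0xFF) goes to the
 * slow path, so the table is never indexed out of bounds.
 */
static inline int my_istype(int c, unsigned mask)
{
    if (c >= 0 && c < 128)
        return (ascii_tab[c] & mask) != 0;
    return slow_istype(c, mask);
}

static inline int my_isdigit(int c)
{
    return my_istype(c, CT_DIGIT);
}
```

Because the range check happens before the array access, EOF never reaches the table at all, whatever negative value it happens to have.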

The OpenBSD implementation is probably second best, though it is by far the most readable and easiest to understand. It uses simple inline function calls to avoid the issue of multiple references to the macro argument when testing to see if the value is EOF before masking it and using the masked value as the array index. At first glance I think the OpenBSD implementation also has the advantage of being 100% compatible with all the other innards of NetBSD's libc, and so it's probably the easiest one to borrow should NetBSD also decide that it's better to be safe and prevent out-of-bounds array accesses. (The current OpenBSD implementation may still return bogus values for some other negative numbers though.) Perhaps I'll slurp in their code as a starting base in my tree and see how it performs.
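A minimal sketch of that approach, written from the description above rather than copied from OpenBSD's headers (cls_tab, CT_LOWER and my_islower are hypothetical names, and the table holds only two illustrative entries):

```c
#include <stdio.h>

/* Hypothetical flag bit for "lowercase letter". */
#define CT_LOWER 0x01

/* Hypothetical 256-entry table with just two entries filled in. */
static const unsigned char cls_tab[256] = {
    ['a']  = CT_LOWER,
    [0xFF] = CT_LOWER,   /* valid lowercase letter in ISO-8859-1 */
};

/*
 * Because c is a function parameter, it is evaluated once even though
 * the body uses it twice; a macro written the same way would evaluate
 * its argument twice.  EOF is tested first, so it is never confused
 * with 0xFF; every other value is reduced to 0..255 before indexing.
 */
static inline int my_islower(int c)
{
    return (c == EOF) ? 0 : (cls_tab[(unsigned char)c] & CT_LOWER);
}
```

The sketch also shows the caveat in the parenthetical: the array access is always in bounds, but a negative value other than EOF is still folded onto some table entry. For example, my_islower('a' - 256) indexes the entry for 'a' and returns non-zero.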

                                        Greg A. Woods; Planix, Inc.
