tech-userlevel archive


Re: once again, some discussion about <ctype.h> interfaces....



OK, let's try that again.

Unfortunately it seems I fell off the tech-userlevel list quite some
time ago without noticing:  I was still occasionally seeing posts that
had been cross-posted to tech-userlevel, and didn't realize that was not
the route by which I had received them.

Thanks all, especially Joerg for being first, for the hint about
multiple evaluations.  I should have thought of that, but I was getting
tired, and I was trying to avoid having to use GCC's compound statement
expressions.

I've given in to GCC for the moment (actually I think this is a really
important feature for C, though I've long avoided using it).
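
A minimal sketch of the multiple-evaluation problem and of how a compound
statement expression avoids it, assuming GCC or Clang.  The macro and
function names here are hypothetical and are not part of the patch:

```c
#include <assert.h>

/* Hypothetical macros for illustration only; neither is from the patch. */

/* Naive form: the argument expression appears (and may be evaluated) twice. */
#define IS_DIGIT_NAIVE(c)	((c) >= '0' && (c) <= '9')

/* GCC statement-expression form: the argument is evaluated exactly once. */
#define IS_DIGIT_ONCE(c)						\
	({								\
		const __typeof__(c) c_ = (c);				\
		c_ >= '0' && c_ <= '9';					\
	})

/* Returns how far the pointer moved after one use of the naive macro. */
static int
advance_naive(void)
{
	const char buf[] = "123";
	const char *p = buf;

	(void) IS_DIGIT_NAIVE(*p++);	/* both operands of && are evaluated */
	return (int) (p - buf);
}

/* Returns how far the pointer moved after one use of the single-eval macro. */
static int
advance_once(void)
{
	const char buf[] = "123";
	const char *p = buf;

	(void) IS_DIGIT_ONCE(*p++);	/* *p++ is evaluated once, into c_ */
	return (int) (p - buf);
}

/* The naive macro consumes two characters, the other only one. */
static void
check_evaluations(void)
{
	assert(advance_naive() == 2);
	assert(advance_once() == 1);
}
```

Calling check_evaluations() from a test program should pass silently;
the side effect in the argument is what makes the difference visible.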

I've also updated the comments regarding the occasional correctness of
code which might pass signed char to these "functions".  Too bad it's
not still 1988 or so -- I'd blast the committee a few of my thoughts on
breaking existing code bases!  (I do need to look up the rationale for
changing <ctype.h> interfaces from macros to functions to see if it has
any merit at all....)

Not turning on -Wunreachable-code avoids most warnings, unless the
parameter is an unsigned char.  I haven't looked to see if that's a
"standard" (-Wall) warning or not.  I wish GCC had IRIX C's ability to
knock off known warnings with comments in the code.

Comments welcomed again!  :-)

--- ctype.h     27 Jan 2013 19:31:17 -0800      1.29
+++ ctype.h     29 Jan 2013 19:43:07 -0800      
@@ -84,6 +84,111 @@
 #endif
 __END_DECLS
 
+/*
+ * XXX the following style of implementation fixes the problem of passing
+ * a signed char (with a negative value) to the <ctype.h> macros:
+ *
+ * These do rely on the compiler supporting compound statement expressions, and
+ * typeof(), as introduced long ago by GCC, and now in front of the ISO C
+ * working group as potential extensions for ISO/IEC 9899.  Currently it seems
+ * at least each of Microsoft C, IBM XL C, and Clang/LLVM already support this
+ * syntax. See:
+ *
+ *     http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1229.pdf
+ *
+#define isdigit(c)                                                     \
+       ({                                                              \
+               const __typeof__(c) _c = (c);                           \
+               (sizeof(c) > 1 && (_c) == EOF) ? 0 :                    \
+                       ((int) ((_ctype_ + 1)[(unsigned char) (_c)] & _N)); \
+       })
+ *
+ * It is quite common for older C code, and sometimes even modern Standard C
+ * code, to pass plain "char" values to <ctype.h> interfaces, but now NetBSD
+ * warns of these uses, even if they are benign, and possibly even _correct_.
+ *
+ * Yes, _correct_.  For Standard C (from C99 all the way back to K&R), and for
+ * POSIX in the C locale, a plain char containing Standard characters should
+ * never have a negative value (compilers for EBCDIC systems with
+ * CHAR_BIT==8 _should_ default char to unsigned char!).  (Indeed in K&R these
+ * "functions" were actually described as macros which accepted a parameter
+ * which was expected to be just a plain "char".)  The only potential problems
+ * are when dealing with characters that may have come from a file, _and_ which
+ * have their high (sign) bit set, _and_ IFF compiled with -fsigned-char (and
+ * _not_ -fno-signed-char); or possibly if some poorly designed code fails to
+ * check for EOF from the likes of getc() before passing the int value on to
+ * one of these functions.
+ *
+ * The simple fix is to add a cast to (unsigned char) for their parameters.
+ * However having to modify working legacy and/or correct modern code in this
+ * way is _terribly_ annoying, and should be unnecessary.
+ *
+ * If the <ctype.h> interfaces were implemented as above then they could deal
+ * safely and properly (and as expected if compiled with -funsigned-char) with
+ * plain char parameters which may contain negative values, while still dealing
+ * cleanly with the potential of being passed an int containing an EOF (-1)
+ * value, thus conforming very well with the full requirements of the standards
+ * while not making demons fly out of anyone's nose when legacy code might
+ * process more modern data containing characters with their 8th bit set.
+ * Indeed they _always_ just do The Right Thing and avoid the adverse effects
+ * of sign extension when signed chars are used as their parameters.
+ *
+ * Note that these implementations could also allow removal of the
+ * (_ctype_+1)[] trick, but that would mean fixing up the functions that load
+ * LC_CTYPE locales, plus the isctype.c functions, as well as of course the
+ * definitions of _C_ctype_, _C_tolower, and _C_toupper.
+ *
+ * The extra test here will normally be avoided with any decent compiler due to
+ * the compile time sizeof() comparison since quite often the parameter will be
+ * a char of some sort (i.e. have a sizeof() == 1).  (I find that with my own
+ * collection of code, use of "int" with <ctype.h> interfaces is far less
+ * common than uses of plain "char".)
+ *
+ * However the comparison with EOF may elicit an undesired warning from most
+ * compilers (such as GCC's "warning: comparison is always false due to limited
+ * range of data type", or Clang's "warning: will never be executed
+ * [-Wunreachable-code]") whenever the parameter is an "unsigned char" due to
+ * the impossibility of any unsigned char's value ever matching EOF (-1, i.e.
+ * a signed int by default).  I don't know how to avoid these warnings and
+ * still achieve the desired compatibility with even the normal set of common
+ * parameter types (char, signed char, unsigned char, and int).
+ *
+ * Obviously this doesn't help the real libc <ctype.h> functions -- sign
+ * extension will still happen when a (signed char) parameter with a negative
+ * value is passed to one of them, and then they will be stuck with, at
+ * minimum, an inability to distinguish between 0xFF and -1 (EOF) and so
+ * cannot use either an (unsigned char) cast as above or a mask of 0xFF without
+ * potentially returning the wrong value when passed EOF.
+ *
+ * However, since any potential EOF value should in theory always be dealt with
+ * in such a way as to avoid ever calling any of the <ctype.h> functions with
+ * it in the first place, perhaps the libc <ctype.h> functions could/should
+ * just play it safe and effectively treat EOF (i.e. when passed as an "int")
+ * as if it were 0xFF?  I'm not even certain this would contravene any
+ * standards as I can't find any firm definition of what these "functions"
+ * should return if passed EOF.  Really, how often does EOF actually get passed
+ * to these functions?  I'd hope it's approximately never, and that those are
+ * the only places there are any real bugs in legacy code that need fixing!
+ * (Though they are also likely harder bugs to fix!)  The only danger I see is
+ * that programs using toupper() or tolower() _and_ which might pass the EOF
+ * (as an int) to those functions, might return a valid character instead of
+ * EOF.
+ *
+ * Perhaps compilers would be smart to generate a warning whenever a narrower
+ * signed parameter will be sign extended (if negative) to widen it to match
+ * the prototype (or the default parameter conversions).  I don't off-hand
+ * recall many other APIs where widening from a smaller signed value to a
+ * larger signed value is the norm, except in similar cases where sign
+ * extension inevitably causes confusion and/or undefined behaviour as it can
+ * here.
+ *
+ * Perhaps it would also be wise to set NetBSD's compiler to default to
+ * treating un-qualified "char" types as "unsigned char" as well.  Having
+ * "char" be always unsigned would perhaps help "hide" (in a very good way!)
+ * bugs with legacy code having to deal with more 8-bit (e.g. Latin-1) input
+ * than it was originally written to deal with.
+ */
+/* XXX the masks here should be macros that can be shared with src/lib/libc/gen/isctype.c */
 #define        isdigit(c)      ((int)((_ctype_ + 1)[(c)] & _N))
 #define        islower(c)      ((int)((_ctype_ + 1)[(c)] & _L))
 #define        isspace(c)      ((int)((_ctype_ + 1)[(c)] & _S))
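
The approach described in the comment above can be exercised outside
libc with a stand-in table.  A minimal sketch, assuming GCC or Clang;
my_ctype, MY_N, my_ctype_init(), my_isdigit() and my_isdigit_selfcheck()
are hypothetical names and are not part of the patch or of NetBSD:

```c
#include <assert.h>
#include <limits.h>	/* UCHAR_MAX */
#include <stdio.h>	/* EOF */

/* Hypothetical stand-ins for NetBSD's _ctype_ table and _N mask. */
#define MY_N	0x04
static unsigned char my_ctype[1 + UCHAR_MAX + 1];	/* slot 0 is for EOF */

/* Mark only the ASCII digits; every other slot stays zero. */
static void
my_ctype_init(void)
{
	int ch;

	for (ch = '0'; ch <= '9'; ch++)
		my_ctype[1 + ch] = MY_N;
}

/*
 * Same shape as the macro proposed in the patch comment: evaluate the
 * argument once, test for EOF only when the argument is wider than a
 * char, and index the table through an (unsigned char) cast.
 */
#define my_isdigit(c)							\
	({								\
		const __typeof__(c) c_ = (c);				\
		(sizeof(c) > 1 && c_ == EOF) ? 0 :			\
		    ((int) ((my_ctype + 1)[(unsigned char) c_] & MY_N)); \
	})

/* Returns 1 if every common parameter type behaves as expected. */
static int
my_isdigit_selfcheck(void)
{
	char pc = '5';
	signed char neg = (signed char) -23;	/* 0xE9 on two's complement */
	unsigned char uc = '7';
	int eof = EOF;

	my_ctype_init();
	return my_isdigit(pc) != 0 &&
	    my_isdigit(uc) != 0 &&
	    my_isdigit('3') != 0 &&	/* a character constant is an int */
	    my_isdigit(neg) == 0 &&	/* no negative index into the table */
	    my_isdigit(eof) == 0;
}
```

The negative signed char is looked up as 0xE9 rather than indexing before
the start of the table, and an int holding EOF still yields 0.  Whether
the unsigned char case draws a warning depends on the compiler and the
flags in use.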

-- 
                                                Greg A. Woods
                                                Planix, Inc.

<woods%planix.ca@localhost>        +1 250 762-7675        http://www.planix.ca/



