Subject: Re: lib/10010: toupper mangles non-uppercase characters
To: None <arp@tac.eu.org>
From: Greg A. Woods <woods@weird.com>
List: netbsd-bugs
Date: 04/29/2000 14:38:44
[ On Saturday, April 29, 2000 at 00:27:49 (-0400), Adam R. Prato wrote: ]
> Subject: lib/10010: toupper mangles non-uppercase characters
>
> >Description:
>  	toupper() returns changed values for short ints other than uppercase
> characters. For example: x8d -> x0d, x83 -> ^M, xa0-xbf -> various characters

Hmmm.... once upon a time, in BSD, the correct usage was only ever:

	if (islower(c))
		c = toupper(c);

and conversely:

	if (isupper(c))
		c = tolower(c);

If I remember correctly the history of this bug stems from the rather
unspecific description of these "functions" in the original edition of
K&R.  There's no mention of them in the V7 manuals.  So far as I know
these things were always macros up until about the time AT&T System III
was released (I seem to remember changes in Xenix-III), when they were
renamed with an '_' prefix (_toupper()) and the proper functions,
without a restricted input domain, were introduced.  Looking at my
SysVr2 manuals I see that they are quite explicit about returning the
character unchanged if there's no valid conversion, and that the _to*()
macros would not do this.

Someone was apparently premature in stating the standards conformance
for toupper(3) [and tolower(3)] as far back as 4.3net2 -- and looking at
4.2's ctype.h the "bug" is obviously there.

However I don't understand why you're seeing problems in NetBSD, 1.4.2
especially.  This was all fixed in NetBSD back in 1993 (from ctype.h,v):

    revision 1.4
    date: 1993/08/06 22:05:29;  author: jtc;  state: Exp;  lines: +1 -1
    Rename tolower & toupper macros to _tolower and _toupper.
    Standard C requires tolower to return a character that is !isupper unchanged
    which was not being done with the macro.  The function version does the
    right thing, so the loss of the macro is no great deal.
    
    I didn't eliminate the macros entirely, since X/Open's XPG3 requires _tolower
    and _toupper with the same semantics.  But, like isascii/toascii, they are
    removed from the namespace if either ANSI_SOURCE or _POSIX_SOURCE is defined.

A wee bit later the functions were changed back to (faster) macros using
new lookup tables:

    revision 1.6
    date: 1993/08/06 23:19:51;  author: jtc;  state: Exp;  lines: +1 -1
    Declare translation tables for toupper and tolower.  To be replaced by
    pointers to the tables to the current locale.
    Reintroduce toupper and tolower macros that use the translation tables.

If you look in /usr/include/ctype.h you should find the declarations of
these macros:

	#define tolower(c)      ((int)((_tolower_tab_ + 1)[(int)(c)]))
	#define toupper(c)      ((int)((_toupper_tab_ + 1)[(int)(c)]))

and these original style ones:

	#define _tolower(c)     ((c) - 'A' + 'a')
	#define _toupper(c)     ((c) - 'a' + 'A')

If your code is really using the table-driven macro (i.e. the one that
gets its return value from _toupper_tab_) then it should work fine.  Check
that your code is using these macros by looking at the output of 'cc -E'
and searching for the place where you call toupper().  Are you doing
anything with a different locale?
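
To make the table-driven form concrete, here is a rough sketch of how
such a lookup behaves.  Everything named demo_* or DEMO_* is invented
for illustration; it is not NetBSD's actual table or macro, and it
assumes ASCII and EOF == -1.  The "+ 1" is there so that EOF lands on
slot 0 of the table.  Note that Standard C only defines toupper() for
EOF and for values representable as unsigned char, so a plain (signed)
char holding 0x8d would index in front of the table.  Whether that is
what's happening in your case I can't tell from the report; it's only
one possibility worth checking.

```c
#include <stdio.h>	/* EOF */

/*
 * Hypothetical 257-entry table: slot 0 is for EOF (assumed to be -1),
 * slots 1..256 are for the unsigned char values 0..255.
 */
static short demo_toupper_tab[1 + 256];

/* Fill the table for plain ASCII; every other value maps to itself. */
static void demo_init(void)
{
	int i;

	demo_toupper_tab[0] = EOF;
	for (i = 0; i < 256; i++)
		demo_toupper_tab[i + 1] =
		    (i >= 'a' && i <= 'z') ? i - 'a' + 'A' : i;
}

/* Same shape as the header macro: the "+ 1" makes c == -1 hit slot 0. */
#define DEMO_TOUPPER(c)	((int)((demo_toupper_tab + 1)[(int)(c)]))

/* For plain char data, convert to unsigned char before the lookup. */
static int demo_toupper_char(char ch)
{
	return DEMO_TOUPPER((unsigned char)ch);
}
```

Call demo_init() once before using the macro.  With that done, a value
such as 0x8d comes back unchanged whether it is passed as an int or
through demo_toupper_char().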

In the meantime I'd recommend following Harbison and Steele's advice
for code that has to be portable and always use a wrapper *function*:

	#include <ctype.h>
	int safe_toupper(c)
		int	c;
	{
		if (islower(c))
			return toupper(c);
		else
			return c;
	}
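
As a small illustration of why the guard matters, here is a sketch
using the same arithmetic as the old-style _toupper() macro shown
above.  The names OLD_TOUPPER and guarded_toupper are made up for this
example, and it assumes ASCII and the default "C" locale.

```c
#include <ctype.h>

/* Same arithmetic as the old-style _toupper(); hypothetical name. */
#define OLD_TOUPPER(c)	((c) - 'a' + 'A')

/* Guarded version: only convert what islower() accepts. */
static int guarded_toupper(int c)
{
	if (islower(c))
		return OLD_TOUPPER(c);
	return c;
}
```

In ASCII, OLD_TOUPPER('A') is 65 - 97 + 65 = 33, i.e. '!', which is
exactly the kind of mangling the bare macro produces on input that
isn't a lower-case letter.  guarded_toupper('A') just returns 'A'.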

-- 
							Greg A. Woods

+1 416 218-0098      VE3TCP      <gwoods@acm.org>      <robohack!woods>
Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>