Subject: tales of woe from the recent tools/compat_defs.h pullups on netbsd-1-6
To: NetBSD Toolchain Technical Discussion List <tech-toolchain@NetBSD.ORG>
From: Greg A. Woods <woods@weird.com>
List: port-i386
Date: 06/21/2004 18:34:17
So, I just spent a good 6 hours or so trying to debug what turned out to
be a missing prototype for strtouq() in the host tools environment.

I'll document my experience just in case it proves helpful to anyone
facing similar problems someday.

This all starts out because in my local source tree I have some changes
in pw_scan() that allow me to have larger IDs.

Sometime in the last week or so something changed such that now the
tools version of nbpwd_mkdb fails when handling the following
master.passwd entry, but oddly the new system pwd_mkdb, built in the
same build environment, still works just fine.

nfsanon:*:4294967294:4294967294::0:0:NFS anonymous user:/nonexistant:/sbin/nologin

I.e. I now have UID_MAX and GID_MAX set as follows in <sys/syslimits.h>:

#define	GID_MAX         (~((uid_t)0))	/* max value for a gid_t (2^32) */
#define	UID_MAX         (~((uid_t)0))	/* max value for a uid_t (2^32) */

It seemed the best way to implement this while retaining full error
detection for bad values would be to use the next larger sized unsigned
integer when parsing the input numbers (since strtoul() can't, or at
least our version doesn't, do reliable overflow detection when the
maximum value expected is the same as ULONG_MAX) and so my pw_scan() now
does this:

  ...
	u_quad_t id;
  ...
	id = strtouq(p, &ep, 10);
  ...
	if (id > (u_quad_t) UID_MAX) {
		if (!(inflags & _PASSWORD_NOWARN))
			warnx("uid too big, '%s' > %llu", p, (u_quad_t) UID_MAX);

The result for pwd_mkdb is normal but nbpwd_mkdb does as follows

	$ ./nbpwd_mkdb -p -L -d ~/tmp ~/tmp/etc/master.passwd                                                     
	pwd_mkdb: uid too big, '4294967294' > 4294967295
	pwd_mkdb: at line #53
	pwd_mkdb: /home/proven/woods/tmp/etc/master.passwd: Inappropriate file type or format

Now the assertion in the first message looks a little confusing/wrong,
but that's because the first number is actually just the string that was
passed to strtouq(), but the second number is the conversion of UID_MAX.

So, "What the heck happened?" I asked myself.  This all worked yesterday
on the Alpha, and last week on this same i386 machine!

I haven't made any source changes in anything I would expect to be
related -- I've only run "cvs update" from the netbsd-1-6 branch, and
the only change I can see that's been made since the last successful
build are the changes to tools/compat/compat_defs.h, and as it turns out
this is indeed the cause.

I walked through the execution of both nbpwd_mkdb and pwd_mkdb with gdb
and was surprised to see that in the former (only) the return value of
strtouq() changed magically from the expected '4294967294' just before
the 'ret' to a wildly wrong value that was assigned to the "id" variable
and this seems to be because the top four bytes of the value are wrong

	(gdb) print id
	$1 = 13816982054056755198
	(gdb) print /x id
	$2 = 0xbfbfc7f8fffffffe

At first I didn't think this looked like a sign extension bug.

Initially I had been trying to use '-march=pentiumpro' in both
HOST_CFLAGS and CFLAGS and it took me some time to realize that wasn't
the problem.

Then I checked the post-processed code, which is identical except for
the the "__nbcompat_" prefix on the function name, including the
typedefs for "u_quad_t" which work back to "unsigned long long int" in
both compilation envrionments as expected.

The only obvious difference was of course is which compiler was used.
However the weird thing is that the host compiler was built by a
previous revision of the same source tree over a year ago and it
obviously built a working pwd_mkdb back then, and it also worked for
full i386-i386 and i386-sparc builds a week or so ago, so unless bit-rot
had set in behind my back, nothing had changed recently in the host
compiler.  I.e. I would have been less surprised if the new "tools"
compiler generated the buggy code, but that doesn't seem to be what
happened because if I run the latest build of pwd_mkdb from
/usr/src/usr.sbin/pwd_mkdb/obj.i386 it works just fine (as the
disassembly above of the strtouq() call from pw_scan() would suggest it
should).

Now my i386 assembly language skills were never as good as my pdp-11 and
6502 and even vax assembly skills, but I soon discovered what seemed to
be a bad botch in the code generated in the broken version of the call:

0x8049ac3 <__nbcompat_pw_scan+159>:     add    $0xfffffffc,%esp
0x8049ac6 <__nbcompat_pw_scan+162>:     push   $0xa
0x8049ac8 <__nbcompat_pw_scan+164>:     lea    0xfffffffc(%ebp),%ebx
0x8049acb <__nbcompat_pw_scan+167>:     push   %ebx
0x8049acc <__nbcompat_pw_scan+168>:     push   %esi
0x8049acd <__nbcompat_pw_scan+169>:     call   0x8053b3c <strtouq>
0x8049ad2 <__nbcompat_pw_scan+174>:     mov    %eax,%edx
0x8049ad4 <__nbcompat_pw_scan+176>:     mov    %eax,%ecx
0x8049ad6 <__nbcompat_pw_scan+178>:     sar    $0x1f,%ecx

I.e. what the heck is that 'sar' instruction doing in there!?!?!?  Ah
ha!  It _is_ a sign extension bug!

The good version in the new libc.a is of course just:

0x80499a3 <pw_scan+159>:        add    $0xfffffffc,%esp
0x80499a6 <pw_scan+162>:        push   $0xa
0x80499a8 <pw_scan+164>:        lea    0xfffffffc(%ebp),%ebx
0x80499ab <pw_scan+167>:        push   %ebx
0x80499ac <pw_scan+168>:        push   %esi
0x80499ad <pw_scan+169>:        call   0x804a284 <_strtouq>
0x80499b2 <pw_scan+174>:        mov    %edx,%ecx
0x80499b4 <pw_scan+176>:        mov    %eax,%edx

Finally with this knowledge I went back to the "cvs update" and using
CHANGES-1.6.3 as a guide I soon zeroed in on the tools/compat_defs.h
pullup and sure enough I soon realized that with that change the host
tools environment suddenly had no prototype for strtouq(), etc. and thus
the silent sign extension occurs.

Don't you think it would be a good idea to add some compiler warnings to
the host tools build (at least when the compiler is GCC)?

	-Wall -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Wno-uninitialized  -Werror

This would have saved me a lot of time and headache because it would
have pointed out immediately (or at least after the tools/compat_defs.h
pullup was merged into my tree), that I had added a call to a
non-standard function in code that was used in the host tools
environment.

For my immediate purposes my fix is simply to add a prototype for
strtouq() to pw_scan.c since I only do builds on NetBSD iteslf.  I'll
back out the compat_defs.h changes just to be sure too though.....

I suppose that if I wanted to keep the host tools portable I would need
to support an optional strtouq() function to the nbcompat library.....

-- 
						Greg A. Woods

+1 416 218-0098                  VE3TCP            RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>