Subject: Re: tales of woe from the recent tools/compat_defs.h pullups on netbsd-1-6
To: Greg A. Woods <woods@weird.com>
From: James Chacon <jmc@NetBSD.org>
List: tech-toolchain
Date: 06/21/2004 20:49:20
On Mon, Jun 21, 2004 at 06:34:17PM -0400, Greg A. Woods wrote:
> So, I just spent a good 6 hours or so trying to debug what turned out to
> be a missing prototype for strtouq() in the host tools environment.

Try pulling rev 1.40 of tools/compat/compat_defs.h into your tree.

I'm testing it on the 1.6 branch now and will get it pulled up provided nothing
breaks.

James


> I'll document my experience just in case it proves helpful to anyone
> facing similar problems someday.
> 
> This all starts out because in my local source tree I have some changes
> in pw_scan() that allow me to have larger IDs.
> 
> Sometime in the last week or so something changed such that now the
> tools version of nbpwd_mkdb fails when handling the following
> master.passwd entry, but oddly the new system pwd_mkdb, built in the
> same build environment, still works just fine.
> 
> nfsanon:*:4294967294:4294967294::0:0:NFS anonymous user:/nonexistant:/sbin/nologin
> 
> I.e. I now have UID_MAX and GID_MAX set as follows in <sys/syslimits.h>:
> 
> #define	GID_MAX         (~((uid_t)0))	/* max value for a gid_t (2^32) */
> #define	UID_MAX         (~((uid_t)0))	/* max value for a uid_t (2^32) */
> 
> It seemed the best way to implement this while retaining full error
> detection for bad values would be to use the next larger sized unsigned
> integer when parsing the input numbers (since strtoul() can't, or at
> least our version doesn't, do reliable overflow detection when the
> maximum value expected is the same as ULONG_MAX) and so my pw_scan() now
> does this:
> 
>   ...
> 	u_quad_t id;
>   ...
> 	id = strtouq(p, &ep, 10);
>   ...
> 	if (id > (u_quad_t) UID_MAX) {
> 		if (!(inflags & _PASSWORD_NOWARN))
> 			warnx("uid too big, '%s' > %llu", p, (u_quad_t) UID_MAX);
> 
> The result for pwd_mkdb is normal but nbpwd_mkdb does as follows
> 
> 	$ ./nbpwd_mkdb -p -L -d ~/tmp ~/tmp/etc/master.passwd                                                     
> 	pwd_mkdb: uid too big, '4294967294' > 4294967295
> 	pwd_mkdb: at line #53
> 	pwd_mkdb: /home/proven/woods/tmp/etc/master.passwd: Inappropriate file type or format
> 
> Now the assertion in the first message looks a little confusing/wrong,
> but that's because the first number is actually just the string that was
> passed to strtouq(), but the second number is the conversion of UID_MAX.
> 
> So, "What the heck happened?" I asked myself.  This all worked yesterday
> on the Alpha, and last week on this same i386 machine!
> 
> I haven't made any source changes in anything I would expect to be
> related -- I've only run "cvs update" from the netbsd-1-6 branch, and
> the only change I can see that's been made since the last successful
> build are the changes to tools/compat/compat_defs.h, and as it turns out
> this is indeed the cause.
> 
> I walked through the execution of both nbpwd_mkdb and pwd_mkdb with gdb
> and was surprised to see that in the former (only) the return value of
> strtouq() changed magically from the expected '4294967294' just before
> the 'ret' to a wildly wrong value that was assigned to the "id" variable
> and this seems to be because the top four bytes of the value are wrong
> 
> 	(gdb) print id
> 	$1 = 13816982054056755198
> 	(gdb) print /x id
> 	$2 = 0xbfbfc7f8fffffffe
> 
> At first I didn't think this looked like a sign extension bug.
> 
> Initially I had been trying to use '-march=pentiumpro' in both
> HOST_CFLAGS and CFLAGS and it took me some time to realize that wasn't
> the problem.
> 
> Then I checked the post-processed code, which is identical except for
> the the "__nbcompat_" prefix on the function name, including the
> typedefs for "u_quad_t" which work back to "unsigned long long int" in
> both compilation envrionments as expected.
> 
> The only obvious difference was of course is which compiler was used.
> However the weird thing is that the host compiler was built by a
> previous revision of the same source tree over a year ago and it
> obviously built a working pwd_mkdb back then, and it also worked for
> full i386-i386 and i386-sparc builds a week or so ago, so unless bit-rot
> had set in behind my back, nothing had changed recently in the host
> compiler.  I.e. I would have been less surprised if the new "tools"
> compiler generated the buggy code, but that doesn't seem to be what
> happened because if I run the latest build of pwd_mkdb from
> /usr/src/usr.sbin/pwd_mkdb/obj.i386 it works just fine (as the
> disassembly above of the strtouq() call from pw_scan() would suggest it
> should).
> 
> Now my i386 assembly language skills were never as good as my pdp-11 and
> 6502 and even vax assembly skills, but I soon discovered what seemed to
> be a bad botch in the code generated in the broken version of the call:
> 
> 0x8049ac3 <__nbcompat_pw_scan+159>:     add    $0xfffffffc,%esp
> 0x8049ac6 <__nbcompat_pw_scan+162>:     push   $0xa
> 0x8049ac8 <__nbcompat_pw_scan+164>:     lea    0xfffffffc(%ebp),%ebx
> 0x8049acb <__nbcompat_pw_scan+167>:     push   %ebx
> 0x8049acc <__nbcompat_pw_scan+168>:     push   %esi
> 0x8049acd <__nbcompat_pw_scan+169>:     call   0x8053b3c <strtouq>
> 0x8049ad2 <__nbcompat_pw_scan+174>:     mov    %eax,%edx
> 0x8049ad4 <__nbcompat_pw_scan+176>:     mov    %eax,%ecx
> 0x8049ad6 <__nbcompat_pw_scan+178>:     sar    $0x1f,%ecx
> 
> I.e. what the heck is that 'sar' instruction doing in there!?!?!?  Ah
> ha!  It _is_ a sign extension bug!
> 
> The good version in the new libc.a is of course just:
> 
> 0x80499a3 <pw_scan+159>:        add    $0xfffffffc,%esp
> 0x80499a6 <pw_scan+162>:        push   $0xa
> 0x80499a8 <pw_scan+164>:        lea    0xfffffffc(%ebp),%ebx
> 0x80499ab <pw_scan+167>:        push   %ebx
> 0x80499ac <pw_scan+168>:        push   %esi
> 0x80499ad <pw_scan+169>:        call   0x804a284 <_strtouq>
> 0x80499b2 <pw_scan+174>:        mov    %edx,%ecx
> 0x80499b4 <pw_scan+176>:        mov    %eax,%edx
> 
> Finally with this knowledge I went back to the "cvs update" and using
> CHANGES-1.6.3 as a guide I soon zeroed in on the tools/compat_defs.h
> pullup and sure enough I soon realized that with that change the host
> tools environment suddenly had no prototype for strtouq(), etc. and thus
> the silent sign extension occurs.
> 
> Don't you think it would be a good idea to add some compiler warnings to
> the host tools build (at least when the compiler is GCC)?
> 
> 	-Wall -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Wno-uninitialized  -Werror
> 
> This would have saved me a lot of time and headache because it would
> have pointed out immediately (or at least after the tools/compat_defs.h
> pullup was merged into my tree), that I had added a call to a
> non-standard function in code that was used in the host tools
> environment.
> 
> For my immediate purposes my fix is simply to add a prototype for
> strtouq() to pw_scan.c since I only do builds on NetBSD iteslf.  I'll
> back out the compat_defs.h changes just to be sure too though.....
> 
> I suppose that if I wanted to keep the host tools portable I would need
> to support an optional strtouq() function to the nbcompat library.....
> 
> -- 
> 						Greg A. Woods
> 
> +1 416 218-0098                  VE3TCP            RoboHack <woods@robohack.ca>
> Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>
> 
>