Subject: more about SIGSEGV in bonnie on Alpha: is stdarg buggy for doubles?
To: NetBSD Toolchain Technical Discussion List <tech-toolchain@NetBSD.ORG>
From: Greg A. Woods <woods@weird.com>
List: tech-toolchain
Date: 10/23/2004 16:21:05
As I've reported sporadically for almost a year now I've been worrying
away at a bug that I thought was in benchmarks/bonnie or in GCC.  Bonnie
running on my Alpha encounters a SIGSEGV when reporting its final
results, specifically when trying to format a double float value with
"%4.1f".  Over the last couple of days I've delved ever deeper with
testing different versions of GCC and with attempts to reproduce the bug
in a more trivial test program.

However now I've come to a brick wall as none of my recent observations
make very much sense to me.  The best I can tell at this point is that
there's either some kind of problem on the Alpha with stdarg/vararg
handling for double floats, or some kind of kernel bug affecting the
process context for floating point registers, or both.

The following more trivial program reproduces the exact same assembler
code for the report() function (at least with '-O0') as I have in my now
hacked up bonnie.c.  However this test program does _not_ crash while
my bonnie still does.  :-(

#include <stdio.h>

static double delta[6][2];

int main(int, char **);
static void report(char *, off_t);

int
main(
	int	argc,
	char	*argv[]
    )
{
	off_t size = 20480000;
	
	delta[0][0] = 100000.1;
	delta[0][1] = 2.2;

	report(argv[0], size);

	exit(0);
	/* NOTREACHED */
}

static void
report(
	char *machine,
	off_t size
      )
{
	static double dtmpta = 0.0;
	static double dtmptb = 0.0;
	static double dtmp = 0.0;

	printf("%-8.8s size %4u\n", machine, (unsigned int) (size / (1024 * 1024)));

	printf("an int = %5d\n\n", (int) (((double) size) / (delta[0][1] * 1024)));

	dtmpta = delta[0][0];
	dtmptb = delta[0][1];
	dtmptb *= 100.0;
	dtmp = dtmpta / dtmptb;

	printf("dtmpta = %4.1f\n", dtmpta);	/* XXX bonnie crashes in this call */
	printf("dtmptb = %4.1f\n", dtmptb);
	printf("dtmp = %4.1f\n", dtmp);
}



Here's an exmple of the crash in bonnie as seen by gdb.  Note how the
stack seems trashed in all but frames #5 and #6.  Previously I had
wasted a lot of time looking at the trashed frames and trying to detect
when they got that way but I could never get anywhere with it nor make
any sense of it from the source-code level, which made me suspect the
compiler.

Program received signal SIGSEGV, Segmentation fault.
Balloc (k=0) at /proven/work/woods/m-NetBSD-1.6/lib/libc/stdlib/strtod.c:418
418                     rv->k = k;
(gdb) where
#0  Balloc (k=0)
    at /proven/work/woods/m-NetBSD-1.6/lib/libc/stdlib/strtod.c:418
#1  0x12000d2d8 in __dtoa (_d=-0.73139494621162238, mode=3, ndigits=1, 
    decpt=0x1ffff96f8, sign=0x4, rve=0x1ffff9418)
    at /proven/work/woods/m-NetBSD-1.6/lib/libc/stdlib/strtod.c:2205
#2  0x120009bc4 in cvt (value=0.18560199999999993, ndigits=1, flags=2048, 
    sign=0x1ffff96f4 "", decpt=0x1ffff96f8, ch=102, length=0x1ffff96fc)
    at /proven/work/woods/m-NetBSD-1.6/lib/libc/stdio/vfprintf.c:826
#3  0x120008728 in vfprintf (fp=0x120024950, 
    fmt0=0x1 <Error reading address 0x1: Invalid argument>, ap={
      __base = 0x1ffff97e0 "\201+\001 \001", __offset = 16, __pad = 0})
    at /proven/work/woods/m-NetBSD-1.6/lib/libc/stdio/vfprintf.c:512
#4  0x120003a8c in printf (fmt=0x1ffff9310 "Ëÿÿÿ3")
    at /proven/work/woods/m-NetBSD-1.6/lib/libc/stdio/printf.c:75
#5  0x120002074 in report (machine=0x1fffff08c "gdb", size=1048576)
    at bonnie.c:452
#6  0x120001b40 in main (argc=5, argv=0x1ffffed38) at bonnie.c:389
(gdb) up
#1  0x12000d2d8 in __dtoa (_d=-0.73139494621162238, mode=3, ndigits=1, 
    decpt=0x1ffff96f8, sign=0x4, rve=0x1ffff9418)
    at /proven/work/woods/m-NetBSD-1.6/lib/libc/stdlib/strtod.c:2205
2205            result = Balloc(result_k);
(gdb) up
#2  0x120009bc4 in cvt (value=0.18560199999999993, ndigits=1, flags=2048, 
    sign=0x1ffff96f4 "", decpt=0x1ffff96f8, ch=102, length=0x1ffff96fc)
    at /proven/work/woods/m-NetBSD-1.6/lib/libc/stdio/vfprintf.c:826
826             digits = __dtoa(value, mode, ndigits, decpt, &dsgn, &rve);
(gdb) up
#3  0x120008728 in vfprintf (fp=0x120024950, 
    fmt0=0x1 <Error reading address 0x1: Invalid argument>, ap={
      __base = 0x1ffff97e0 "\201+\001 \001", __offset = 16, __pad = 0})
    at /proven/work/woods/m-NetBSD-1.6/lib/libc/stdio/vfprintf.c:512
512                             cp = cvt(_double, prec, flags, &softsign,
(gdb) up
#4  0x120003a8c in printf (fmt=0x1ffff9310 "Ëÿÿÿ3")
    at /proven/work/woods/m-NetBSD-1.6/lib/libc/stdio/printf.c:75
75              ret = vfprintf(stdout, fmt, ap);
(gdb) up
#5  0x120002074 in report (machine=0x1fffff08c "gdb", size=1048576)
    at bonnie.c:452
452       printf("%4.1f\n", dtmpta);
(gdb) list
447       dtmpta = delta[(int) Putc][CPU];
448       dtmptb = delta[(int) Putc][Elapsed];
449       dtmptb *= 100.0;
450       dtmp = dtmpta / dtmptb;
451     
452       printf("%4.1f\n", dtmpta);
453       printf("%4.1f\n", dtmptb);
454       printf("%4.1f\n", dtmp);
455     
456     
(gdb) info locals
dtmpta = 0.18560199999999993
dtmptb = 17.743300000000001
dtmp = 0.010460399136575491
(gdb) 



I was getting different assembler code for the report() function with
'-O2' and I thought at first that might be the cause, but with '-O0' the
assembler code is as identical as one would expect -- at least as far as
I can tell without learning a lot more about Alpha assembler and gcc
generated code -- and bonnie still crashes while the test program of
course still works fine.

If you compare the original bonnie.c code with what you can see above
via gdb you'll see I also tried to simplify the expressions as much as
possible by using extra temporary variables (which I also made static to
avoid stack-frame issues and to help the debugger see them).

Now here's the weird part.  If the two 2-dimensional array references
are commented out (i.e. lines 447 and 448 in the gdb list display above)
then the printf()s work just fine, even with '-O2'.  (showing "nan" for
dtmp of course because of the division by zero)

I.e. it is not a libc bug -- at least not obviously as recompiling just
the one module with the array references changes the behaviour entirely.
It does indeed seem to be a compiler bug, and one that's in the
traditionally difficult areas of complex pointer or array referencing.

Note the same crash happens with either the 1.6.x compiler (2.95.3) or
with GCC-3.3.3 compiled from pkgsrc (keeping in mind that libc was still
compiled with the 1.6.x build toolchain and I've only compiled bonnie.c
itself with the newer compiler).  GCC-3.3.3 also gives the same
"success" result when the array references are commented out, and again
the report() function in both the trivial test program and in bonnie
appear to have the exact same instruction sequences in the assembler
code.

The difference in the generated code must be much more subtle than the
instruction sequences -- perhaps due to my inexperience with Alpha
assembler I'm missing some difference in offsets or side effects with
registers or similar.

Using GCC-2.95.3 the _only_ difference in the generated code is that the
array references and assignments add these instructions:

in bonnie.c:

$ diff bonnie.s bonnie.s-all
1797a1798,1805
>       lda $1,dtmpta.12
>       lda $2,delta
>       ldt $f10,0($2)
>       stt $f10,0($1)
>       lda $1,dtmptb.13
>       lda $2,delta
>       ldt $f10,8($2)
>       stt $f10,0($1)

in tfloat.c:

$ diff tfloat.s tfloat.s-all
145a146,153
>       lda $1,dtmpta.9
>       lda $2,delta
>       ldt $f10,0($2)
>       stt $f10,0($1)
>       lda $1,dtmptb.10
>       lda $2,delta
>       ldt $f10,8($2)
>       stt $f10,0($1)


Similarly with the GCC-3.3.3 case the only difference in the generated
code is that the array references and assignments add these
instructions:

in bonnie.c (with -g):

$ diff bonnie.s bonnie.s-all
2075a2076,2086
>       .loc 1 446 0
>       ldah $2,dtmpta.0($29)           !gprelhigh
>       ldah $1,delta($29)              !gprelhigh
>       ldt $f10,delta($1)              !gprellow
>       stt $f10,dtmpta.0($2)           !gprellow
>       .loc 1 447 0
>       ldah $2,dtmptb.1($29)           !gprelhigh
>       ldah $1,delta($29)              !gprelhigh
>       lda $1,delta($1)                !gprellow
>       ldt $f10,8($1)
>       stt $f10,dtmptb.1($2)           !gprellow

in tfloat.c (without -g):

$ diff tfloat.s tfloat.s-all
166a167,175
>       ldah $2,dtmpta.0($29)           !gprelhigh
>       ldah $1,delta($29)              !gprelhigh
>       ldt $f10,delta($1)              !gprellow
>       stt $f10,dtmpta.0($2)           !gprellow
>       ldah $2,dtmptb.1($29)           !gprelhigh
>       ldah $1,delta($29)              !gprelhigh
>       lda $1,delta($1)                !gprellow
>       ldt $f10,8($1)
>       stt $f10,dtmptb.1($2)           !gprellow


As you can see the added instruction sequences are identical in both
programs and though each compiler produces somewhat different code the
relationship is obvious, so from what I understand the effect they have
only in bonnie.c must somehow be related to some other code elsewhere
outside the report() function -- some register is being overwritten or
something like that.

Perhaps these instructions somehow bugger up the way printf() pulls
"double float" parameters off the stack, but only if some other
difference I haven't spotted yet between bonnie.c and my test program
gets involved as well.

That might explain why code compiled by either GCC 2.95.3 or 3.3.3, with
either -O0 or with -O2, all suffers the same crash, but only in bonnie.c
(i.e. even when the local function involved is identical and has
identical generated code) -- some rarely used floating-point only
assumption about register state in the stdarg/vararg handling, in
another module entirely, combined with some other side-effect elsewhere
in bonnie.c, is violated by these "normal" load and store operations.
Maybe it's even a kernel bug related to floating point register
handling, though somehow that seems less likely to me.

I thought I could reproduce enough of the complexity in the test program
to trigger the same bug in a more trivial setting, but given my initial
and perhaps flawed assumption that it was a compiler bug the trigger may
be far more subtle than I first suspected.  I really wish I could find
that trigger code in bonnie.c so that my test program would also crash
and so that I could see the actual difference causing the crash, but
hopefully I've given enough information here that someone more
knowledgeable about Alpha assembler can find and help fix the actual bug
itself even without reproducing the crash in another test program.

Note also the error I'm getting is occuring on a multi-processor AS4000,
though if I remember correctly it also happened with a non-SMP kernel.
I will try to remember to boot a non-SMP kernel and test again, but I
won't be albe to do that for a few days.  Note though the crash is 100%
always reproducible when running the SMP kernel regardless of load or
other activity, so I wouldn't expect any difference in behaviour with a
non-SMP kernel, despite the more complex issues w.r.t. handling floating
point registers.

-- 
						Greg A. Woods

+1 416 218-0098                  VE3TCP            RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>