netbsd-bugs: port-i386/21255: i386 random stack alignment causes wildly inconsistent double-precision performance

Subject: port-i386/21255: i386 random stack alignment causes wildly inconsistent double-precision performance
To: None <gnats-bugs@gnats.netbsd.org>
From: None <jbernard@mines.edu>
List: netbsd-bugs
Date: 04/21/2003 17:41:01
>Number:         21255
>Category:       port-i386
>Synopsis:       i386 random stack alignment causes wildly inconsistent double-precision performance
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    port-i386-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Apr 21 23:42:01 UTC 2003
>Closed-Date:
>Last-Modified:
>Originator:     Jim Bernard
>Release:        NetBSD 1.6Q
>Organization:
>Environment:
System: NetBSD nool 1.6Q NetBSD 1.6Q (NOOL-$Revision: 1.35 $) #0: Sat Mar 22 17:59:26 MST 2003 jim@roc:/wd1/var/tmp/compile/sys/arch/i386/compile/NOOL i386
Architecture: i386
Machine: i386
>Description:
	The performance of programs that use double-precision floating-point
	variables heavily can vary dramatically, depending on accidents of
	alignment of stack-resident variables.  I'll include a small
	demonstration program below that illustrates the effect.  With that
	program, I observed a 55% increase in execution time when the stack
	misaligns the double-precision array, relative to the time when it
	is aligned on an 8-byte boundary.  This is similar to the slowdown
	I've found in a production code written in Fortran.  (FWIW, this same
	test code exhibits stable, 8-byte alignment of doubles on a Linux/i386
	system.  I haven't tried it on other BSDs, etc.)

	There appear to be two places where the alignment problems occur.
	The first is in the execve system call (sys_execve in
	sys/kern/kern_exec.c).  There the argv array and the environment
	array are copied to the stack, and the stack size is rounded up
	to the next 4-byte boundary, using the ALIGN and ALIGNBYTES
	macros from sys/arch/i386/include/param.h.  Because the size is
	rounded only to a 4-byte boundary, the alignment of stack-resident
	double-precision variables in the program can be inefficient,
	depending on details of the argument list (e.g., the program name)
	and the environment (e.g., PWD and OLDPWD).  Variations in these
	cause the execution time to vary, seemingly randomly, between a
	fast time and a much slower time.

	Rounding the stack size in execve up to an 8-byte boundary instead
	removes the variability, but the array is then always misaligned
	and execution is always slow.  Adding a further 4 bytes to the
	stack size after that rounding yields correct alignment and fast
	execution every time.  So there is evidently another place where
	the stack size is changed by an odd multiple of 4 bytes, and that
	change seems to be constant.

	I owe many thanks to Sverre Froyen for coming up with several
	critical insights that nailed this down.  In particular, he
	pointed the finger at execve and figured out that the environment
	was causing variable execution times, as well as generating a
	fix that proved this was the key to the varying alignment.

>How-To-Repeat:
	The code below demonstrates the effect.  It's a slightly modified
	version of a program found in:

	  http://compilers.iecc.com/comparch/article/00-11-168

	originally designed to demonstrate the importance of 8-byte
	alignment for double-precision calculations.  Here the array is
	declared, rather than being allocated with malloc, so we can
	observe the effects of execve on alignment of stack-resident
	double-precision variables.  (Thanks to Sverre for the idea to
	declare the array.)

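#include <stdio.h>
#include <stdlib.h>	/* exit(), and malloc() for the disabled variant */
#define N 10000
int main (int argc, char **argv) {
	double *x;
	int i, j;

	/* Heap variant from the original article, disabled here:
	x=(double*)malloc((N+1)*sizeof(double));
	 */
	double y[N+1];		/* stack-resident array under test */
	x = y;
	/* Optional forced misalignment, disabled:
	if(argc==2) x=(double*)((int)x+4);
	 */

	/* The integer casts assume 32-bit pointers, as on i386. */
	printf("0x%x\n", (unsigned int)x);
	printf("%d\n", (int)x%8);	/* 0 means 8-byte aligned */
	for(i=0;i<N;i++) x[i]=(double)i;
	for(i=0;i<N;i++) for (j=0;j<N;j++) x[i]=0.5*(x[j]+x[i]);

	printf("%f\n", x[N-1]);
	exit(0);
}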

	To observe the effect:

	  copy the code to a file, say dbl.c
	  cc -O2 dbl.c
	  ln a.out a
	  ln a.out aa
	  ln a.out aaa
	  ln a.out aaaa
	  time ./a
	  time ./aa
	  time ./aaa
	  time ./aaaa

	You will notice that two of the executions are fast, and two
	are slow.  For those that are fast, the code will print a 0
	for x%8, indicating 8-byte alignment of x.  For those that are
	slow, it will print 4 or -4, indicating misalignment.  The variation
	is due to the different lengths of argv[0].  The numerical results
	of all of the calculations will be identical.

	You can also try changing your environment (e.g., through a
	couple of cd's) to see its effect on the execution times.
	Changes in the lengths of PWD and OLDPWD (e.g.) can change
	the stack size and thus the alignment of x.

>Fix:
	Changing ALIGNBYTES to (sizeof(double) - 1) in param.h and
	adding 4 to len (the stack size) in sys_execve in kern_exec.c
	(just after len = ALIGN(len)) does the trick, at least if only
	the kernel is rebuilt.

	But both of these changes may be a bit dangerous, and they
	don't seem ideal.  Both ALIGN and ALIGNBYTES are used in a
	number of places in the kernel, in libc, and elsewhere, and
	it may not be desirable to change the alignment in all those
	places.  I note that the arm port defines STACKALIGN and
	STACKALIGNBYTES macros in its param.h, which sound like good
	names to use in execve, but they're not used there.  Maybe
	they should be.

	Certainly it would not be friendly to other ports
	to arbitrarily add 4 to the stack size in execve, but maybe
	adding sizeof(int) or some such would be acceptable.
	I don't know the source of the additional change in
	the stack size by an odd multiple of 4 bytes.  I've looked
	a bit at crt0.c (lib/csu/i386_elf/crt0.c), which seems like a
	good candidate, but I haven't managed to figure out exactly
	what its effect on the stack is, so I'll leave it to someone
	familiar with that code to sort out whether it is at fault
	and whether the fix should go there or in execve.

	Also, given that this is probably going to be easy to break
	in the future, it might be a good idea to put something like
	the test program above into a regression test.
>Release-Note:
>Audit-Trail:
>Unformatted: