Subject: Re: random userland errors
To: der Mouse <mouse@Rodents.Montreal.QC.CA>
From: Eduardo E. Horvath <eeh@one-o.com>
List: port-sparc
Date: 03/03/2000 10:32:02
> I've got an ELC.  I'm trying to move a mud I co-run to it.  The mud
> takes up nearly 30M of virtual space, so I loaded the ELC up on memory.
> 
> Trouble is, it crashes randomly.
> 
> My first suspicion would be flaky hardware, probably memory.  But I've
> been swapping memory, and it still crashes, even on completely
> different memory (I've tried two 16M SIMMs = 32M, I've tried 3x4M+1x16M
> (a different 16M), I've tried various other combinations...none of them
> help).
> 
> I've tried a total of three different CPU boards.  No visible
> difference.
> 
> This leaves only two possibilities, to my mind: (1) bugs in the program
> itself and (2) bugs in the kernel.  I'm inclined to discount (1) as a
> possibility because the crashes are almost all "impossible" crashes
> (stuff like strcpy from a nil pointer, when the strcpy call is
> protected by an if testing the pointer); also, the same code seems
> happy elsewhere (heavily stressed on a non-NetBSD 68k machine; less
> heavily stressed on a SunOS SPARC).  And I can't see how (2) could do
> it either, since most of the tests are done with enough memory that
> there's no way it's even thinking about paging.  Usually the stack is
> complete garbage, as if the stack pointer got trashed.  I suppose it
> could be failing to restore registers correctly on a context switch,
> but I can't see how a bug like that would go undetected for so long;
> besides, I have another NetBSD/sparc machine that pounds much more
> heavily on the context switch code and it shows no such symptoms.
> 
> Which leaves me with no hypotheses.  Does anyone have *any* ideas?
> Anyone with experience tracking down this sort of very-intermittent
> trouble?

I have several ideas, but you probably won't like them.

When I try to hunt these sorts of things down, I add a circular buffer
to the kernel that traces every trap and saves the registers at the
time, so you know what the system was doing before the error.  This may
cause noticeable system performance degradation.  I also put a
breakpoint in the kernel's signal delivery or address fault failure code
path so the system stops when the error occurs but before the process
terminates.  Then just wait until the system breaks into ddb.
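
A rough sketch of the kind of trace buffer I mean, in plain C.  All of
the names here (traptrace_record, TRAPTRACE_ENTRIES and so on) are made
up for illustration and are not existing kernel interfaces; the real
thing would be called from the trap entry path with interrupts blocked,
and would save whichever registers you care about:

```c
#include <stddef.h>
#include <stdint.h>

#define TRAPTRACE_ENTRIES 256	/* hypothetical size; old entries get overwritten */
#define TRAPTRACE_NREGS   8	/* hypothetical number of registers snapshotted */

struct traptrace_entry {
	uint64_t seq;			/* running count of recorded traps */
	uint32_t type;			/* trap type */
	uint32_t pc;			/* PC at the time of the trap */
	uint32_t sp;			/* stack pointer at the time of the trap */
	uint32_t regs[TRAPTRACE_NREGS];	/* register snapshot */
};

struct traptrace {
	uint64_t next;			/* total entries ever written */
	struct traptrace_entry ring[TRAPTRACE_ENTRIES];
};

/* Record one trap, overwriting the oldest entry once the ring is full. */
void
traptrace_record(struct traptrace *t, uint32_t type, uint32_t pc,
    uint32_t sp, const uint32_t *regs)
{
	struct traptrace_entry *e = &t->ring[t->next % TRAPTRACE_ENTRIES];
	size_t i;

	e->seq = t->next;
	e->type = type;
	e->pc = pc;
	e->sp = sp;
	for (i = 0; i < TRAPTRACE_NREGS; i++)
		e->regs[i] = regs[i];
	t->next++;
}

/*
 * Return the entry `back' traps before the newest one (0 = newest),
 * or NULL if it was never recorded or has already been overwritten.
 */
const struct traptrace_entry *
traptrace_recent(const struct traptrace *t, uint64_t back)
{
	uint64_t held = t->next < TRAPTRACE_ENTRIES ?
	    t->next : TRAPTRACE_ENTRIES;

	if (back >= held)
		return NULL;
	return &t->ring[(t->next - 1 - back) % TRAPTRACE_ENTRIES];
}
```

The fixed-size ring is the point: recording stays cheap and bounded, and
once you are stopped in the debugger the last few hundred traps are
still there to walk backwards through.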

When the system stops, I generally go through the register stack and
disassemble the userland code to determine whether the data in the
registers corresponds to the data in memory.  If it doesn't, you know
that the data was corrupted either during the load by data cache
issues, or afterwards by context switch issues.  Then you can check
your trap trace buffer to try to determine what the registers looked
like at earlier traps.
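
To illustrate that last step: given saved register snapshots ordered
oldest to newest, you can scan backwards for the most recent trap at
which a register still held the value you expect from memory.  This is
only a sketch; struct reg_snapshot and last_good_snapshot are
hypothetical names, and since registers change legitimately between
traps the result is a hint about where to look, not proof:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-trap snapshot; a real one would hold more state. */
struct reg_snapshot {
	uint32_t pc;		/* PC at the time of the trap */
	uint32_t regs[8];	/* saved registers */
};

/*
 * Scan from newest to oldest and return the index of the most recent
 * snapshot in which regs[regno] == expected, or -1 if there is none.
 * The corruption then happened somewhere after that trap.
 */
int
last_good_snapshot(const struct reg_snapshot *snaps, size_t n,
    unsigned regno, uint32_t expected)
{
	size_t i;

	for (i = n; i-- > 0; )
		if (snaps[i].regs[regno] == expected)
			return (int)i;
	return -1;
}
```

If the register was still sane at trap N and garbage at trap N+1, the
trap types and PCs of those two entries tell you which kernel path to
be suspicious of.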

A good book I can recommend that might help in debugging this is
_Panic!_ by Chris Drake and, uh, I forget who.  It should be available
through Sun and specifically explains how to analyze weird behavior on
SPARCs.

=========================================================================
Eduardo Horvath				eeh@netbsd.org
	"I need to find a pithy new quote." -- me