Subject: random userland errors
To: None <port-sparc@netbsd.org>
From: der Mouse <mouse@Rodents.Montreal.QC.CA>
List: port-sparc
Date: 03/03/2000 02:10:27

I've got an ELC.  I'm trying to move a mud I co-run to it.  The mud
takes up nearly 30M of virtual space, so I loaded the ELC up with memory.

Trouble is, it crashes randomly.

My first suspicion would be flaky hardware, probably memory.  But I've
been swapping memory, and it still crashes, even on completely
different memory.  I've tried two 16M SIMMs (32M total), I've tried
3x4M+1x16M (a different 16M SIMM), and I've tried various other
combinations; none of them help.

I've tried a total of three different CPU boards.  No visible
difference.

This leaves only two possibilities, to my mind: (1) bugs in the program
itself and (2) bugs in the kernel.  I'm inclined to discount (1) as a
possibility because the crashes are almost all "impossible" crashes
(stuff like strcpy from a nil pointer, when the strcpy call is
protected by an if testing the pointer); also, the same code seems
happy elsewhere (heavily stressed on a non-NetBSD 68k machine; less
heavily stressed on a SunOS SPARC).  And I can't see how (2) could do
it either, since most of the tests are done with enough memory that
there's no way it's even thinking about paging.  Usually the stack is
complete garbage, as if the stack pointer got trashed.  I suppose it
could be failing to restore registers correctly on a context switch,
but I can't see how a bug like that would go undetected for so long;
besides, I have another NetBSD/sparc machine that pounds much more
heavily on the context switch code and it shows no such symptoms.
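
To make "impossible" concrete, the crashing code has roughly the shape
sketched below.  This is an illustration only, not the mud's actual
source: copy_name and NAME_MAX_LEN are made-up names, and the length
check is an addition to keep the sketch safe on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical buffer size; not taken from the real code. */
#define NAME_MAX_LEN 64

/*
 * Copy src into dst only when src is non-nil and fits.
 * Returns 1 if the copy happened, 0 if it was skipped (dst is then
 * set to the empty string).  dst must point at NAME_MAX_LEN bytes.
 */
int copy_name(char *dst, const char *src)
{
    assert(dst != NULL);

    /* The guard: strcpy is reached only if src tested non-nil. */
    if (src != NULL && strlen(src) < NAME_MAX_LEN) {
        strcpy(dst, src);
        return 1;
    }
    dst[0] = '\0';
    return 0;
}
```

With the pointer tested immediately before the call, a fault inside
strcpy on a nil src suggests the value changed between the test and the
call, which is what makes this look like corrupted machine state rather
than an ordinary program bug.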

Which leaves me with no hypotheses.  Does anyone have *any* ideas?
Anyone with experience tracking down this sort of very-intermittent
trouble?

					der Mouse

			       mouse@rodents.montreal.qc.ca
		     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B