Subject: Re: Diagnosing dying hardware -- any suggestions?
To: Brian Buhrow <buhrow@lothlorien.nfbcal.org>
From: Steven M. Bellovin <smb@cs.columbia.edu>
List: port-i386
Date: 10/20/2006 11:13:41
On Fri, 20 Oct 2006 07:48:15 -0700, buhrow@lothlorien.nfbcal.org (Brian
Buhrow) wrote:

> 	Hello.  I have a relatively new P4 machine with 2GB of RAM which is
> running in a production environment under NetBSD-3.0_stable with sources
> around mid January 2006.  Lately, it's begun panicking with uvm_faults and
> illegal page faults and other spurious error messages.  I'm certain the
> hardware is at fault, but now the question is, does the problem lie with
> the memory sticks in the machine, 4 512MB sticks, or does it lie on the
> motherboard itself.  I've tried blowing the dust out of the board, and
> reseating the memory sticks, and also rearranging their order, but the
> misbehavior seems the same.
> 	So, what I'm wondering is if anyone can tell me, given a few samples
> of the output below, if it's probable that the trouble is with the RAM or
> with the board.  I'm assuming that if a given memory stick was bad, and
> it's now in a different place in the physical lineup of RAM, that perhaps
> the character of the faulting address would change, such that it is
> possible to say that the error moved, and thus it is RAM.
> Any ideas would be greatly appreciated, especially if someone can point to
> something and say "this means ram, this other thing means cache chips on
> board the motherboard, etc."
> 
Have you run memtest86 or memtest86+?  They'll tell you the failing
address.  (If I recall correctly, one of them will emit the list of
failing addresses in a form that Linux can read and honor, which is a
nice OS feature, I might add.)

Even if you can't map the addresses reported directly to memory sticks,
you can try removing a stick or two at a time and rerunning the
diagnostic.  Or, as you note, see if the failing address moves around if
you run multiple passes, or as you rearrange the layout.
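
To make the "does the failing address follow the stick?" check
concrete, here is a minimal Python sketch.  It assumes the four 512MB
sticks are mapped linearly and contiguously in slot order, with no
interleaving; many chipsets do interleave, so treat the result as a
hint only.  The function names, stick labels, and example addresses
are made up for illustration and are not memtest86 output.

```python
# Sketch only: assumes a linear, non-interleaved mapping of sticks to
# physical addresses.  Real chipsets may interleave across channels.
STICK_SIZE = 512 * 1024 * 1024  # 512MB per stick, as in the original post


def slot_for_address(addr, num_sticks=4, stick_size=STICK_SIZE):
    """Return the 0-based slot holding physical address `addr`."""
    if addr < 0 or addr >= num_sticks * stick_size:
        raise ValueError("address outside installed RAM")
    return addr // stick_size


def suspect_sticks(failing_addrs, layout):
    """Map failing addresses to stick labels.

    `layout` lists the (hypothetical) stick labels in slot order,
    e.g. ["A", "B", "C", "D"].
    """
    return {layout[slot_for_address(a, num_sticks=len(layout))]
            for a in failing_addrs}


# Hypothetical failing addresses from two diagnostic runs.
before = suspect_sticks([0x23456780], ["A", "B", "C", "D"])
after = suspect_sticks([0x03456780], ["B", "A", "D", "C"])

if before == after:
    print("error followed the stick:", sorted(before))
else:
    print("error stayed with the slot or moved; suspect the board")
```

If the same label shows up before and after the sticks are reordered,
the stick is the likely culprit; if the failure stays at the same
physical address regardless of which stick sits there, the board (or
that slot) looks more suspicious.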

		--Steven M. Bellovin, http://www.cs.columbia.edu/~smb