Subject: Re: Diagnosing dying hardware -- any suggestions?
To: Steven M. Bellovin <smb@cs.columbia.edu>
From: Michal Suchanek <hramrach@centrum.cz>
List: port-i386
Date: 10/20/2006 22:27:40
On 10/20/06, Steven M. Bellovin <smb@cs.columbia.edu> wrote:
> On Fri, 20 Oct 2006 07:48:15 -0700, buhrow@lothlorien.nfbcal.org (Brian
> Buhrow) wrote:
>
> > Any ideas would be greatly appreciated, especially if someone can point to
> > something and say "this means ram, this other thing means cache chips on
> > board the motherboard, etc."
> >
> Have you run memtest86 or memtest86+?  They'll tell you the failing
> address.  (If I recall correctly, one of them will emit the list of
> failing addresses in a form that Linux can read and honor, a nice OS
> feature, I might add...)
>
> Even if you can't map the addresses reported directly to memory sticks,
> you can try removing a stick or two at a time and rerunning the
> diagnostic.  Or, as you note, see if the failing address moves around if
> you run multiple passes, or as you rearrange the layout.
>
Memtest is also useful for narrowing down where the fault lies. If there
are only a few errors, clustered around one or a few addresses (and
possibly not showing up on every pass), you probably have bad RAM.
If there are many errors all over the place (especially errors with
the same pattern), you probably have a chipset fault or misconfiguration.
If memtest itself locks up as well, it is probably the board or the
CPU cooler.
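
To make the rule of thumb above concrete, here is a rough sketch in Python.
It is not something memtest produces or reads; the function name, the
(address, error-bits) input format and the thresholds are all my own
assumptions, and "same pattern" is taken here to mean the same set of
flipped bits. You would have to copy the failing addresses off the screen
by hand.

```python
from collections import Counter

# Hypothetical helper: none of these names or thresholds come from memtest.
def classify(errors, locked_up=False, region_size=1 << 20, max_regions=2):
    """Guess the faulty part from hand-recorded memory test results.

    errors      -- list of (address, error_bits) tuples, where error_bits is
                   assumed to be expected XOR actual for the failing word
    locked_up   -- True if the tester itself hung during the run
    region_size -- bytes per region used to decide whether errors cluster
                   (1 MiB is an arbitrary choice)
    max_regions -- how many distinct regions still count as "one or a few"
    """
    if locked_up:
        return "board or CPU cooler"
    if not errors:
        return "no errors recorded"

    # Bucket failing addresses into coarse regions to see if they cluster.
    regions = {addr // region_size for addr, _ in errors}
    if len(regions) <= max_regions:
        return "bad RAM"

    # Errors are spread out; check whether one bit pattern dominates.
    counts = Counter(bits for _, bits in errors)
    top_share = counts.most_common(1)[0][1] / len(errors)
    if top_share >= 0.5:  # arbitrary cut-off for "the same pattern"
        return "chipset error/misconfiguration (same pattern)"
    return "chipset error/misconfiguration"


if __name__ == "__main__":
    # Two errors close together in one region.
    print(classify([(0x00101000, 0x04), (0x00101040, 0x04)]))
    # Ten errors spread over ten regions, all with the same flipped bit.
    print(classify([(i * 3 * (1 << 20), 0x80) for i in range(10)]))
```

The region size and the 50% cut-off are only there to make the idea
runnable; treat the result as a hint about what to swap out first, not
as a verdict.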

Thanks

Michal