Subject: Re: Memory errors.  Maybe.
To: None <port-amd64@netbsd.org>
From: Richard Rauch <rkr@olib.org>
List: port-amd64
Date: 09/17/2004 07:44:29
For closure of the thread:
It had been suggested in the past that these problems appeared to
be memory related.  I was incredulous that (a) I had been so unlucky
as to have 2 memory modules go bad in succession, and (b) memory
that had worked 100% when new had suddenly failed after a
couple of months.  Because of alleged BIOS issues, and not-so-warm
views of the nForce3 that I was getting from some other quarters,
I was reluctant to throw $100 at more memory unless I was sure that
the problem was not a CPU or motherboard/BIOS issue.
Off-list, Mike Cheponis very helpfully pointed out that such failures
can occur even after a few months, if the memory falls slightly
out of spec for the performance that it is supposed to maintain.
Mike suggested a fan for the memory.  I was lacking the means to
construct a mounting bracket myself for a regular muffin fan and
unable to get a memory-specific fan off-the-(local)-shelf.  I
settled for a $9 heat sink.
Even without the thermal paste, the sink seems to have made a
difference.  After 2 complete passes, and ~90% of a third pass
(over 9.5 hours), there are no reported errors.
I'll let it continue a while longer, and then shut the system down,
apply the thermal paste to the heat sink, and then bring it back up.
Many thanks to Mike for not only diagnosing it as a definite memory
problem, but also diagnosing (or at least correctly theorizing)
the nature of the memory problem and suggesting a very cheap way
to fix it.  (Seeing as I have no other DDR 400 systems, I was
reluctant to buy a new memory module for this, if it wouldn't solve
my problem.)  I'm not sure whether the extra cooling is going to be
sufficient, or whether the memory is just going to continue to degrade.
I assume that the drop-off asymptotically stabilizes and it will
probably not get much worse.
Others had suggested memory, but were unable to explain why the
module would fail when it was so new, and offered no ideas better
than throwing much more money at a new memory module.
-- 
  "I probably don't know what I'm talking about."  http://www.olib.org/~rkr/