Subject: Memory errors. Maybe.
To: None <port-amd64@netbsd.org>
From: Richard Rauch <rkr@olib.org>
List: port-amd64
Date: 09/15/2004 01:55:48
This is a point of curiosity more so than an immediate request for
answers.  I should (finally) be able to cross-check this issue in
a day or so.  At that point, I hope to know whether I need to
replace the system memory or treat this as a motherboard issue.



Here's the situation: I have an nVidia nForce3 chipset on my motherboard.
This chipset has apparently had problems.  (It's the 150 revision, I believe.)

NetBSD has problems.  GNU/Linux has fewer.  GNU/Linux reports an
unresolved BIOS issue, recommends upgrading the BIOS, and claims to be
applying a workaround, but warns that there may still be segfaults.
Even with that unspecified workaround, I've had GNU/Linux crash.

When I raised this issue on this list before, someone initially
thought it wasn't serious, but after investigating concluded that
it was more serious than it had first appeared.

Until recently, no BIOS update had been available for my motherboard
since January or before.  One has now become available, and I'll be
trying it to see what happens.


I've also run memtest86+ on the system.  (The older memtest86
would not run on this machine.)

The results of memtest86+ are interesting.

On most tests, all looks okay.

On test number 5 ("Block move, 64 moves, cached"), I get about 10
errors on each pass.  The errors are at different addresses each
time through, and there are different numbers of errors.  But there
are some interesting commonalities:

 a) Errors are always at addresses ending in 0x8.  The
    second digit is always even, I think.  (All digits in
    hex.)
 b) Errors seem to always come in pairs.  The second one is
    always approx. 16MB after the first one.  Usually (always?)
    the second digit is 2 less than the first digit.
 c) The "Good" column in memtest86+ is usually all 0s, the
    "Bad" is 00000004 (or 0 if "Good" was 4), and the Err-Bits
    is always 00000004.
 d) Over repeated passes, the error counts seem to have started
    at 10, spiked a bit on the second or third pass, and in
    passes 4, 5, and 6 have dropped off to 4 to 6 errors.  Maybe
    that's just a statistical anomaly, but it suggests to me some
    kind of "warming up" and stabilizing of something...
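
For anyone who wants to check a list of logged errors against
observations (a) through (c), here is a minimal sketch in Python.  The
sample records are invented for illustration and are not copied from
the memtest86+ screen.  The function names and the 1MB tolerance used
for "approx. 16MB" are assumptions as well.

```python
MB = 1024 * 1024


def err_bits(good, bad):
    """Bits that differ between the expected and the observed word."""
    return good ^ bad


def ends_in_8(addr):
    """True if the last hex digit of the address is 8 (observation a)."""
    return addr & 0xF == 0x8


def paired_errors(addrs, gap=16 * MB, tol=1 * MB):
    """Return (first, second) address pairs roughly `gap` bytes apart.

    The tolerance `tol` is an assumed reading of "approx. 16MB".
    """
    addrs = sorted(addrs)
    pairs = []
    for i, first in enumerate(addrs):
        for second in addrs[i + 1:]:
            if abs((second - first) - gap) <= tol:
                pairs.append((first, second))
    return pairs


# Hypothetical records: (address, good, bad).  Invented for illustration.
sample = [
    (0x12345678, 0x00000000, 0x00000004),
    (0x12345678 + 16 * MB, 0x00000004, 0x00000000),
]

for addr, good, bad in sample:
    print(hex(addr), ends_in_8(addr), hex(err_bits(good, bad)))

print(paired_errors([addr for addr, _, _ in sample]))
```

Err-Bits is the XOR of the Good and Bad words, so a Good of 0 with a
Bad of 4, or the reverse, both give 00000004: the same single bit
flipping in either direction.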

These behaviors make me wonder if I am looking at a bad motherboard.
Are there known (software-fixable) motherboard/BIOS issues that could
cause this kind of thing?  Or should I be asking the memtest86+ people?
Could memory failure really cause problems like this?


Another data point is that with the original BIOS and some NetBSD from
late last year or early this year, the system was *rock* solid.  If this
is a hardware problem, it is one that required some burn-in to set in.

I've also already replaced RAM in this box.  (The first time it did
help...but at the time, I did not have memtest86+, so I do not know
if the above problems would have shown up...  The RAM was under
warranty and I just exchanged it.)


Again, I hope to find out more over the next 24 hours or so: I will
upgrade the system BIOS here, and hope to test this memory in another
computer.  (I have about 6 machines, perhaps 2 of which actually
take the same kind of memory as one another; none use the same kind
as this AMD64.  But I have a friend who I think has a machine that
takes the kind of memory in this machine.)

This is just a query to see if anyone has any interesting insight
into this problem as I describe it.


Thanks in advance for any responses.


-- 
  "I probably don't know what I'm talking about."  http://www.olib.org/~rkr/