Subject: Re: Memory errors. Maybe.
To: None <port-amd64@netbsd.org>
From: Richard Rauch <rkr@olib.org>
List: port-amd64
Date: 09/15/2004 04:45:13
On Wed, Sep 15, 2004 at 01:55:48AM -0500, Richard Rauch wrote:
[...]
I've let memtest86+ run for a while to build up some stats.
(Also, for a little while yet, I am somewhat hampered in getting
the self-extracting BIOS upgrade images onto a WIN32 box
to expand them...)
The information may be of some interest to someone, so I'm
posting it. It may also remind someone of a problem
that they've had with AMD64 systems, and we might find some
common points.
> b) Errors seem to always come in pairs. The second one is
> always approx. 16MB after the first one. Usually (always?)
> the second digit is 2 less than the first digit.
b.1) The difference is (16MB - 0x20).
b.2) If the second in a pair of errors falls close enough to
the other (less than a meg, certainly), I saw at least one
case where you get errors at A, A+16MB-0x20, A+16MB-0x20+<fudge>,
where <fudge> is some small value---i.e., a triple of errors.
I only saw this happen once on screen. I know it must
have happened on other occasions, as the total number of errors
was odd at one point and even later. I don't know how far the
"odd man out" was from the nearest other error. This suggests to
me a cache issue. (That's a little easier for me to believe than
that I've had two consecutive bad memory modules.)
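For anyone who wants to check a list of failing addresses against the
pattern in b.1 and b.2, here is a minimal Python sketch. The function
name, the fudge window of 0x100, and the sample addresses are all
made up for illustration; only the offset (16MB - 0x20) comes from the
memtest86+ output described above.

```python
MB = 1 << 20
PAIR_OFFSET = 16 * MB - 0x20   # 0xFFFFE0, the spacing reported in b.1


def find_pairs(addresses, offset=PAIR_OFFSET, fudge=0x100):
    """Group failing addresses that sit `offset` (plus a small fudge) apart.

    `fudge` is a hypothetical window; the post only says it is "small".
    Returns (groups, unpaired), where each group is (first, [partners]).
    """
    addrs = sorted(set(addresses))
    used = set()
    groups = []
    for a in addrs:
        if a in used:
            continue
        # Partners are later addresses within [offset, offset + fudge] of a.
        partners = [b for b in addrs
                    if b not in used and offset <= b - a <= offset + fudge]
        if partners:
            used.add(a)
            used.update(partners)
            groups.append((a, partners))
    unpaired = [a for a in addrs if a not in used]
    return groups, unpaired


# Invented addresses: one triple (A, A+offset, A+offset+0x10) and one loner.
sample = [0x01000040, 0x02000020, 0x02000030, 0x05000000]
groups, unpaired = find_pairs(sample)
for first, partners in groups:
    print(hex(first), [hex(p) for p in partners])
print("unpaired:", [hex(a) for a in unpaired])
```

An address left in the unpaired list would be an "odd man out" worth
looking at by hand.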
> c) The "Good" column in memtest86+ is usually all 0s, the
> "Bad" is 00000004 (or 0 if "Good" was 4), and the Err-Bits
> is always 00000004.
c.1) The Err-Bits value seems to always be all 0s except for a
single 4 digit. I saw one case where a fairly high-order f nybble
became a b nybble.
c.2) "Bad" is almost always presented with 1 extra bit turned on,
rather than off.
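The Good/Bad/Err-Bits relationship in c, c.1 and c.2 can be checked
with a few lines of Python. This is only a sketch: it assumes Err-Bits
is the XOR of the Good and Bad words, the function name is mine, and
the position of the f/b nybble below is invented, since I did not note
exactly where it was.

```python
def classify(good, bad):
    """Return (err_bits, number_of_flipped_bits, direction) for one error.

    Assumes err_bits is good XOR bad. direction is "on" if the bad word
    only gained bits, "off" if it only lost bits, "mixed" otherwise.
    """
    err_bits = good ^ bad
    flipped = bin(err_bits).count("1")
    turned_on = bad & err_bits     # bits set in bad but not in good
    turned_off = good & err_bits   # bits set in good but not in bad
    if turned_on and not turned_off:
        direction = "on"
    elif turned_off and not turned_on:
        direction = "off"
    else:
        direction = "mixed"
    return err_bits, flipped, direction


# The usual case: Good all 0s, Bad 00000004.
print(classify(0x00000000, 0x00000004))
# The reverse case: Good was 4, Bad is 0.
print(classify(0x00000004, 0x00000000))
# An f nybble becoming a b nybble (nybble position is hypothetical).
err, n, direction = classify(0x0F000000, 0x0B000000)
print("%08x" % err, n, direction)
```

Note that f versus b differs only in the 4 bit of that nybble, so that
case is also a single-bit error, just with the bit turned off.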
> d) As the tests run multiple times, they seem to have started
> at 10, spiked a bit on the second or third pass, and in
> passes 4, 5, and 6 have dropped off to 4 to 6 errors. Maybe
> that's just statistical anomaly, but suggests to me some kind
> of "warming up" and stabilizing of something...
d.1) As the test has continued (up to almost 4 hours now), there
have been returns to larger numbers of errors per pass and
some passes with no errors. (Currently at 22 passes, it has
194 errors logged, so nearly 9 errors per pass on average.)
--
"I probably don't know what I'm talking about." http://www.olib.org/~rkr/