Subject: Re: Memory errors. Maybe.
To: None <port-amd64@netbsd.org>
From: Richard Rauch <rkr@olib.org>
List: port-amd64
Date: 09/15/2004 04:45:13
On Wed, Sep 15, 2004 at 01:55:48AM -0500, Richard Rauch wrote:
 [...]

I've let memtest86+ run for a while to build up some stats.
(Also, for a little while yet, I am somewhat hampered in getting
the self-extracting BIOS upgrade images onto a WIN32 box
to expand them...)

The information may be of interest to someone, so I'm posting
it.  It may also remind someone of a problem that they've had
with AMD64 systems, and we might find some common points.


>  b) Errors seem to always come in pairs.  The second one is
>     always approx. 16MB after the first one.  Usually (always?)
>     the second digit is 2 less than the first digit.

 b.1) The difference is (16MB - 0x20).

 b.2) If the second in a pair of errors falls close enough to
      the other (less than a meg, certainly), I saw at least one
      case where you get errors at A, A+16MB-0x20, A+16MB-0x20+<fudge>,
      where <fudge> is some small value---i.e., a triple of errors.
      I only saw this happen directly once, but I know it must
      have happened on other occasions, as the total number of
      errors was odd at times and then even again.  I don't know
      how far the "odd man out" was from the nearest other error.
      This suggests a cache issue to me.  (That's a little easier
      for me to believe than that I've had two consecutive bad
      memory modules.)
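
To make the offset in b.1 concrete, here is a small Python sketch of the
arithmetic.  The address used is made up for illustration and is not taken
from the memtest86+ output; the helper name is_pair is likewise hypothetical.

```python
# Offset between the two errors of a pair, as described in b.1:
# 16MB minus 0x20 bytes.
MB = 1024 * 1024
PAIR_OFFSET = 16 * MB - 0x20


def is_pair(first, second):
    """Return True if `second` lies exactly 16MB - 0x20 above `first`."""
    return second - first == PAIR_OFFSET


# Hypothetical failing address, for illustration only.
a = 0x12345678
b = a + PAIR_OFFSET

print(hex(PAIR_OFFSET))  # 0xffffe0
print(hex(b))            # 0x13345658
print(is_pair(a, b))     # True
```

A check like this only recognizes exact pairs; the <fudge> case in b.2
would need a tolerance added to the comparison.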


>  c) The "Good" column in memtest86+ is usually all 0s, the
>     "Bad" is 00000004 (or 0 if "Good" was 4), and the Err-Bits
>     is always 00000004.

 c.1) Err-Bits always seems to be all 0s except for a single 4
      digit.  I saw one case where a fairly high-order f nybble
      became a b nybble.

 c.2) "Bad" is almost always presented with 1 extra bit turned on,
      rather than off.
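
The relation between the three columns can be sketched in Python, assuming
Err-Bits is simply the XOR of the "Good" and "Bad" words.  The values in
the last case are hypothetical; in particular, the position of the f
nybble is invented, since only "fairly high-order" was noted.

```python
def err_bits(good, bad):
    """Bits that differ between the expected and the observed 32-bit word."""
    return (good ^ bad) & 0xFFFFFFFF


def fmt(word):
    """Format a 32-bit word as 8 hex digits, like the memtest86+ columns."""
    return f"{word:08x}"


# Usual case from c): Good is all 0s, Bad has one extra bit turned on.
print(fmt(err_bits(0x00000000, 0x00000004)))  # 00000004

# Reverse case from c): Good was 4, Bad is 0.
print(fmt(err_bits(0x00000004, 0x00000000)))  # 00000004

# c.1 case: an f nybble becoming a b nybble is the "4" bit of that
# nybble being turned off.  Nybble position is hypothetical.
print(fmt(err_bits(0xFFFFFFFF, 0xFFBFFFFF)))  # 00400000
```

In all three cases the difference is a single bit with value 4 within its
nybble; only the last one is a bit turned off rather than on.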

>  d) As the tests run multiple times, they seem to have started
>     at 10, spiked a bit on the second or third pass, and in
>     passes 4, 5, and 6 have dropped off to 4 to 6 errors.  Maybe
>     that's just statistical anomaly, but suggests to me some kind
>     of "warming up" and stabilizing of something...

 d.1) As the test has continued (up to almost 4 hours now), there
      have been returns to larger numbers of errors per pass, and
      some passes with no errors.  (Currently at 22 passes, it has
      194 errors logged, so nearly 9 errors per pass on average.)


-- 
  "I probably don't know what I'm talking about."  http://www.olib.org/~rkr/