Subject: Re: Isolating memory error
To: None <port-sparc@NetBSD.org>
From: der Mouse <mouse@Rodents.Montreal.QC.CA>
List: port-sparc
Date: 09/09/2003 04:52:52
[my best guess at incorrect linebreaks removed -dM]
> Sep  8 11:45:37 isolar /netbsd: NMI: system interrupts: 10000000<VME=0,SBUS=0,M>
> Sep  8 11:45:37 isolar /netbsd: memory error:
> Sep  8 11:45:37 isolar /netbsd:         EFSR: 6e01<CE,DW=0,SYNDROME=6e>
> Sep  8 11:45:37 isolar /netbsd:         MBus transaction: 8fffcd50<VAH=0,TYPE=5,SIZE=5,C,VA=ff,S,MID=8>
> Sep  8 11:45:37 isolar /netbsd:         address: 0x01e76b89c

While it is disquieting that this happened so soon after adding
memory, the timing could be pure coincidence.  The error was a
correctable error ("CE") and presumably was corrected.  (The doubleword
index (0) and syndrome value (6e) might be of use to you if you were
trying to debug the ECC hardware, or identify exactly which memory cell
was responsible, but it seems unlikely that you care about identifying
the responsible hardware to any finer granularity than "which SIMM".)

If it doesn't happen again, I'd be inclined to ignore it.  If it
happens again, especially at the same address, some RAM may need
reseating or replacing.

> Any way to isolate this to the affected SIMM?

0x01e76b89c says which SIMM, though not as pellucidly as you might
wish: it's the one containing that physical address.

My best guess at working out which SIMM it is follows.  I am not at all
sure I have this right; if someone knows better, please correct me.

On a 20, SIMMs appear 64MB (the max SIMM size) apart.  Their base
addresses are multiples of 0x04000000.  The address you cite,
0x01e76b89c, is 7*0x04000000 + 0x276b89c.  Thus, counting SIMMs from 0,
it is in SIMM number 7 (which is obviously a 64M SIMM, since it
contains an address above the 32M mark).  I don't know whether the 20's
SIMM socket numbers bear a simple relation to their physical locations,
but if you turn on diag mode in the ROMs ("setenv diag-mode? true" or
"setenv diag-switch? true" or some such - check "printenv") you'll get
a dump of what memory is present near the end of the POST.  You can
then find out which one it is by pulling SIMMs until number 7 is
reported empty on power-up.  I have memories from adding memory to my
own 20 which may be incorrect after this long (it was months ago) but,
if they are correct, indicate that sockets 6 and 7 are the two nearest
the SBus connectors, the two with the small extra VSIMM socket added on
to the side.
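
For anyone who wants to repeat the arithmetic for a different address,
here is a minimal sketch in Python.  It assumes only what I guessed at
above, that SIMM base addresses are 0x04000000 apart and SIMMs are
counted from 0; the names SIMM_SPACING and locate_simm are made up for
illustration and are not taken from any kernel or ROM source.

```python
# Assumption from the text above: on a 20, SIMM base addresses are
# 64MB (0x04000000) apart, and SIMMs are counted from 0.
SIMM_SPACING = 0x04000000


def locate_simm(phys_addr, spacing=SIMM_SPACING):
    """Return (simm_number, offset_within_simm) for a physical address."""
    return divmod(phys_addr, spacing)


simm, offset = locate_simm(0x01e76b89c)
print(f"SIMM {simm}, offset {offset:#x}")

# An offset at or above 32MB means the SIMM must be a 64M one.
print("offset is above the 32M mark:", offset >= 32 * 1024 * 1024)
```

If the spacing guess is wrong for your machine, only SIMM_SPACING needs
changing; the result is still a SIMM number in address order, not
necessarily a physical socket position.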

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B