Subject: Re: Isolating NMI/memory problem with old SPARCserver 20
To: None <earle@isolar.DynDNS.ORG>
From: Havard Eidnes <he@NetBSD.org>
List: port-sparc
Date: 08/06/2005 22:25:47
> This is just a memory DIMM going bad, right?

Most probably.

> And, if so, how do I map it to the bad DIMM module?

Upgrade to a newer version of NetBSD? ;-)

> (Update: I just saw an old post to port-sparc from May 9th from
> Malte Dehling; he reported a similar error, but his log also
> shows a "module location: " identifier?  Mine doesn't - is this
> a new reporting feature in NetBSD 2.0 or something?)

Yes.  The code was added in revision 1.8 of memecc.c on 22 Mar 2004.

Index: memecc.c
===================================================================
RCS file: /u/nb/src/sys/arch/sparc/sparc/memecc.c,v
retrieving revision 1.7
retrieving revision 1.8
diff -u -r1.7 -r1.8
--- memecc.c    15 Jul 2003 00:05:06 -0000      1.7
+++ memecc.c    22 Mar 2004 12:37:43 -0000      1.8
@@ -142,6 +142,8 @@
        printf("\tMBus transaction: %s\n",
                bitmask_snprintf(efar0, ECC_AFR_BITS, bits, sizeof(bits)));
        printf("\taddress: 0x%x%x\n", efar0 & ECC_AFR_PAH, efar1);
+       printf("\tmodule location: %s\n",
+               prom_pa_location(efar1, efar0 & ECC_AFR_PAH));
 
        /* Unlock registers and clear interrupt */
        bus_space_write_4(memecc_sc->sc_bt, bh, ECC_FSR_REG, efsr);
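
The "address:" line in the diff above prints one physical address out of
two registers: the high bits from efar0 and the low 32 bits from efar1.
A minimal sketch in C of combining and formatting them follows. It
assumes the high bits sit in the low nibble of the first register, as
the use of ECC_AFR_PAH suggests; EXAMPLE_AFR_PAH, example_fault_pa() and
example_format_pa() are hypothetical names for illustration, not kernel
symbols. Note that a plain "0x%x%x" format does not zero-pad the low
word, so the sketch pads it to 8 hex digits.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Assumption: the low 4 bits of the first fault-address register hold
 * PA[35:32].  This mask is illustrative and is not copied from the
 * kernel headers.
 */
#define EXAMPLE_AFR_PAH 0x0000000fU

/*
 * Combine the high bits (from efar0) and the low word (efar1) into a
 * single 64-bit physical address.
 */
static uint64_t
example_fault_pa(uint32_t efar0, uint32_t efar1)
{
	return ((uint64_t)(efar0 & EXAMPLE_AFR_PAH) << 32) | efar1;
}

/*
 * Format the address as hex, zero-padding the low word to 8 digits so
 * that the high and low parts cannot run together ambiguously.
 * Returns what snprintf() returns.
 */
static int
example_format_pa(char *buf, size_t len, uint32_t efar0, uint32_t efar1)
{
	return snprintf(buf, len, "0x%x%08" PRIx32,
	    (unsigned)(efar0 & EXAMPLE_AFR_PAH), efar1);
}
```

With the high nibble equal to 1 and a low word of 0x1000, the unpadded
format would give 0x11000, whereas the padded one gives 0x100001000.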

However, that came with another set of changes; the
source-changes message was:


Module Name:    src
Committed By:   pk
Date:           Mon Mar 22 12:37:43 UTC 2004

Modified Files:
        src/sys/arch/sparc/include: promlib.h
        src/sys/arch/sparc/sparc: memecc.c memreg.c promlib.c

Log Message:
Leverage the PROM's ability to identify the on-board location of a
physical memory address.


To generate a diff of this commit:
cvs rdiff -r1.18 -r1.19 src/sys/arch/sparc/include/promlib.h
cvs rdiff -r1.7 -r1.8 src/sys/arch/sparc/sparc/memecc.c
cvs rdiff -r1.37 -r1.38 src/sys/arch/sparc/sparc/memreg.c
cvs rdiff -r1.31 -r1.32 src/sys/arch/sparc/sparc/promlib.c


As a quick experiment, you could apply those changes to your
local source tree and run the resulting kernel.  (There's no
guarantee that they don't depend on some other change, but it's
worth a try.)  That is, if your machine stays up long enough for
you to patch and compile a new kernel...

> I've got 3 64 MB DIMMs (in banks 0, 1 and 5) for a total of 192 MB,
> so I could live without one of 'em temporarily ... what's weird is
> that I did a "test-memory" from the boot PROM (with "selftest-#megs?"
> set to all 192 MB) as well as booting in diag mode and having it test
> memory there as well, and it didn't hiccup on that address ...

The memory test in the PROM may not be all that thorough.  The
fault could also be heat-related, as someone else commented.

Regards,

- Håvard