Subject: Re: Isolating NMI/memory problem with old SPARCserver 20
To: None <port-sparc@NetBSD.org>
From: Greg Earle <earle@isolar.DynDNS.ORG>
List: port-sparc
Date: 08/06/2005 14:26:28
On Aug 6, 2005, at 1:25 PM, Havard Eidnes wrote:
>> And, if so, how do I map it to the bad DIMM module?
>
> Upgrade to a newer version of NetBSD? ;-)

Its next "upgrade" will be to retirement, as soon as I
move its functions over to my dual-450 Ultra 60.  So that
isn't really an option ...

>> (Update: I just saw an old post to port-sparc from May 9th from
>> Malte Dehling; he reported a similar error, but his log also
>> shows a "module location: " identifier?  Mine doesn't - is this
>> a new reporting feature in NetBSD 2.0 or something?)
>
> Yes.  The code was added in revision 1.8 of memecc.c on 22 Mar 2004.
>
> Index: memecc.c
> ===================================================================
> RCS file: /u/nb/src/sys/arch/sparc/sparc/memecc.c,v
> retrieving revision 1.7
> retrieving revision 1.8
> diff -u -r1.7 -r1.8
> --- memecc.c    15 Jul 2003 00:05:06 -0000      1.7
> +++ memecc.c    22 Mar 2004 12:37:43 -0000      1.8
> @@ -142,6 +142,8 @@
>         printf("\tMBus transaction: %s\n",
>                 bitmask_snprintf(efar0, ECC_AFR_BITS, bits, sizeof(bits)));
>         printf("\taddress: 0x%x%x\n", efar0 & ECC_AFR_PAH, efar1);
> +       printf("\tmodule location: %s\n",
> +               prom_pa_location(efar1, efar0 & ECC_AFR_PAH));
>
>         /* Unlock registers and clear interrupt */
>         bus_space_write_4(memecc_sc->sc_bt, bh, ECC_FSR_REG, efsr);
>
> However, that came with another set of changes, the
> source-changes message was:
>
> Module Name:    src
> Committed By:   pk
> Date:           Mon Mar 22 12:37:43 UTC 2004
>
> Modified Files:
>         src/sys/arch/sparc/include: promlib.h
>         src/sys/arch/sparc/sparc: memecc.c memreg.c promlib.c
>
> Log Message:
> Leverage the PROM's ability to identify the on-board location of a
> physical memory address.
>
> To generate a diff of this commit:
> cvs rdiff -r1.18 -r1.19 src/sys/arch/sparc/include/promlib.h
> cvs rdiff -r1.7 -r1.8 src/sys/arch/sparc/sparc/memecc.c
> cvs rdiff -r1.37 -r1.38 src/sys/arch/sparc/sparc/memreg.c
> cvs rdiff -r1.31 -r1.32 src/sys/arch/sparc/sparc/promlib.c
>
> For a quick try, you could perhaps try to add those changes to
> your local source tree and run that kernel?  (It's not a given
> that this doesn't depend on some other change, but it's worth a
> try.)  That is, if your machine stays up long enough for you to
> patch and compile a new kernel...

It stays up; these aren't fatal.  And they're only occasional.

More of a problem is the fact that my versions of these 4 files
are ancient compared to the ones you've mentioned:

==> src/sys/arch/sparc/include/promlib.h <==
/*      $NetBSD: promlib.h,v 1.4 2001/09/26 20:53:07 eeh Exp $ */

==> src/sys/arch/sparc/sparc/memecc.c <==
/*      $NetBSD: memecc.c,v 1.3 2002/03/11 16:27:04 pk Exp $    */

==> src/sys/arch/sparc/sparc/memreg.c <==
/*      $NetBSD: memreg.c,v 1.32 2002/03/11 16:27:04 pk Exp $ */

==> src/sys/arch/sparc/sparc/promlib.c <==
/*      $NetBSD: promlib.c,v 1.13 2001/12/07 11:00:39 hannken Exp $ */

So I'm a bit afraid that these 4 diffs won't just drop right in ...
(I suppose I can try it and see, though)
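
For reading the existing "address:" line while the "module location:" code
is unavailable: the quoted printf prints the two halves of the fault address
back to back with plain %x, so a low word with leading zeros comes out
shorter than eight digits and the result can be misread. Below is a minimal
standalone sketch of how the two halves combine. It assumes the high part is
a 4-bit field masked by 0xf; that value and the sketch_* names are
illustrative assumptions, not taken from the kernel headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * ASSUMPTION: mask for the upper physical-address bits kept in the first
 * fault address register.  The real definition (ECC_AFR_PAH) lives in the
 * kernel's ECC register header and should be checked there.
 */
#define SKETCH_AFR_PAH 0x0000000fU

/* Combine the high bits (from efar0) with the low 32 bits (efar1). */
static uint64_t
sketch_fault_pa(uint32_t efar0, uint32_t efar1)
{
	return ((uint64_t)(efar0 & SKETCH_AFR_PAH) << 32) | efar1;
}

/*
 * Format the address with the low word zero-padded to eight hex digits.
 * Returns the number of characters snprintf would have written.
 */
static int
sketch_format_pa(char *buf, size_t len, uint32_t efar0, uint32_t efar1)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "0x%x%08x",
	    (unsigned)(efar0 & SKETCH_AFR_PAH), (unsigned)efar1);
}
```

Calling sketch_format_pa() with a high part of 0 and a low word of 0x1230
yields "0x000001230", where unpadded %x%x would yield "0x01230".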

>> I've got 3 64 MB DIMMs (in banks 0, 1 and 5) for a total of 192 MB,
>> so I could live without one of 'em temporarily ... what's weird is
>> that I did a "test-memory" from the boot PROM (with "selftest-#megs?"
>> set to all 192 MB) as well as booting in diag mode and having it test
>> memory there as well, and it didn't hiccup on that address ...
>
> It's not certain that the memory test in the prom is all that
> thorough.  It could also be heat-related, as someone else
> commented.

The odd thing about that (thanks for the suggestions, btw) is
that the machine is in a small room with a window unit A/C,
so theoretically it should never be getting all that hot.

Thanks,

	- Greg