Subject: Re: SS20 SMP panic
To: Tillman Hodgson <tillman@seekingfire.com>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: port-sparc
Date: 01/17/2005 20:04:22
On Mon, Jan 17, 2005 at 07:48:38AM -0600, Tillman Hodgson wrote:
> On Sun, Jan 16, 2005 at 12:02:45AM -0600, Tillman Hodgson wrote:
> > The last thing logged in /var/log/messages:
> > 
> > Jan 15 23:17:06 surya /netbsd: Async registers (mid 9): afsr=3c00<SE,UC,TO,BE,AFA=0>; afva=0x00
> > Jan 15 23:17:06 surya /netbsd: Async registers (mid 8): afsr=3c00<SE,UC,TO,BE,AFA=0>; afva=0x00
> > Jan 15 23:17:06 surya /netbsd: nmi_hard: SMP botch.cpu0: NMI: system interrupts: 10090000<VME=0,SBUS=0,E,T,M>
> 
> The machine died again at 3:13 last night:
> 
> Jan 17 03:15:01 surya /netbsd: Async registers (mid 9): afsr=3c00<SE,UC,TO,BE,AFA=0>; afva=0x00
> Jan 17 03:15:01 surya /netbsd: Async registers (mid 8): afsr=3c00<SE,UC,TO,BE,AFA=0>; afva=0x00
> Jan 17 03:15:01 surya /netbsd: cpu0: NMI: system interrupts: 10080000<VME=0,SBUS=0,T,M>
> Jan 17 03:15:01 surya /netbsd: memory error:
> 
> Oddly, it was still responding to pings.
> 
> The 3:15 time is suspicious because that's when the daily scripts run,
> and I have a /etc/daily.local that performs a backup on the disk via
> a gzip'ed tar to a second disk. Gzip would use a lot of CPU time, so
> both times have been when the CPUs are very busy.
> 
> I'm not sure what to make of the error messages. A memory error would
> seem to be a different thing than a CPU problem, unless perhaps it's a
> cache problem. In any case, if anyone can read the error messages well
> enough to tell whether I ought to be pulling RAM sticks until the
> problem goes away or swapping the CPUs out I'd be most appreciative :-)

Hmm, it's possible that those NMIs are being generated by RAM ECC
errors.
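
To compare the two crashes field by field, a rough Python sketch along
these lines could pull the values out of the "Async registers" messages.
The regular expression and the names parse_async_line and SAMPLE are
made up for illustration and are written only against the lines quoted
above; the sketch does not try to interpret what the individual flags
mean.

```python
import re

# Pattern for the "Async registers" lines exactly as they appear in the
# quoted log; other kernel versions may format them differently (assumption).
ASYNC_RE = re.compile(
    r"Async registers \(mid (?P<mid>\d+)\): "
    r"afsr=(?P<afsr>[0-9a-fA-F]+)<(?P<flags>[^>]*)>; "
    r"afva=(?P<afva>0x[0-9a-fA-F]+)"
)


def parse_async_line(line):
    """Return a dict of fields from one log line, or None if it does not match."""
    m = ASYNC_RE.search(line)
    if m is None:
        return None
    return {
        "mid": int(m.group("mid")),            # module ID printed by the kernel
        "afsr": int(m.group("afsr"), 16),      # raw register value, hex in the log
        "flags": m.group("flags").split(","),  # flag names as printed, not decoded
        "afva": int(m.group("afva"), 16),
    }


# Lines copied from the second crash in the quoted message.
SAMPLE = [
    "Jan 17 03:15:01 surya /netbsd: Async registers (mid 9): afsr=3c00<SE,UC,TO,BE,AFA=0>; afva=0x00",
    "Jan 17 03:15:01 surya /netbsd: Async registers (mid 8): afsr=3c00<SE,UC,TO,BE,AFA=0>; afva=0x00",
    "Jan 17 03:15:01 surya /netbsd: cpu0: NMI: system interrupts: 10080000<VME=0,SBUS=0,T,M>",
]

# Keep only the lines that matched; the NMI line is skipped.
records = [r for r in map(parse_async_line, SAMPLE) if r is not None]
for r in records:
    print(r["mid"], hex(r["afsr"]), r["flags"], hex(r["afva"]))
```

Running the same function over /var/log/messages after each crash would
show whether the same module IDs and flags keep turning up.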

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 years of experience will always make the difference