port-sparc: Re: What does this error mean?

Subject: Re: What does this error mean?
To: Aaron Brown <abrown@eecs.harvard.edu>
From: Kevin P. Neal <kpneal@pobox.com>
List: port-sparc
Date: 01/31/1997 00:57:10

At 11:55 PM 1/30/97 -0500, Aaron Brown wrote:
>> SPARCstation 10 running 1.2:
>> 
>> ERROR: got NMI with sfsr=0x0, sfva=0xf201c, afsr=0x0, afaddr=0x0. Retrying...
>
>This is bad. It could mean the memory is toast; in any case it is a 
>non-maskable interrupt, and never ever should show up during normal 
>hardware operation.

Could? The memory passes the system self tests. Can you be sure of this?

Right now I have 3 16MB SIMMs. If I swap them around, I get no change in
problems. If I pull some out, the frequency of core dumps increases.

This particular NMI happened with 1 SIMM.

Bad system board?
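
For anyone comparing notes, here is a minimal sketch that pulls the register
values out of the NMI line quoted above so they can be logged or diffed across
occurrences. It only parses the text; it does not try to interpret the bits,
and the helper name parse_nmi is made up for illustration.

```python
import re

# The console line exactly as reported above.
NMI_LINE = ("ERROR: got NMI with sfsr=0x0, sfva=0xf201c, "
            "afsr=0x0, afaddr=0x0. Retrying...")

def parse_nmi(line):
    """Return the name=0x... fields of an NMI console line as a dict of ints.

    Hypothetical helper: it assumes every field looks like name=0xHEX,
    which is true of the one line shown here.
    """
    pairs = re.findall(r"(\w+)=0x([0-9a-fA-F]+)", line)
    return {name: int(value, 16) for name, value in pairs}

regs = parse_nmi(NMI_LINE)
for name, value in regs.items():
    print(f"{name:7s} = {value:#x}")
```

In this particular report both status registers (sfsr and afsr) read zero, and
only sfva carries a non-zero value.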

>> Also, what's causing random processes to core dump with various signals
>> (4,10,11) to name a few? 
>
>I'll ask my usual question: does it have an L2 (external) cache (check
>dmesg)? 1.2 (and -current) can't handle the cache yet on machines with
>SuperSPARCs and no L2 cache.

On power up:

CPU_#0   TI, TMS390Z55(3.0)  1Mb External cache

CPU_#1   ***** NOT installed *****
CPU_#2   same
CPU_#3   same

The banner contains:
SPARCstation 10 (1 X 390Z55)
ROM Rev 2.25

>Does it have more than one CPU?

Nope.

>This may also be the result of kernel stack overflow during autoconfig. Do
>you have any unusual devices (serial port cards, scsi or ethernet cards)
>in the system?

None.

>> Why can't I compile a kernel? repeat 100 make just gives me a steady stream
>> of compiles, then core dumps. Right now it's hung on one of the nfs files,
>> it can't compile it because it gets core dumps.
>> 
>> Oh look, make clean just won't work. More core dumps.
>> 
>> And my networking is broken as well. The ethernet just won't work. If I
>> netboot the machine, it works fine. If, while netbooted, I fsck, then the
>> ethernet goes straight to hell (spewing errors with just about every
>> keypress).
>> If I boot from the disk then ethernet is totally nonfunctional (errors).
>
>This I've never seen. Memory errors from le0 or ledma0 usually mean
>the hardware's broken or unsupported.

That's not good.

>> A little while ago ppp just up and quit. Rebooting doesn't help. 
>
>Are your shared libs hosed? That would certainly cause coredumps. Try
>reinstalling them.

Will do....tomorrow.

>> What causes a le0: memory error?
>
>It's just what's reported by the kernel when the le interface sets the
>"memory error" status bit; I don't know any more than that. The interface
>could be bad...
>

*sigh*

>
>> Can somebody at least tell me if my machine has a physical problem, or is
>> 1.2 for the 4m just unstable? It passes all of its self tests.
>
>It sounds like a physical problem. This is the first time I've heard of
>any of these problems on an SS10 (unless it is lacking the external
>ecache). Could you post the dmesg output?

Copyright (c) 1982, 1986, 1989, 1991, 1993
	The Regents of the University of California.  All rights reserved.

NetBSD 1.2 (GENERIC) #4: Fri Sep 27 22:03:25 MET DST 1996
    pk@kwik:/usr/src1/sys/arch/sparc/compile/GENERIC
real mem = 49541120
avail mem = 44441600
using 604 buffers containing 2473984 bytes of memory
bootpath: /iommu@f,e0000000/sbus@f,e0001000/espdma@f,400000/esp@f,800000/sd@3,0
mainbus0 (root): SUNW,SPARCstation-10
cpu0 at mainbus0: TI,TMS390Z55 @ 50 MHz, on-chip FPU
cpu0: physical 20K instruction (64 b/l), 16K data (32 b/l), 1024K external (32 b/l) cache enabled
obio0 at mainbus0
clock0 at obio0 addr 0xf1200000: mk48t08 (eeprom)
timer0 at obio0 addr 0xf1300000 delay constant 23
auxreg0 at obio0 addr 0xf1800000
zs0 at obio0 addr 0xf1100000 pri 12, softpri 6
zs0a: console i/o
zs1 at obio0 addr 0xf1000000 pri 12, softpri 6
fdc0 at obio0 addr 0xf1700000 pri 11, softpri 4: chip 82077
fd0 at fdc0 drive 0: 1.44MB 80 cyl, 2 head, 18 sec
power0 at obio0 addr 0xf1a01000
iommu0 at mainbus0 ioaddr 0xe0000000: version 3/0, page-size 4096, range 64MB
sbus0 at iommu0: clock = 20 MHz
dma0 at sbus0 slot 15 offset 0x400000: rev 2
esp0 at dma0 slot 0xf offset 0x800000 pri 4: ESP200 40Mhz, target 7
scsibus0 at esp0
sd0 at scsibus0 targ 0 lun 0: <SEAGATE, ST32430N, 0510> SCSI2 0/direct fixed
sd0: 2049MB, 3992 cyl, 9 head, 116 sec, 512 bytes/sec
sd1 at scsibus0 targ 3 lun 0: <SEAGATE, ST31200N SUN1.05, 8722> SCSI2 0/direct fixed
sd1: 1006MB, 2700 cyl, 9 head, 84 sec, 512 bytes/sec
ledma0 at sbus0 slot 15 offset 0x400010: rev 2
le0 at ledma0 slot 0xf offset 0xc00000 pri 6: address 08:00:20:1d:21:e4
le0: 8 receive buffers, 2 transmit buffers
SUNW,bpp at sbus0 slot 15 offset 0x4800000 not configured
SUNW,DBRIe at sbus0 slot 15 offset 0x8010000 not configured
root on sd1a
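
A quick sanity check on the memory figures in that dmesg, as a sketch. The
three numbers are copied from the output above. The SIMM count and size come
from the description earlier in this message (3 x 16MB), assuming the dmesg was
taken with all three installed. The variable names are just for illustration.

```python
import re

# Lines copied from the dmesg output above.
DMESG = """\
real mem = 49541120
avail mem = 44441600
using 604 buffers containing 2473984 bytes of memory
"""

# Assumption: all three 16MB SIMMs were installed for this boot.
SIMM_COUNT = 3
SIMM_BYTES = 16 * 1024 * 1024

real = int(re.search(r"real mem = (\d+)", DMESG).group(1))
avail = int(re.search(r"avail mem = (\d+)", DMESG).group(1))
nbuf, bufbytes = map(
    int,
    re.search(r"using (\d+) buffers containing (\d+) bytes", DMESG).groups(),
)

installed = SIMM_COUNT * SIMM_BYTES
print("installed bytes:          ", installed)
print("installed minus real mem: ", installed - real)
print("real mem minus avail mem: ", real - avail)
print("bytes per buffer:         ", bufbytes // nbuf)
```

The kernel's "real mem" comes out 790528 bytes (772K) short of a full 48MB, so
it looks like all three SIMMs are at least being counted, and the buffer cache
works out to exactly 4096 bytes per buffer. None of that says whether the
memory is actually good, only that it is being seen.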

>1.2 is rock solid on my SS20; it's been up for about 100 days without
>any problems.

Ah.

Is anybody else less than impressed with the diagnostics? I mean, it passes its
self tests, yet still loses. 

Thanks for the assistance.
--
XCOMM Kevin P. Neal, Junior, Comp. Sci.    -   kpneal@pobox.com
XCOMM  House of Retrocomputing:            -   kpneal@eos.ncsu.edu
XCOMM     http://www.pobox.com/~kpn/       -   kevinneal@bix.com
XCOMM "Rebooting with command:" -- SPARCstation 10 boot prom