Subject: Re: What does this error mean?
To: None <kpneal@pobox.COM>
From: Aaron Brown <abrown@eecs.harvard.edu>
List: port-sparc
Date: 01/30/1997 23:57:07
[sorry for any dupes...forgot to cc correctly]

> SPARCstation 10 running 1.2:
> 
> ERROR: got NMI with sfsr=0x0, sfva=0xf201c, afsr=0x0, afaddr=0x0. Retrying...

This is bad. It could mean the memory is toast; in any case it is a
non-maskable interrupt, and it should never show up during normal
hardware operation.

> Also, what's causing random processes to core dump with various signals
> (4,10,11) to name a few? 

I'll ask my usual question: does it have an L2 (external) cache? (Check
dmesg.) 1.2 (and -current) can't handle the cache yet on machines with
SuperSPARCs and no L2 cache.

Does it have more than one CPU?

This may also be the result of kernel stack overflow during autoconfig. Do
you have any unusual devices (serial port cards, scsi or ethernet cards)
in the system?

> Why can't I compile a kernel? repeat 100 make just gives me a steady stream
> of compiles, then core dumps. Right now it's hung on one of the nfs files,
> it can't compile it because it gets core dumps.
> 
> Oh look, make clean just won't work. More core dumps.
> 
> And my networking is broken as well. The ethernet just won't work. If I
> netboot the machine, it works fine. If, while netbooted, I fsck, then the
> ethernet goes straight to hell (spewing errors with just about every keypress).
> If I boot from the disk then ethernet is totally nonfunctional (errors).

This I've never seen. Memory errors from le0, or errors from ledma0,
usually mean the hardware's broken or unsupported.

> A little while ago ppp just up and quit. Rebooting doesn't help. 

Are your shared libs hosed? That would certainly cause coredumps. Try
reinstalling them.

> What causes a le0: memory error?

It's just what's reported by the kernel when the le interface sets the
"memory error" status bit; I don't know any more than that. The interface
could be bad...
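
For reference, here is a rough sketch of what that check amounts to. It
assumes the le hardware is an AMD Am7990 (LANCE) and that the CSR0 bit
values below match its data sheet. The macro and function names are
illustrative only; this is not the driver's actual code.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* CSR0 error bits, assumed from the AMD Am7990 (LANCE) data sheet. */
#define LE_C0_ERR   0x8000  /* summary: set if any error bit below is set */
#define LE_C0_BABL  0x4000  /* transmitter babble (frame too long) */
#define LE_C0_CERR  0x2000  /* collision error (heartbeat test failed) */
#define LE_C0_MISS  0x1000  /* missed packet (no receive buffer free) */
#define LE_C0_MERR  0x0800  /* memory error (chip's bus access timed out) */

/*
 * Write a comma-separated list of the error bits set in csr0 into buf
 * (which must hold at least one byte) and return buf.
 */
static char *le_csr0_errors(uint16_t csr0, char *buf, size_t len)
{
    static const struct { uint16_t bit; const char *name; } bits[] = {
        { LE_C0_ERR,  "ERR"  },
        { LE_C0_BABL, "BABL" },
        { LE_C0_CERR, "CERR" },
        { LE_C0_MISS, "MISS" },
        { LE_C0_MERR, "MERR" },
    };
    size_t i;

    buf[0] = '\0';
    for (i = 0; i < sizeof(bits) / sizeof(bits[0]); i++) {
        if (csr0 & bits[i].bit) {
            if (buf[0] != '\0')
                strncat(buf, ",", len - strlen(buf) - 1);
            strncat(buf, bits[i].name, len - strlen(buf) - 1);
        }
    }
    return buf;
}
```

A status word with MERR set is what would show up as "le0: memory error".
The bit only says that the chip failed to complete an access to its
buffer memory; it doesn't say whether the chip, the DMA path, or the
memory itself is at fault.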

> I just got the machine for Christmas. It ran SunOS 4 with no problems, and
> Solaris was fine enough to throw NetBSD onto the disk. Then again, that was
> running from the other disk. Could there be a problem with this disk? Then
> again, I haven't started swapping yet. 

I've had problems with one internal Sun 1G disk on which files randomly
appeared, disappeared, and became corrupt, although I don't think
this would explain the other problems.

> Can somebody at least tell me if my machine has a physical problem, or is
> 1.2 for the 4m just unstable? It passes all of it's self tests.

It sounds like a physical problem. This is the first time I've heard of
any of these problems on an SS10 (unless it is lacking the external
ecache). Could you post the dmesg output?

1.2 is rock solid on my SS20; it's been up for about 100 days without
any problems.

--Aaron