Subject: memory or CPU bad on SS20?
To: None <port-sparc@netbsd.org>
From: D G Teed <donald.teed@gmail.com>
List: port-sparc
Date: 09/09/2007 20:15:11
Hello,

As I mentioned in a previous email, I get core dumps from apcupsd
occasionally.  I now suspect there is a hardware or possibly
a kernel issue.

I've seen this error a couple of times in dmesg in the last week:

module0:
        mxcc error 0x0
        mxcc status 0xff1410002
        mxcc reset 0x0
module1:
        mxcc error 0xb30010014fc8080
        mxcc status 0xff1402000
        mxcc reset 0x0

Once, this happened during a tar/gunzip in a package make.
Today it happened during a run of memtester
with an argument of 16.

memtester identified some errors a few minutes after that:

FAILURE: 0xa47b0704 != 0xa47b0504 at offset 0x00021c23.
FAILURE: 0x03ea6504 != 0x03ea6704 at offset 0x00031423.
FAILURE: 0x7d866504 != 0x7d866704 at offset 0x00039423.
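
For what it's worth, the two values in each of those three failures
seem to differ in the same single bit.  Here is a rough Python sketch
of that comparison; the values are copied from the output above, and
the helper name differing_bits is only made up for illustration:

```python
# (left value, right value, offset) copied from the memtester FAILURE lines.
failures = [
    (0xa47b0704, 0xa47b0504, 0x00021c23),
    (0x03ea6504, 0x03ea6704, 0x00031423),
    (0x7d866504, 0x7d866704, 0x00039423),
]

def differing_bits(a, b):
    """Return the bit positions (0 = least significant) where a and b differ."""
    diff = a ^ b
    return [i for i in range(32) if diff & (1 << i)]

for a, b, offset in failures:
    # XOR leaves a 1 only in the positions where the two values disagree.
    print(f"offset 0x{offset:08x}: xor=0x{a ^ b:08x} bits={differing_bits(a, b)}")
```

For all three lines the XOR is 0x00000200, i.e. only bit 9 differs.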

A little later in the console, after further error-free progress
from memtester, there was a kernel panic:

store buffer copy-back failure at 0x65028. Retrying...
data fault: pc=0xf02b66d0 addr=0xf0d7308c sfsr=5336<PERR=2,UC,LVL=3,AT=1,FT=5,F>
panic: kernel fault
syncing disks...

If I run test /memory from the OBP, there are no errors.
If I set diag-switch? to true in the boot PROM (setenv diag-switch? true),
I don't see any errors from memory or other devices
on the serial console.  It always seems to boot up fine, too.

Is it likely I have a faulty CPU module?  My CPUs are identified as
follows at boot:

cpu0 at mainbus0: mid 8: TMS390Z50 v0 or TMS390Z55 @ 75 MHz, on-chip FPU
cpu0: physical 20K instruction (64 b/l), 16K data (32 b/l), 1024K external (32 d
cpu1 at mainbus0: mid 10: TMS390Z50 v0 or TMS390Z55 @ 75 MHz, on-chip FPU
cpu1: physical 20K instruction (64 b/l), 16K data (32 b/l), 1024K external (32 d
obio0 at mainbus0

I have a couple of others from a spare box, but they are likely
different models.

Are there suggestions on what the next step is to determine where the
problem originates?

--Donald