Subject: Re: memory or CPU bad on SS20?
To: None <port-sparc@netbsd.org>
From: D G Teed <donald.teed@gmail.com>
List: port-sparc
Date: 09/17/2007 22:40:43
Well, just to clear the suspicions, I've swapped the OS drive
into another SS20 with dual CPUs and it runs solid.

Previously it was crashing with no clear pattern.
Must be a bad CPU module in the first system.
Both systems are similar, except the second has slightly
slower 60 MHz CPUs (both have cache).
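
For what it's worth, the three memtester FAILURE lines quoted below can be
compared bit by bit. The sketch here is only an illustration: the values are
copied from that output, and the helper name flipped_bits is made up for
this example. It XORs the two words on each line and lists the bit positions
that differ.

```python
# (left, right, offset) copied from the memtester FAILURE lines quoted below.
failures = [
    (0xa47b0704, 0xa47b0504, 0x00021c23),
    (0x03ea6504, 0x03ea6704, 0x00031423),
    (0x7d866504, 0x7d866704, 0x00039423),
]


def flipped_bits(left, right):
    """Return the bit positions (0 = least significant) where the words differ."""
    diff = left ^ right
    return [bit for bit in range(32) if (diff >> bit) & 1]


for left, right, offset in failures:
    bits = flipped_bits(left, right)
    print(f"offset {offset:#010x}: xor={left ^ right:#010x} differing bits={bits}")
```

Each pair differs only in bit 9 (XOR of 0x200). That looks like one
particular data bit misbehaving rather than random corruption, though on its
own it does not say whether the memory, the cache or the CPU module is at
fault.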

On 9/9/07, D G Teed <donald.teed@gmail.com> wrote:
> Hello,
>
> As a previous email mentioned, I get core dumps from apcupsd
> occasionally.  I now suspect there is a hardware or possibly
> a kernel issue.
>
> I've seen this error a couple of times in dmesg in the last week:
>
> module0:
>         mxcc error 0x0
>         mxcc status 0xff1410002
>         mxcc reset 0x0
> module1:
>         mxcc error 0xb30010014fc8080
>         mxcc status 0xff1402000
>         mxcc reset 0x0
>
> Once the above happened during a tar/gunzip of a package make.
> Today it happened during a run of memtester
> with an argument of 16.
>
> memtester identified some errors minutes after that:
>
> FAILURE: 0xa47b0704 != 0xa47b0504 at offset 0x00021c23.
> FAILURE: 0x03ea6504 != 0x03ea6704 at offset 0x00031423.
> FAILURE: 0x7d866504 != 0x7d866704 at offset 0x00039423.
>
> A little later in the console, after further error-free progress
> from memtester, there was a kernel panic:
>
> store buffer copy-back failure at 0x65028. Retrying...
> data fault: pc=0xf02b66d0 addr=0xf0d7308c sfsr=5336<PERR=2,UC,LVL=3,AT=1,FT=5,F>
> panic: kernel fault
> syncing disks...
>
> If I do test /memory from the OBP, there are no errors.
> If I setenv "diag-switch? true" in the boot PROM,
> I can't see any errors from memory or other devices
> while on the serial console.  It always seems to boot up fine too.
>
> Is it likely I have a faulty CPU module?  My CPUs are identified as
> follows on boot up:
>
> cpu0 at mainbus0: mid 8: TMS390Z50 v0 or TMS390Z55 @ 75 MHz, on-chip FPU
> cpu0: physical 20K instruction (64 b/l), 16K data (32 b/l), 1024K external (32 d
> cpu1 at mainbus0: mid 10: TMS390Z50 v0 or TMS390Z55 @ 75 MHz, on-chip FPU
> cpu1: physical 20K instruction (64 b/l), 16K data (32 b/l), 1024K external (32 d
> obio0 at mainbus0
>
> I have a couple of others from a spare box, but they are likely
> different models.
>
> Are there suggestions on what the next step is to determine where the
> problem originates?
>
> --Donald
>