Subject: Fwd: NetBSD 1.6.1 crashes - solved
To: None <port-macppc@netbsd.org>
From: Donald Lee <MacPPC@caution.icompute.com>
List: port-macppc
Date: 12/21/2003 18:17:06
Good news for you all, bad news for me.

My problem appears to have been HW.  The Sonnet 500 MHz G3 upgrade is
not as reliable in the Beige G3 as the original 266 MHz CPU.

Once I was able to reproduce the problem, it was pretty quick to figure
out what was causing it.  No interesting kernel debugging required.
One clue, though, was that there was little consistency in precisely
where in the kernel it was crashing.

-dgl-

At 6:10 PM -0600 12/21/03, Donald Lee wrote:
>I may have found a way to reproduce this problem.
>
>I'm sitting at a kernel debugger prompt.  Can anyone tell me where to
>go? ;->  I'm not handy with this debugger.  OTOH, if there is
>some documentation and/or a tutorial on how to puzzle out
>debugging a kernel, I am literate.
>
>Anyone?
>
>-dgl-
>
>>Delivered-To: port-macppc@netbsd.org
>>Mime-Version: 1.0
>>X-Sender: MacPPC@caution.icompute.com@mailhost.icompute.com
>>Date: Fri, 19 Dec 2003 00:54:42 -0600
>>To: port-macppc@netbsd.org
>>From: Donald Lee <MacPPC@caution.icompute.com>
>>Subject: NetBSD 1.6.1 crashes
>>Sender: port-macppc-owner@NetBSD.org
>>Precedence: list
>>
>>Still working on bringing up my new server on a beige G3 with OF 2.4.
>>
>>My NetBSD 1.5.2 system is just a rock.  No crashes or anomalies to speak of.
>>Unfortunately, my Beige G3 (OF 2.4) has been crashing on me now and then.
>>
>>It appears to fall over most frequently under heavy ATA disk load.
>>I have one ATA drive on the internal bus, and one drive on a PCI card
>>(Sonnet dual ATA - I forget the name).  It comes up in the log as:
>>
>>	function 0: Promise Ultra1 33/ATA Bus Master IDE Accelerator (rev. 0x02)
>>
>>Three times, I've seen a similar crash.  Unfortunately, I've gotten a
>>traceback from the debugger only once.
>>
>>Console output this time around:
>>
>>	Dec 16 19:57:42 grace afpd[19602]: bad function 7A
>>	Dec 16 20:00:12 grace afpd[19602]: afp_alarm: child timed out
>>	trap type 700 at 258dc0
>>	Press a key to panic.
>>	tlp0: receive ring overrun
>>
>>(hit return)
>>
>>	tlp0: receive ring overrun
>>	panic: trap
>>	Stopped in pid 21055 (find) at  cpu_Debugger+0x10:     lwz   r0, r1, 0x14
>>	db>
>>
>>(hang.. no response to KB)
>>
>>From the logs, I can see that it was running daily scripts when this happened.
>>
>>On October 12th, I got a similar crash and the info/traceback was:
>>
>>	trap type 700 at 229160
>>	Press a key to panic.
>>
>>When I hit a key, I got:
>>
>>	tlp0: receive ring overrun
>>	panic: trap
>>	Stopped in pid 15618 (ksh) at  cpu_Debugger+0x10:   lwz   r0, r1, 0x14
>>
>>followed by:
>>
>>	db> tr
>>	0xdd4aacb0: at panic+174
>>	0xdd4aad70: at trap+908
>>	0xdd4aade0: kernel PGM trap by chgproccnt+64: srr1=0x98032
>>		r1=0xddaae90 cr=0x8000d032 xer=0 ctr=0
>>	  <???>  : at sys_wait4+258
>>	0xdd4aae90: at 0x7e98024
>>	0xdd4aaeb0: at sys_wait4+258
>>	0xdd4aaee0: at trap+610
>>	0xdd4aaf50: user SC trap by 0x1832250: srr1=0xd032
>>		r1=0x7fffe260 cr=0x29008088 xer=0 ctr=0x1825268
>>	db> 
>>
>>(This is transcribed by hand, so it may contain some errors.)
>>
>>This also appears to have happened at about 3:15 AM while the daily script
>>was running.
>>
>>Has anyone seen anything like this?  I was thinking it was flaky HW, but
>>there is some consistency here, even though the time between crashes is
>>pretty long.
>>
>>I want to go production with this, but I can't do that while it's
>>misbehaving.
>>
>>Anyone have suggestions of how I can capture more information if/when
>>it crashes, and/or stress it in a way that is likely to generate crashes?
>>
>>I'm going to see if I can get daily to run repeatedly for a few days....
>>
>>Thanks,
>>
>>-dgl-
>>
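
On the question in the original message of how to stress the machine so
that a crash is more likely: one possible approach is a loop that writes
data, reads it back, and compares checksums.  What follows is only a
rough sketch and not what was actually used to reproduce the problem
here.  The function names (write_and_verify, stress), the file names,
and the sizes are all made up for illustration, and it assumes a Python
3 interpreter is available.  Note that the read-back may be served from
the buffer cache rather than the disk, so a mismatch could point at CPU
or memory as easily as at the ATA path.

```python
import hashlib
import os
import tempfile


def write_and_verify(path, total_bytes, chunk_bytes=64 * 1024):
    """Write random data to path, read it back, and compare SHA-256 digests.

    Hypothetical helper for illustration; returns True if the digests match.
    """
    written = hashlib.sha256()
    remaining = total_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            block = os.urandom(min(chunk_bytes, remaining))
            written.update(block)
            f.write(block)
            remaining -= len(block)
        # Ask the OS to push the data out to the device before reading back.
        f.flush()
        os.fsync(f.fileno())

    read_back = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_bytes), b""):
            read_back.update(block)
    return written.digest() == read_back.digest()


def stress(directory, passes, total_bytes):
    """Run several write/verify passes; return how many passes mismatched."""
    mismatches = 0
    for i in range(passes):
        path = os.path.join(directory, "stress-%d.bin" % i)
        try:
            if not write_and_verify(path, total_bytes):
                mismatches += 1
        finally:
            # Remove the scratch file whether or not the pass succeeded.
            if os.path.exists(path):
                os.remove(path)
    return mismatches


if __name__ == "__main__":
    # Small sizes so the sketch finishes quickly; a real run would use a
    # directory on the suspect drive and far more data and passes.
    with tempfile.TemporaryDirectory() as scratch:
        count = stress(scratch, passes=4, total_bytes=1 << 20)
        print("mismatching passes:", count)
```

On healthy hardware the mismatch count should stay at zero.  A nonzero
count, or a panic partway through, would at least give a repeatable
trigger to work from.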