Subject: Fwd: NetBSD 1.6.1 crashes
To: None <port-macppc@netbsd.org>
From: Donald Lee <MacPPC@caution.icompute.com>
List: port-macppc
Date: 12/19/2003 20:35:08
I may have found a way to reproduce this problem.

I'm sitting at a kernel debugger prompt.  Can anyone tell me where to
go? ;->  I'm not handy with this debugger.  OTOH, if there is
some documentation and/or a tutorial on how to puzzle out
debugging a kernel, I am literate.

Anyone?

-dgl-

>Delivered-To: port-macppc@netbsd.org
>Mime-Version: 1.0
>X-Sender: MacPPC@caution.icompute.com@mailhost.icompute.com
>Date: Fri, 19 Dec 2003 00:54:42 -0600
>To: port-macppc@netbsd.org
>From: Donald Lee <MacPPC@caution.icompute.com>
>Subject: NetBSD 1.6.1 crashes
>Sender: port-macppc-owner@NetBSD.org
>Precedence: list
>
>Still working on bringing up my new server on a beige G3 with OF 2.4.
>
>My NetBSD 1.5.2 system is just a rock.  No crashes or anomalies to speak of.
>Unfortunately, my Beige G3 (OF 2.4) has been crashing on me now and then.
>
>It appears to fall over most frequently under heavy ATA disk load.
>I have one ATA drive on the internal bus, and one drive on a PCI card
>(a Sonnet dual ATA; I forget the name).  It comes up in the log as:
>
>	function 0: Promise Ultra1 33/ATA Bus Master IDE Accelerator (rev. 0x02)
>
>Three times, I've seen a similar crash.  Unfortunately, I've gotten a
>traceback from the debugger only once.
>
>Console output this time around:
>
>	Dec 16 19:57:42 grace afpd[19602]: bad function 7A
>	Dec 16 20:00:12 grace afpd[19602]: afp_alarm: child timed out
>	trap type 700 at 258dc0
>	Press a key to panic.
>	tlp0: receive ring overrun
>
>(hit return)
>
>	tlp0: receive ring overrun
>	panic: trap
>	Stopped in pid 21055 (find) at  cpu_Debugger+0x10:     lwz   r0, r1, 0x14
>	db>
>
>(hang; no response to the keyboard)
>
>From the logs, I can see that it was running daily scripts when this happened.
>
>On October 12th, I got a similar crash and the info/traceback was:
>
>	trap type 700 at 229160
>	Press a key to panic.
>
>When I hit a key, I got:
>
>	tlp0: receive ring overrun
>	panic: trap
>	Stopped in pid 15618 (ksh) at  cpu_Debugger+0x10:   lwz   r0, r1, 0x14
>
>followed by:
>
>	db> tr
>	0xdd4aacb0: at panic+174
>	0xdd4aad70: at trap+908
>	0xdd4aade0: kernel PGM trap by chgproccnt+64: srr1=0x98032
>		r1=0xddaae90 cr=0x8000d032 xer=0 ctr=0
>	  <???>  : at sys_wait4+258
>	0xdd4aae90: at 0x7e98024
>	0xdd4aaeb0: at sys_wait4+258
>	0xdd4aaee0: at trap+610
>	0xdd4aaf50: user SC trap by 0x1832250: srr1=0xd032
>		r1=0x7fffe260 cr=0x29008088 xer=0 ctr=0x1825268
>	db> 
>
>(This is transcribed by hand, so it may contain some errors.)
>
>This also appears to have happened at about 3:15 AM while the daily script
>was running.
>
>Has anyone seen anything like this?  I was thinking it was flaky HW, but
>there is some consistency here, even though the time between crashes is
>pretty long.
>
>I want to go production with this, but I can't do that while it's
>misbehaving.
>
>Anyone have suggestions of how I can capture more information if/when
>it crashes, and/or stress it in a way that is likely to generate crashes?
>
>I'm going to see if I can get daily to run repeatedly for a few days....
>
>Thanks,
>
>-dgl-
>
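
On the question of stressing the machine: a minimal sketch of one way to
generate sustained disk load on both ATA drives, rather than waiting for the
daily script to trigger a crash.  This assumes a Python interpreter is
installed on the box, which the message above does not say.  The script name
(stress.py), the function names, and the default sizes are all hypothetical.
It writes a scratch file of random data into each directory given, reads it
back, and compares checksums.  The read may be served from the buffer cache
instead of the disk, so treat it as a load generator more than a media check.

```python
import hashlib
import os
import sys
import tempfile


def stress_once(directory, size_bytes=64 * 1024 * 1024, chunk_bytes=1024 * 1024):
    """Write size_bytes of random data to a scratch file in directory,
    read it back, and return True if the checksums match."""
    written = hashlib.sha256()
    fd, path = tempfile.mkstemp(dir=directory, prefix="stress-")
    try:
        with os.fdopen(fd, "wb") as out:
            remaining = size_bytes
            while remaining > 0:
                block = os.urandom(min(chunk_bytes, remaining))
                out.write(block)
                written.update(block)
                remaining -= len(block)
            out.flush()
            os.fsync(out.fileno())  # push the data out to the drive
        read_back = hashlib.sha256()
        with open(path, "rb") as src:
            while True:
                block = src.read(chunk_bytes)
                if not block:
                    break
                read_back.update(block)
        return written.hexdigest() == read_back.hexdigest()
    finally:
        os.remove(path)  # always clean up the scratch file


def stress(directories, passes, size_bytes=64 * 1024 * 1024):
    """Run `passes` rounds over every directory; return the mismatch count."""
    mismatches = 0
    for n in range(passes):
        for d in directories:
            if not stress_once(d, size_bytes=size_bytes):
                mismatches += 1
                print("pass %d: data mismatch in %s" % (n, d))
    return mismatches


if __name__ == "__main__":
    # Usage: python stress.py PASSES DIR [DIR ...]
    sys.exit(1 if stress(sys.argv[2:], int(sys.argv[1])) else 0)
```

Giving it one directory on the internal drive and one on the drive behind the
PCI card keeps both ATA buses busy in the same run.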