Subject: NetBSD 1.6.1 crashes
To: None <port-macppc@netbsd.org>
From: Donald Lee <MacPPC@caution.icompute.com>
List: port-macppc
Date: 12/19/2003 00:54:42
Still working on bringing up my new server on a beige G3 with OF 2.4.

My NetBSD 1.5.2 system is just a rock.  No crashes or anomalies to speak of.
Unfortunately, my Beige G3 (OF 2.4) has been crashing on me now and then.

It appears to fall over most frequently under heavy ATA disk load.
I have one ATA drive on the internal bus, and one drive on a PCI card
(a Sonnet dual ATA card - I forget the exact model).  It comes up in the log as:

	function 0: Promise Ultra1 33/ATA Bus Master IDE Accelerator (rev. 0x02)

Three times, I've seen a similar crash.  Unfortunately, I've gotten a
traceback from the debugger only once.

Console output this time around:

	Dec 16 19:57:42 grace afpd[19602]: bad function 7A
	Dec 16 20:00:12 grace afpd[19602]: afp_alarm: child timed out
	trap type 700 at 258dc0
	Press a key to panic.
	tlp0: receive ring overrun

(hit return)

	tlp0: receive ring overrun
	panic: trap
	Stopped in pid 21055 (find) at  cpu_Debugger+0x10:     lwz   r0, r1, 0x14
	db>

(hang; no response to the keyboard)

From the logs, I can see that it was running daily scripts when this happened.

On October 12th, I got a similar crash and the info/traceback was:

	trap type 700 at 229160
	Press a key to panic.

When I hit a key, I got:

	tlp0: receive ring overrun
	panic: trap
	Stopped in pid 15618 (ksh) at  cpu_Debugger+0x10:   lwz   r0, r1, 0x14

followed by:

	db> tr
	0xdd4aacb0: at panic+174
	0xdd4aad70: at trap+908
	0xdd4aade0: kernel PGM trap by chgproccnt+64: srr1=0x98032
		r1=0xddaae90 cr=0x8000d032 xer=0 ctr=0
	  <???>  : at sys_wait4+258
	0xdd4aae90: at 0x7e98024
	0xdd4aaeb0: at sys_wait4+258
	0xdd4aaee0: at trap+610
	0xdd4aaf50: user SC trap by 0x1832250: srr1=0xd032
		r1=0x7fffe260 cr=0x29008088 xer=0 ctr=0x1825268
	db> 

(This is transcribed by hand, so it may contain some errors.)

This also appears to have happened at about 3:15 AM, while the daily script
was running.

Has anyone seen anything like this?  I was thinking it was flaky HW, but
there is some consistency here, even though the time between crashes is
pretty long.

I want to go production with this, but I can't do that while it's
misbehaving.

Does anyone have suggestions for how I can capture more information if/when
it crashes, and/or stress it in a way that is likely to generate crashes?

I'm going to see if I can get daily to run repeatedly for a few days....
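
In case it helps, here's roughly what I have in mind - an untested sketch
that loops the daily run while keeping both ATA buses busy with reads.
The device names (rwd0c for the internal drive, rwd1c for the one on the
Promise card) are guesses for my setup; adjust as needed:

	#!/bin/sh
	# Crude stress loop: run the daily script back to back while both
	# ATA drives are being read.  Device names are assumptions for my
	# two drives (internal bus and the Promise/Sonnet card).
	while true; do
		dd if=/dev/rwd0c of=/dev/null bs=64k count=20000 &
		dd if=/dev/rwd1c of=/dev/null bs=64k count=20000 &
		sh /etc/daily > /dev/null 2>&1
		wait	# let the dd readers finish before the next pass
	done

If the crashes really are tied to heavy ATA load during the daily run,
this ought to shake one loose a lot faster than once every couple of months.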

Thanks,

-dgl-