Subject: adaptec 2940 disaster
To: None <port-i386@NetBSD.ORG>
From: Carl Shapiro <samsara@panix.com>
List: port-i386
Date: 10/04/1997 05:24:03
I was doing a *lot* of disk I/O and thrashing my CPU when suddenly my 
console froze.  Oh no...

I rebooted and started to fsck by hand.  While fsck'ing / I got the following
error:

ahc1: ahc_scsi_cmd: more than 256 DMA segs

And then my console froze.  So I rebooted.  Bad stuff began to happen.
For some reason the 2940 couldn't do a mode sense with my Seagate
Barracuda (st1505n) and then claimed it had to use a "ficticious
geometry".  Then when the machine was trying to mount root on sd0a I
got the following errors:

sd0(ahc:1:1:0): timed out in datain phase, SCSISIGI == 0x47
sd0(ahc:1:1:0): asserted ATN - device reset in message buffer 
sd0(ahc:1:1:0): timed out in datain phase, SCSISIGI == 0xb6
ahc1: Issued Channel A bus reset #1, 1 SCB aborted

It reported that mounting root failed with error 79, and tried to mount
root again... more errors:

sd0(ahc:1:1:0): timed out in datain phase, SCSISIGI == 0xe6
sd0(ahc:1:1:0): asserted ATN - device reset in message buffer 
sd0(ahc:1:1:0): timed out in datain phase, SCSISIGI == 0xfb
ahc1: Issued Channel A bus reset #1, 1 SCB aborted
sd0(ahc:1:1:0): timed out in datain phase, SCSISIGI == 0x0
ahc1: Issued Channel A bus reset #2, 1 SCB aborted

I tried mounting root again... more errors like the above.  I tried
again and again and again and again.  I then gave up, powered down,
and waited.  This time the 2940 could sense the geometry of my drive,
and I was able to mount root.  Unfortunately I had lost /bin/sh (turns
out that I lost all of /bin) and I couldn't boot into single-user mode.

I threw a spare drive (a Fujitsu Pico 7 M1606S) on my SCSI chain and
booted NetBSD from that.  The machine booted flawlessly.  However,
almost all of the filesystems on my Barracuda are completely blown away
except for my tiny (16 meg) / which only lost /bin.  When I tried to 
fsck the /usr of the Barracuda (I used fsck -y because there were just
*so* many errors) I got the familiar: 

ahc1: ahc_scsi_cmd: more than 256 DMA segs

But the machine didn't hang.  Oh well.

Given that I have not touched anything inside of the ahc driver, what
could have caused this disaster?  How can I prevent such a thing from
happening again?  My system more or less looks like this:

	Intel PR440FX Motherboard
	Pentium Pro 200
	64 megabytes RAM
	on board Adaptec 2940UW (unused)
	Adaptec 2940
	on board Intel EtherExpress Pro 10/100B (also unused)
	
Are the bha or isp drivers more stable than the ahc on the i386?  I wouldn't
mind dropping $200 for a more reliable system.


Carl