Subject: weird SCSI (ahc) failures
To: NetBSD-current Discussion List <current-users@NetBSD.ORG>
From: Greg A. Woods <woods@weird.com>
List: current-users
Date: 09/09/2002 17:52:12

I have an Adaptec wide differential controller and a box-o-disks
connected to my development server and am using them to build a nice big
/home with RAIDframe.

A while ago raid0 presumably used up its hot spare (though that may not
be the case -- see my next posting) and I thought I had to start looking
for a replacement disk.

However, since 4.3GB wide differential disks are rather rare and I wasn't
getting any leads, I thought I'd better dig deeper to see what the real
problem might be.  Now I'm not so sure it is a disk problem.

Here's the config for the controller and disks in question:

ahc1 at pci0 dev 13 function 0
ahc1: interrupting at irq 15
ahc1: aic7880 Wide Channel A, SCSI Id=7, 16/255 SCBs
scsibus2 at ahc1: 16 targets, 8 luns per target
scsibus2: waiting 2 seconds for devices to settle...
sd6 at scsibus2 target 0 lun 0: <QUANTUM, XP34550WD, LYK8> SCSI2 0/direct fixed
sd6: 4341 MB, 5899 cyl, 10 head, 150 sec, 512 bytes/sect x 8890760 sectors
sd6: sync (100.0ns offset 8), 16-bit (20.000MB/s) transfers, tagged queueing
sd7 at scsibus2 target 1 lun 0: <QUANTUM, XP34550WD, LYK8> SCSI2 0/direct fixed
sd7: 4341 MB, 5899 cyl, 10 head, 150 sec, 512 bytes/sect x 8890760 sectors
sd7: sync (100.0ns offset 8), 16-bit (20.000MB/s) transfers, tagged queueing
sd8 at scsibus2 target 2 lun 0: <QUANTUM, XP34550WD, LXY4> SCSI2 0/direct fixed
sd8: 4341 MB, 5899 cyl, 10 head, 150 sec, 512 bytes/sect x 8890760 sectors
sd8: sync (100.0ns offset 8), 16-bit (20.000MB/s) transfers, tagged queueing
sd9 at scsibus2 target 3 lun 0: <QUANTUM, XP34550WD, LXY4> SCSI2 0/direct fixed
sd9: 4341 MB, 5899 cyl, 10 head, 150 sec, 512 bytes/sect x 8890760 sectors
sd9: sync (100.0ns offset 8), 16-bit (20.000MB/s) transfers, tagged queueing
sd10 at scsibus2 target 5 lun 0: <QUANTUM, XP34550WD, LXY4> SCSI2 0/direct fixed
sd10: 4341 MB, 5899 cyl, 10 head, 150 sec, 512 bytes/sect x 8890760 sectors
sd10: sync (100.0ns offset 8), 16-bit (20.000MB/s) transfers, tagged queueing
sd11 at scsibus2 target 6 lun 0: <QUANTUM, XP34550WD, LYK8> SCSI2 0/direct fixed
sd11: 4341 MB, 5899 cyl, 10 head, 150 sec, 512 bytes/sect x 8890760 sectors
sd11: sync (100.0ns offset 8), 16-bit (20.000MB/s) transfers, tagged queueing
sd12 at scsibus2 target 8 lun 0: <QUANTUM, XP34550WD, LXY4> SCSI2 0/direct fixed
sd12: 4341 MB, 5899 cyl, 10 head, 150 sec, 512 bytes/sect x 8890760 sectors
sd12: sync (100.0ns offset 8), 16-bit (20.000MB/s) transfers, tagged queueing
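
A quick sanity check of the numbers in those probe lines, as a small
Python sketch.  It assumes the capacity "MB" figure means 2^20 bytes and
the transfer-rate "MB/s" figure means 10^6 bytes per second; that is my
reading of the output, not something the messages themselves state.  The
variable names are made up for illustration.

```python
# Values copied from the sd6..sd12 probe lines above.
SECTORS = 8890760
BYTES_PER_SECTOR = 512

# Capacity: sector count times sector size.
capacity_bytes = SECTORS * BYTES_PER_SECTOR
# Assumption: the kernel's "MB" here is 2**20 bytes, rounded down.
capacity_mib = capacity_bytes // (1024 * 1024)

# Synchronous transfer rate: one transfer per sync period,
# each transfer carrying bus-width bits.
SYNC_PERIOD_NS = 100.0
BUS_WIDTH_BITS = 16
transfers_per_sec = 1e9 / SYNC_PERIOD_NS
# Assumption: "MB/s" here is 10**6 bytes per second.
rate_mb_per_sec = transfers_per_sec * (BUS_WIDTH_BITS // 8) / 1e6

if __name__ == "__main__":
    print(capacity_bytes, capacity_mib, rate_mb_per_sec)
```

Under those assumptions the results agree with the reported 4341 MB and
20.000MB/s, so the drives appear to have negotiated what one would expect
of a 10 MHz wide bus.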


Here's the first error:

Aug 12 05:39:06 proven /netbsd: sd11(ahc1:0:6:0): Unexpected busfree in Data-in phase
Aug 12 05:39:06 proven /netbsd: SEQADDR == 0x112
Aug 12 05:39:07 proven /netbsd: sd11(ahc1:0:6:0): invalid return code from adapter: 3
Aug 12 05:39:07 proven /netbsd: raid0: IO Error.  Marking /dev/sd11d as failed.
Aug 12 05:39:08 proven /netbsd: raid0: node (R  ) returned fail, rolling backward
Aug 12 05:39:08 proven /netbsd: raid0: DAG failure: r addr 0x1a2d4af (27448495) nblk 0x10 (16) buf 0xc5aa2000


Then more on another disk:

Aug 19 06:49:44 proven /netbsd: sd12(ahc1:0:8:0): Unexpected busfree in Message-in phase
Aug 19 06:49:44 proven /netbsd: SEQADDR == 0xbb
Aug 19 06:49:45 proven /netbsd: sd12(ahc1:0:8:0): invalid return code from adapter: 3
Aug 20 06:48:41 proven /netbsd: sd12(ahc1:0:8:0): Unexpected busfree in Message-in phase
Aug 20 06:48:41 proven /netbsd: SEQADDR == 0x155
Aug 20 06:48:41 proven /netbsd: sd12(ahc1:0:8:0): invalid return code from adapter: 3
Aug 21 06:48:58 proven /netbsd: sd12(ahc1:0:8:0): Unexpected busfree in Command phase
Aug 21 06:48:58 proven /netbsd: SEQADDR == 0xbb
Aug 21 06:48:58 proven /netbsd: sd12(ahc1:0:8:0): invalid return code from adapter: 3
Aug 26 06:48:57 proven /netbsd: sd12(ahc1:0:8:0): Unexpected busfree in Message-out phase
Aug 26 06:48:57 proven /netbsd: SEQADDR == 0xbb
Aug 26 06:48:58 proven /netbsd: sd12(ahc1:0:8:0): invalid return code from adapter: 3
Aug 27 06:48:46 proven /netbsd: sd12(ahc1:0:8:0): Unexpected busfree in Message-in phase
Aug 27 06:48:46 proven /netbsd: SEQADDR == 0xbb
Aug 27 06:48:46 proven /netbsd: sd12(ahc1:0:8:0): invalid return code from adapter: 3
Aug 28 06:48:42 proven /netbsd: sd12(ahc1:0:8:0): Unexpected busfree in Command phase
Aug 28 06:48:43 proven /netbsd: SEQADDR == 0x155
Aug 28 06:48:43 proven /netbsd: sd12(ahc1:0:8:0): invalid return code from adapter: 3
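
To see the pattern at a glance, the busfree messages can be tallied per
disk and per SCSI phase.  The following is only a rough Python sketch:
the regular expression is written against the exact message format
quoted above, the function name tally_busfree and the SAMPLE list are
made up for illustration, and other ahc driver versions may word the
message differently.

```python
import re
from collections import Counter

# Matches e.g. "sd12(ahc1:0:8:0): Unexpected busfree in Message-in phase".
# Written against the log format quoted in this message only.
BUSFREE = re.compile(
    r"(sd\d+)\(ahc\d+:\d+:\d+:\d+\): Unexpected busfree in (\S+) phase"
)

def tally_busfree(lines):
    """Count busfree events per device and per (device, phase) pair."""
    by_dev = Counter()
    by_phase = Counter()
    for line in lines:
        m = BUSFREE.search(line)
        if m is None:
            continue  # SEQADDR, return-code and RAIDframe lines are skipped
        dev, phase = m.group(1), m.group(2)
        by_dev[dev] += 1
        by_phase[(dev, phase)] += 1
    return by_dev, by_phase

# The busfree lines from the excerpts above, plus one line that should not match.
SAMPLE = [
    "Aug 12 05:39:06 proven /netbsd: sd11(ahc1:0:6:0): Unexpected busfree in Data-in phase",
    "Aug 12 05:39:06 proven /netbsd: SEQADDR == 0x112",
    "Aug 19 06:49:44 proven /netbsd: sd12(ahc1:0:8:0): Unexpected busfree in Message-in phase",
    "Aug 20 06:48:41 proven /netbsd: sd12(ahc1:0:8:0): Unexpected busfree in Message-in phase",
    "Aug 21 06:48:58 proven /netbsd: sd12(ahc1:0:8:0): Unexpected busfree in Command phase",
    "Aug 26 06:48:57 proven /netbsd: sd12(ahc1:0:8:0): Unexpected busfree in Message-out phase",
    "Aug 27 06:48:46 proven /netbsd: sd12(ahc1:0:8:0): Unexpected busfree in Message-in phase",
    "Aug 28 06:48:42 proven /netbsd: sd12(ahc1:0:8:0): Unexpected busfree in Command phase",
]

if __name__ == "__main__":
    by_dev, by_phase = tally_busfree(SAMPLE)
    for dev, count in sorted(by_dev.items()):
        print(dev, count)
    for (dev, phase), count in sorted(by_phase.items()):
        print(dev, phase, count)
```

On the lines quoted above this counts one event on sd11 and six on sd12,
spread over the Message-in, Command and Message-out phases.  In practice
one would feed it the lines of /var/log/messages instead of SAMPLE.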


Oddly, though, I haven't noticed any further problems, and the errors on
sd12 didn't seem to cause any RAIDframe failures either.

Later tonight I'll try rebooting and re-activating sd11 -- this is NetBSD
1.5W, and it is possible that bits of kernel storage have been
corrupted....

-- 
								Greg A. Woods

+1 416 218-0098;            <g.a.woods@ieee.org>;           <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>