Subject: Urgent request for help
To: , <netbsd-help@netbsd.org>
From: Monroe Williams <monroe@criticalpath.com>
List: netbsd-help
Date: 09/12/2001 17:18:17
I'm running NetBSD-1.5.1-macppc on a server where availability is rather
important.  It's currently running on a 9600, and over the last couple of
years I have seen uptimes in excess of 140 days.

I recently upgraded the storage subsystem by adding an Adaptec 29160 and
paired 18G IBM LVD drives running mirrored under softraid.  (I was
previously using the internal SCSI bus for all drives, and was looking for
better I/O throughput.)  Since the upgrade, I've been having nightmarish
SCSI problems.  I've tried several different cables, even a different SCSI
card (an Adaptec 2940U2W), and it's still giving me fits.  The problems
manifest in several ways, including:

/netbsd: Timedout SCB handled by another timeout

and

/netbsd: sd1(ahc0:9:0): SCB 16 - timed out in Data-out phase, SEQADDR == 0x5d
/netbsd: SCSIRATE == 0x95
/netbsd: sd1(ahc0:9:0): BDR message in message buffer
/netbsd: sd1(ahc0:9:0): no longer in timeout, status = 0
/netbsd: sd1(ahc0:9:0): Unexpected busfree in Message-out phase
/netbsd: SEQADDR == 0x165
/netbsd: sd1(ahc0:9:0): parity error detected in Data-in phase. SEQADDR(0x165) SCSIRATE(0x95)
/netbsd: sd1(ahc0:9:0): parity error detected in Data-in phase. SEQADDR(0x166) SCSIRATE(0x95)
last message repeated 30 times
/netbsd: raid0: IO Error.  Marking /dev/sd1f as failed.
/netbsd: raid0: node (Wsd) returned fail, rolling forward
/netbsd: sd1(ahc0:9:0): Unexpected busfree in Data-out phase
/netbsd: SEQADDR == 0x165
/netbsd: sd0(ahc0:8:0): SCB 19 - timed out in Data-out phase, SEQADDR == 0x165
/netbsd: SCSIRATE == 0x95
/netbsd: sd0(ahc0:8:0): Other SCB Timeout
/netbsd: sd0(ahc0:8:0): SCB 1c - timed out in Data-out phase, SEQADDR == 0x165
/netbsd: SCSIRATE == 0x95
/netbsd: sd0(ahc0:8:0): Other SCB Timeout
/netbsd: sd1(ahc0:9:0): SCB 16 - timed out in Data-out phase, SEQADDR == 0x165
/netbsd: SCSIRATE == 0x95
/netbsd: sd1(ahc0:9:0): BDR message in message buffer
/netbsd: sd1(ahc0:9:0): SCB 16 - timed out in Data-out phase, SEQADDR == 0x165
/netbsd: SCSIRATE == 0x95
/netbsd: sd1(ahc0:9:0): no longer in timeout, status = 0
/netbsd: ahc0: Issued Channel A Bus Reset. 8 SCBs aborted
/netbsd: ahc0: target 8 using 16bit transfers
/netbsd: ahc0: target 8 synchronous at 20.0MHz, offset = 0x3f
/netbsd: ahc0: target 8 using 16bit transfers
/netbsd: ahc0: target 8 synchronous at 20.0MHz, offset = 0x3f
/netbsd: ahc0: target 9 using 16bit transfers
/netbsd: ahc0: target 9 synchronous at 20.0MHz, offset = 0x3f
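
To see which drive the errors cluster on, messages like the ones quoted
above can be tallied per device.  What follows is only a minimal Python
sketch: the function name tally_scsi_errors and the three category labels
are hypothetical, chosen for illustration, and are not part of any NetBSD
tool.  It matches on substrings taken from the quoted messages and ignores
any line that does not start with a "/netbsd: sdN(ahcN:T:L):" prefix.

```python
import re
from collections import Counter

# Matches lines such as "/netbsd: sd1(ahc0:9:0): parity error detected ..."
# and captures the disk name and the message text after the prefix.
DEVICE_LINE = re.compile(r"^/netbsd: (sd\d+)\(ahc\d+:\d+:\d+\): (.*)$")

# Substrings taken from the quoted messages, mapped to short labels.
# The labels themselves are illustrative assumptions.
CATEGORIES = {
    "timed out": "timeout",
    "parity error": "parity",
    "Unexpected busfree": "busfree",
}


def tally_scsi_errors(lines):
    """Count (device, category) pairs in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = DEVICE_LINE.match(line.strip())
        if match is None:
            # Wrapped continuations, raid0 and ahc0 lines are skipped.
            continue
        device, message = match.groups()
        for needle, label in CATEGORIES.items():
            if needle in message:
                counts[(device, label)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "/netbsd: sd1(ahc0:9:0): SCB 16 - timed out in Data-out phase, SEQADDR ==",
        "0x5d",
        "/netbsd: sd1(ahc0:9:0): parity error detected in Data-in phase.",
        "/netbsd: sd0(ahc0:8:0): Other SCB Timeout",
        "/netbsd: sd1(ahc0:9:0): Unexpected busfree in Message-out phase",
    ]
    for (device, label), n in sorted(tally_scsi_errors(sample).items()):
        print(device, label, n)
```

Note that syslog's "last message repeated N times" lines are not expanded
by this sketch, so repeated parity errors are undercounted.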


I've procured a G4 with which I intend to replace the 9600.  I put the 29160 and
two spare LVD drives in the new machine, and immediately started getting:

ahc0: Someone reset channel A

which is another symptom I've seen on the old machine.  It's much worse in
the G4, so bad that I can't even begin to get things done.

I need to fix this problem now.  The services running on this machine were
put there because it had negligible downtime, and I can't afford to have it
acting up.

Is this a problem that could be caused by bad cables?  I've tried several
different ones, and the problem seems to happen with all of them.

Could this be a problem with the 1.5.1 ahc driver?  Would a different brand
of SCSI card work better?

Please cc: monroe@criticalpath.com with any responses.  I'm on the
port-macppc list with my home address, but I really need to get responses as
soon as possible.

Thanks,
-- monroe
------------------------------------------------------------------------
Monroe Williams                                  monroe@criticalpath.com