Subject: Re: SDLT 320 Tape Drive on -current
To: None <current-users@netbsd.org>
From: Duncan McEwan <duncan@mcs.vuw.ac.nz>
List: current-users
Date: 09/09/2003 17:58:01
Around the end of June I posted a message describing problems we were having
with an SDLT 320GB tape drive on NetBSD-current (1.6U compiled around
mid-June). The errors were generally along the lines of:
st0(ahc0:0:6:0): Check Condition on CDB: 0x0a 00 00 80 00 00
    SENSE KEY: Media Error
    INFO FIELD: 145031168
    COMMAND INFO: 4953956 (0x4b9764)
    ASC/ASCQ: Write Error
Since both the tape drive and the (several) tapes were pretty much brand new,
I didn't think it was likely that the media error message given was correct, so
I asked this list for other possibilities.
Manuel Bouyer suggested that:
> ...
> It's possible that the drive has a firmware bug, and returns the wrong
> sense information in some cases.
> ...
So then we got Win2k running on that machine and ran similar tape writing tests
using the Windows backup program. And we got tape write errors there as well!
So at that point we called in Dell and without bothering to diagnose the
problem further they replaced the tape drive, the scsi controller (aic7899)
and the cable!
We also upgraded this new tape drive to the latest version of the firmware
we could find on the Dell web site (released on the 1st July 2003).
Then back to NetBSD with (we thought) the problem solved. I can't remember
how long it was before we started getting errors again. And although they do
seem a little less frequent than before I can't be 100% sure about that.
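One rough way to put numbers on that would be to tally the sense data in the
system logs from before and after the hardware swap. The following is only a
minimal Python sketch of the idea, not something that was run for this message.
It assumes the kernel messages look like the excerpts quoted in this message,
and the function name tally_sense is made up for illustration.

```python
import re
from collections import Counter

# Matches the sense-data lines the kernel prints after a
# "Check Condition on CDB" message (format assumed from the excerpts here).
FIELD_RE = re.compile(r"(SENSE KEY|ASC/ASCQ):\s*(.+?)\s*$")


def tally_sense(lines):
    """Count SENSE KEY and ASC/ASCQ values seen in an iterable of log lines."""
    counts = {"SENSE KEY": Counter(), "ASC/ASCQ": Counter()}
    for line in lines:
        m = FIELD_RE.search(line)
        if m:
            counts[m.group(1)][m.group(2)] += 1
    return counts
```

Feeding it the lines of one log file per period (for example the result of
open() on a saved copy of the system log; the file location is an assumption
and depends on the syslog setup) gives per-period counts that can be compared
directly.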
Some examples of the "check condition" errors we've been getting recently
are listed here. These are not in any particular order and were extracted from
around two months of system logs.
st0(ahc0:0:6:0): Check Condition on CDB: 0x11 01 00 00 01 00
    SENSE KEY: Media Error
    INFO FIELD: 1
    COMMAND INFO: 1341415 (0x1477e7)
    ASC/ASCQ: Unrecovered Read Error

st0(ahc0:0:6:0): Check Condition on CDB: 0x10 00 00 00 02 00
    SENSE KEY: Media Error
    COMMAND INFO: 719553 (0xafac1)
    ASC/ASCQ: Write Error

st0(ahc0:0:6:0): Check Condition on CDB: 0x00 00 00 00 00 00
    SENSE KEY: Hardware Error
    ASC/ASCQ: Diagnostic Failure on Component 0x84

st0(ahc0:0:6:0): Check Condition on CDB: 0x11 01 00 00 46 00
    SENSE KEY: Media Error
    INFO FIELD: 15
    COMMAND INFO: 163931 (0x2805b)
    ASC/ASCQ: Recorded Entity Not Found

st0(ahc0:0:6:0): Check Condition on CDB: 0x08 00 00 02 00 00
    SENSE KEY: Media Error
    INFO FIELD: 512
    COMMAND INFO: 163931 (0x2805b)
    ASC/ASCQ: Positioning Error Detected By Read of Medium
Perhaps some of these could have been caused by previous write operations
failing, leaving the data on the tape incorrect (i.e. no end of tape marker,
etc).
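For anyone who wants to see which tape operation each of the failing CDBs
corresponds to, here is a minimal Python sketch that decodes 6-byte tape CDBs
as the kernel prints them. The opcode values and field layout are those of the
SCSI sequential-access command set; the function name decode_cdb and the
wording of its output are made up for illustration, and only the opcodes that
appear in this message are covered.

```python
# Opcode names for the sequential-access (tape) commands seen in the logs.
OPCODES = {
    0x00: "TEST UNIT READY",
    0x08: "READ(6)",
    0x0A: "WRITE(6)",
    0x10: "WRITE FILEMARKS(6)",
    0x11: "SPACE(6)",
}

# SPACE(6) code field (low bits of byte 1): what is being spaced over.
SPACE_CODES = {0: "blocks", 1: "filemarks", 3: "end-of-data"}


def decode_cdb(text):
    """Decode a 6-byte CDB given as hex text, e.g. '0x0a 00 00 80 00 00'."""
    cdb = [int(tok, 16) for tok in text.split()]
    if len(cdb) != 6:
        raise ValueError("expected a 6-byte CDB")
    opcode = cdb[0]
    name = OPCODES.get(opcode, "unknown opcode 0x%02x" % opcode)
    # Bytes 2-4 hold a 24-bit big-endian transfer length or count.
    count = (cdb[2] << 16) | (cdb[3] << 8) | cdb[4]
    if opcode in (0x08, 0x0A):
        # FIXED bit set: length is in blocks; clear: length is in bytes.
        unit = "blocks" if cdb[1] & 0x01 else "bytes"
        return "%s, %d %s" % (name, count, unit)
    if opcode == 0x10:
        return "%s, %d filemarks" % (name, count)
    if opcode == 0x11:
        # The SPACE count is signed; a negative value means spacing backwards.
        if count & 0x800000:
            count -= 1 << 24
        code = cdb[1] & 0x0F
        what = SPACE_CODES.get(code, "code %d" % code)
        return "%s, %d %s" % (name, count, what)
    return name
```

With this, the first error in this message decodes as a 32768-byte
variable-length write, and the two SPACE commands as spacing forward over 1
and 70 filemarks respectively.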
We also now occasionally get a different error that causes the kernel
to generate a scsi card register dump and leaves the program that was accessing
the tape drive blocked in the kernel in an unkillable state. We actually
have to power cycle the machine in order to get the scsi controller working
again. The register dump is 103 lines long so I won't include it in this
message, but if you want to see it you can get it from
http://www.mcs.vuw.ac.nz/~duncan/ahc-dump.txt
So at this stage I'm still not sure whether we are looking at a (second)
faulty tape drive (or scsi controller) or whether there are perhaps problems
with the NetBSD ahc or st drivers. Any advice would be gratefully accepted!
Duncan