current-users: Re: SDLT 320 Tape Drive on -current

Subject: Re: SDLT 320 Tape Drive on -current
To: None <current-users@netbsd.org>
From: Duncan McEwan <duncan@mcs.vuw.ac.nz>
List: current-users
Date: 09/09/2003 17:58:01
Around the end of June I posted a message describing problems we were having
with an SDLT 320GB tape drive on NetBSD-current (1.6U compiled around
mid-June).  The errors were generally along the lines of:

st0(ahc0:0:6:0):  Check Condition on CDB: 0x0a 00 00 80 00 00
    SENSE KEY:  Media Error
   INFO FIELD:  145031168
 COMMAND INFO:  4953956 (0x4b9764)
     ASC/ASCQ:  Write Error

Since both the tape drive and the (multiple) tapes were pretty much brand new,
I didn't think the media error reported was likely to be correct, so I asked
this list for other possibilities.

Manuel Bouyer suggested that:

> ...
> It's possible that the drive has a firmware bug, and returns the wrong
> sense information in some cases.
> ...

So then we got Win2k running on that machine and ran similar tape writing tests
using the Windows backup program.  And we got tape write errors there as well!
So at that point we called in Dell, and without bothering to diagnose the
problem further they replaced the tape drive, the SCSI controller (aic7899)
and the cable!

We also upgraded this new tape drive to the latest version of the firmware
we could find on the Dell web site (released on the 1st July 2003).

Then back to NetBSD with (we thought) the problem solved.  I can't remember
how long it was before we started getting errors again.  And although they do 
seem a little less frequent than before I can't be 100% sure about that.

Some examples of the "check condition" errors we've been getting recently
are listed here.  These are not in any particular order and are extracted from
around two months of system logs.

st0(ahc0:0:6:0):  Check Condition on CDB: 0x11 01 00 00 01 00
    SENSE KEY:  Media Error
   INFO FIELD:  1
 COMMAND INFO:  1341415 (0x1477e7)
     ASC/ASCQ:  Unrecovered Read Error

st0(ahc0:0:6:0):  Check Condition on CDB: 0x10 00 00 00 02 00
    SENSE KEY:  Media Error
 COMMAND INFO:  719553 (0xafac1)
     ASC/ASCQ:  Write Error

st0(ahc0:0:6:0):  Check Condition on CDB: 0x00 00 00 00 00 00
    SENSE KEY:  Hardware Error
     ASC/ASCQ:  Diagnostic Failure on Component 0x84

st0(ahc0:0:6:0):  Check Condition on CDB: 0x11 01 00 00 46 00
    SENSE KEY:  Media Error
   INFO FIELD:  15
 COMMAND INFO:  163931 (0x2805b)
     ASC/ASCQ:  Recorded Entity Not Found

st0(ahc0:0:6:0):  Check Condition on CDB: 0x08 00 00 02 00 00
    SENSE KEY:  Media Error
   INFO FIELD:  512
 COMMAND INFO:  163931 (0x2805b)
     ASC/ASCQ:  Positioning Error Detected By Read of Medium
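
For anyone reading the CDB bytes in the errors above, here is a minimal
Python sketch that decodes them.  It assumes the standard 6-byte command
layouts for sequential-access devices (opcode in byte 0, flags in byte 1, a
24-bit count or length in bytes 2-4) and only knows the five opcodes that
appear in this message.  The function name decode_cdb is made up for
illustration and is not part of any NetBSD tool.

```python
# Opcodes seen in this message, named as in the SCSI command set for tapes.
OPCODES = {
    0x00: "TEST UNIT READY",
    0x08: "READ(6)",
    0x0A: "WRITE(6)",
    0x10: "WRITE FILEMARKS(6)",
    0x11: "SPACE(6)",
}

# SPACE code values (low bits of byte 1); only the two common ones are listed.
SPACE_CODES = {0: "blocks", 1: "filemarks"}


def decode_cdb(text):
    """Decode a CDB printed as '0x0a 00 00 80 00 00' into a short summary."""
    cdb = [int(tok, 16) for tok in text.split()]
    if len(cdb) != 6:
        raise ValueError("expected a 6-byte CDB")
    opcode = cdb[0]
    name = OPCODES.get(opcode)
    if name is None:
        return f"unknown opcode 0x{opcode:02x}"
    # Bytes 2-4 hold a big-endian 24-bit length or count.
    count = (cdb[2] << 16) | (cdb[3] << 8) | cdb[4]
    if opcode in (0x08, 0x0A):
        # FIXED bit (bit 0 of byte 1): length is in blocks if set, bytes if clear.
        unit = "blocks" if cdb[1] & 0x01 else "bytes"
        return f"{name}: {count} {unit}"
    if opcode == 0x10:
        return f"{name}: {count} filemarks"
    if opcode == 0x11:
        # The SPACE count is signed (two's complement); negative means backwards.
        if count & 0x800000:
            count -= 1 << 24
        code = cdb[1] & 0x0F
        return f"{name}: {count} {SPACE_CODES.get(code, 'code %d' % code)}"
    return name


if __name__ == "__main__":
    for cdb in ("0x0a 00 00 80 00 00", "0x11 01 00 00 01 00",
                "0x10 00 00 00 02 00", "0x00 00 00 00 00 00",
                "0x11 01 00 00 46 00", "0x08 00 00 02 00 00"):
        print(cdb, "->", decode_cdb(cdb))
```

Read this way, the failing commands are a 32768-byte variable-block write,
forward spaces over 1 and 70 filemarks, a write of 2 filemarks, a test unit
ready and a 512-byte read.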

Perhaps some of these could have been caused by previous write operations
failing, leaving the data on the tape incorrect (i.e. no end of tape marker,
etc).

We also now occasionally get a different error that causes the kernel to
generate a SCSI card register dump and leaves the program that was accessing
the tape drive blocked in the kernel in an unkillable state.  We actually
have to power cycle the machine in order to get the SCSI controller working
again.  The register dump is 103 lines long so I won't include it in this
message, but if you want to see it you can get it from

	http://www.mcs.vuw.ac.nz/~duncan/ahc-dump.txt

So at this stage I'm still not sure whether we are looking at a (second)
faulty tape drive (or SCSI controller) or whether there are perhaps problems
with the NetBSD ahc or st drivers.  Any advice would be gratefully accepted!

Duncan