current-users: ongoing problems with SIOP vs. Exabyte tape drive

Subject: ongoing problems with SIOP vs. Exabyte tape drive
To: NetBSD-current Discussion List <current-users@NetBSD.ORG>
From: Greg A. Woods <woods@weird.com>
List: current-users
Date: 06/21/2001 15:52:21
This should probably be a send-pr, but I thought I'd post here first and
see if anyone can confirm what I'm seeing....

These problems are happening with the 2001/06/19 sources, though IIRC
nothing's changed substantially since 2001/03/24.  Here are the probe
messages for the devices in question:

siop0 at pci0 dev 14 function 0: Symbios Logic 53c875 (ultra-wide scsi)
siop0: using on-board RAM
siop0: interrupting at irq 14
scsibus1 at siop0: 16 targets, 8 luns per target
scsibus1: waiting 2 seconds for devices to settle...
siop0: alloc newcdb at PHY addr 0x23000
st0 at scsibus1 target 4 lun 0: <EXABYTE, EXB-85058HE-0000, 0108> SCSI2 1/sequential removable
st0: density code 0, 1024-byte blocks, write-enabled
st0: sync (200.0ns offset 15), 8-bit (5.000MB/s) transfers
ch0 at scsibus1 target 5 lun 0: <EXABYTE, EXB-210, 3.08> SCSI2 8/changer removable
ch0: 11 slots, 1 drive, 1 picker, 0 portals
ch0: async, 8-bit transfers

(those are the only devices on that host adapter....)

The first problem is kinda weird because it doesn't seem to happen all
the time -- only after a bus-reset.  In that case even a status inquiry
fails with a "not ready".

Jun 20 15:20:48 proven /netbsd: st0(siop0:0:4:0): command timeout
Jun 20 15:20:48 proven /netbsd: siop0: scsi bus reset
Jun 20 15:20:48 proven /netbsd: st0(siop0:0:4:0): command with tag id 0 reset
Jun 20 15:20:49 proven /netbsd: st0: async, 8-bit transfers
Jun 20 15:20:49 proven /netbsd: ch0(siop0:0:5:0):  Check Condition on CDB: 0xb8 00 00 52 00 01 00 00 00 08 00 00
Jun 20 15:20:49 proven /netbsd:     SENSE KEY:  Not Ready
Jun 20 15:20:49 proven /netbsd:      ASC/ASCQ:  Logical Unit Is in Process Of Becoming Ready
Jun 20 15:20:49 proven /netbsd: 
Jun 20 15:20:49 proven /netbsd: st0: sync (200.0ns offset 15), 8-bit (5.000MB/s) transfers
Jun 20 15:20:49 proven /netbsd: st0(siop0:0:4:0):  Check Condition on CDB: 0x00 00 00 00 00 00
Jun 20 15:20:49 proven /netbsd:     SENSE KEY:  Not Ready
Jun 20 15:20:49 proven /netbsd:      ASC/ASCQ:  Logical Unit Is in Process Of Becoming Ready
Jun 20 15:20:50 proven /netbsd: 


In general it seems to me that the driver should always wait if the
device is "in the process of becoming ready".  I.e. it should never
return an I/O error for any reason (at least not until some reasonable
timeout has expired).


Now the cause of the above reset definitely seems to be a real bug.  I'd
just loaded an Amanda tape and wanted to read the label.  I tried a
straight "dd if=/dev/nrst0 | less", but the driver complained a bunch of
times and returned I/O errors before things got stuck (and then the
reset happened):

Jun 20 15:16:47 proven /netbsd: st0(siop0:0:4:0):  Check Condition on CDB: 0x08 00 00 80 00 00
Jun 20 15:16:47 proven /netbsd:     SENSE KEY:  Blank Check
Jun 20 15:16:47 proven /netbsd:                 EOM Detected
Jun 20 15:16:47 proven /netbsd:    INFO FIELD:  32768
Jun 20 15:16:48 proven /netbsd:      ASC/ASCQ:  End-Of-Data Detected
Jun 20 15:16:48 proven /netbsd: 

Note the command timeout and reset happened all in the same second as
the error (and the previous identical error was only  two seconds
earlier)!


Now the next time the reset happened it seems to have been because of a
"not ready" from the changer (as it was moving and loading a tape).  I
was trying to do a manual "amtape DailySet1 show" at the time:

Jun 20 15:21:37 proven /netbsd: ch0(siop0:0:5:0):  Check Condition on CDB: 0xb8 00 00 00 00 0b 00 00 00 08 00 00
Jun 20 15:21:37 proven /netbsd:     SENSE KEY:  Not Ready
Jun 20 15:21:37 proven /netbsd:      ASC/ASCQ:  Logical Unit Is in Process Of Becoming Ready
Jun 20 15:21:38 proven /netbsd: 
Jun 20 15:39:07 proven /netbsd: st0(siop0:0:4:0): command timeout
Jun 20 15:39:07 proven /netbsd: siop0: scsi bus reset
Jun 20 15:39:07 proven /netbsd: st0(siop0:0:4:0): command with tag id 0 reset
Jun 20 15:39:07 proven /netbsd: st0: async, 8-bit transfers
Jun 20 15:39:30 proven /netbsd: st0: sync (200.0ns offset 15), 8-bit (5.000MB/s) transfers
Jun 20 15:39:31 proven /netbsd: st0(siop0:0:4:0):  Check Condition on CDB: 0x00 00 00 00 00 00
Jun 20 15:39:31 proven /netbsd:     SENSE KEY:  Not Ready
Jun 20 15:39:31 proven /netbsd:      ASC/ASCQ:  Logical Unit Is in Process Of Becoming Ready
Jun 20 15:39:31 proven /netbsd: 

Somewhat later I tried tackling the issue again and with this second
reset the tape driver suddenly and mysteriously started logging a far
more useful error message (giving the real cause of the dd failure):

Jun 20 19:36:35 proven /netbsd: st0: 32768-byte tape record too big for 512-byte user buffer
Jun 20 19:36:35 proven /netbsd: st0(siop0:0:4:0):  Check Condition on CDB: 0x08 00 00 02 00 00
Jun 20 19:36:35 proven /netbsd:     SENSE KEY:  No Additional Sense
Jun 20 19:36:35 proven /netbsd:                 Incorrect Length Indicator Set
Jun 20 19:36:35 proven /netbsd:    INFO FIELD:  -32256
Jun 20 19:36:35 proven /netbsd:  COMMAND INFO:  1 (0x1)
Jun 20 19:36:35 proven /netbsd:      ASC/ASCQ:  No Additional Sense Information
Jun 20 19:36:35 proven /netbsd: 


Now here's what's really weird.  Once I got the tape to read the label
correclty with "dd bs=32768" then Amanda was suddenly able to scan the
entire changer and successfully read all of the labels on all of the
tapes with nothing more than some more spurious "not ready" messages on
the console (from ch0).



Unfortunately Amanda's still unable to actually write a backup, not on
any of the ten tapes in the changer.  Note that these are brand new
Exabyte-brand tapes and they've all been retensioned and successfully
labeled.  Everything worked just fine before I switched to siop (modulo
that I also had some old tapes with a few too many errors).  (And I
refuse to believe that ten brand new tapes can all have errors not at
the very beginning, but within their first few megabytes....)

Jun 20 20:02:39 proven /netbsd: st0(siop0:0:4:0):  DEFERRED ERROR, key = 0x3
Jun 20 20:02:40 proven /netbsd: st0(siop0:0:4:0):  Check Condition on CDB: 0x0a 00 00 80 00 00
Jun 20 20:02:40 proven /netbsd:     SENSE KEY:  Media Error
Jun 20 20:02:40 proven /netbsd:    INFO FIELD:  32768
Jun 20 20:02:40 proven /netbsd:      ASC/ASCQ:  Excessive Write Errors
Jun 20 20:02:40 proven /netbsd: 
Jun 20 20:02:40 proven /netbsd: st0(siop0:0:4:0):  Check Condition on CDB: 0x10 00 00 00 02 00
Jun 20 20:02:40 proven /netbsd:     SENSE KEY:  Media Error
Jun 20 20:02:40 proven /netbsd:      ASC/ASCQ:  Excessive Write Errors
Jun 20 20:02:41 proven /netbsd: 

Amanda then dutifully tries another tape (because I configured it so)
and encounters the same error.

Jun 20 20:04:33 proven /netbsd: st0(siop0:0:4:0):  DEFERRED ERROR, key = 0x3
Jun 20 20:04:33 proven /netbsd: st0(siop0:0:4:0):  Check Condition on CDB: 0x0a 00 00 80 00 00
Jun 20 20:04:33 proven /netbsd:     SENSE KEY:  Media Error
Jun 20 20:04:33 proven /netbsd:    INFO FIELD:  32768
Jun 20 20:04:34 proven /netbsd:      ASC/ASCQ:  Excessive Write Errors
Jun 20 20:04:34 proven /netbsd: 
Jun 20 20:04:34 proven /netbsd: st0(siop0:0:4:0):  Check Condition on CDB: 0x10 00 00 00 02 00
Jun 20 20:04:34 proven /netbsd:     SENSE KEY:  Media Error
Jun 20 20:04:34 proven /netbsd:      ASC/ASCQ:  Excessive Write Errors
Jun 20 20:04:34 proven /netbsd: 

Note that 'dd' encounters the same trouble:

	$ dd if=/var/NetBSD-release/binary/sets/base.tgz of=/dev/nrst0
	dd: /dev/nrst0: Input/output error
	5913+0 records in
	5912+0 records out
	3026944 bytes transferred in 57.540 secs (52605 bytes/sec)

and the kernel says:

Jun 21 15:15:06 proven /netbsd: st0(siop0:0:4:0):  DEFERRED ERROR, key = 0x3
Jun 21 15:15:06 proven /netbsd: st0(siop0:0:4:0):  Check Condition on CDB: 0x0a 00 00 02 00 00
Jun 21 15:15:06 proven /netbsd:     SENSE KEY:  Media Error
Jun 21 15:15:06 proven /netbsd:    INFO FIELD:  512
Jun 21 15:15:07 proven /netbsd:      ASC/ASCQ:  Excessive Write Errors
Jun 21 15:15:07 proven /netbsd: 
Jun 21 15:15:07 proven /netbsd: st0(siop0:0:4:0):  Check Condition on CDB: 0x10 00 00 00 02 00
Jun 21 15:15:07 proven /netbsd:     SENSE KEY:  Media Error
Jun 21 15:15:07 proven /netbsd:      ASC/ASCQ:  Excessive Write Errors
Jun 21 15:15:07 proven /netbsd: 

The SCSI manual for this drive says, unfortunately, that an ASC/ASCQ of
0x3,0x2 can refer to any of a wide variety of fault symptom codes.  It
would appear to me though that the driver's not reporting the exact FSC.

If you ask me I'd almost have to guess that the driver is getting an
error when writing the end-of-file mark, since 'dd' gets the error after
sending all but the very last block of the file (and may even be getting
the error in the close() -- I/we should fix the code to print the
failing call...).  Indeed two of the FSCs that could cause a "media
error" with "excessive write errors" are "write {set,file}mark failure
after internal retry limit exceeded", and there's also "the read-back-
check operation detected a different block type from what was written
(old data is not being fully overwritten)".  For the first two FSCs
there's a footnote:

	If the read-back-check criteria are not met for an LBOT, LBOP,
	filemark, setmark, or EOD block written to tape, the tape drive
	moves the tape backward and retries the operation 11 times.  If
	the read-back-check criteria are still not met, the tape drive
	returns Check Condition status and these FSCs apply.

So, the obvious first question is how do I get the driver to report the
Fault Symptom Code field?  (It's returned by a REQUEST SENSE (0x03)
command in byte 28 of the Extended Sense Bytes and is an "Exabyte-
unique byte that specifies the reason for the most recent Check
Condition Status.")


Finally here's one more quirk.  'mt status' no longer seems to report
the proper "current file number" or "current block number".  This is an
Amanda-labeled tape which means that the first file on the tape is the
label (I don't know if there's actually a second file on this particular
tape, or not, but I don't think so because any subsequent "mt fsf"
returns immediately, while the first has a short pause), and the first
file is also just one (32kb) block long.

	$ mt rewind
	$ mt status
	SCSI tape drive, residual=0
	ds=3<Mounted>
	er=0
	blocksize: 0 (0, 0, 0, 0)
	density: 0 (0, 0, 0, 0)
	current file number: 0
	current block number: 0
	$ mt fsf
	$ mt status
	SCSI tape drive, residual=0
	ds=3<Mounted>
	er=0
	blocksize: 0 (0, 0, 0, 0)
	density: 0 (0, 0, 0, 0)
	current file number: 0
	current block number: 0

-- 
							Greg A. Woods

+1 416 218-0098      VE3TCP      <gwoods@acm.org>     <woods@robohack.ca>
Planix, Inc. <woods@planix.com>;   Secrets of the Weird <woods@weird.com>