Subject: port-mac68k/2828: sbc scsi driver crashes cause "media damage" on MO drive
To: None <gnats-bugs@gnats.netbsd.org>
From: None <saw@sun0.urz.uni-heidelberg.de>
List: netbsd-bugs
Date: 10/10/1996 12:42:38
>Number: 2828
>Category: port-mac68k
>Synopsis: sbc scsi driver crashes cause "media damage" on MO drive
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: gnats-admin (GNATS administrator)
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Thu Oct 10 10:35:01 1996
>Last-Modified:
>Originator: Hauke Fath
>Organization:
<< Why does send-pr.el fill in my .sig here? >>
>Release: Sep 6 1996
>Environment:
System: NetBSD se30 1.2_BETA NetBSD 1.2_BETA (BONSAI) #46: Tue Oct 8 08:19:20 GMT 1996 hauke@se30:/u/hauke/sys/arch/mac68k/compile/BONSAI mac68k
>Description:
Frequent write accesses to a Fujitsu M2512A 230MB magneto-optical
drive, especially to metadata (sup, rm -r, tar -zxf to MO),
reproducibly cause the following scsi driver crash:
root:~ # sbc0: can not transfer more data
sbc0: aborting, but phase=DATA_OUT (reset)
sbc0: reset SCSI bus for TID=2 LUN=0
panic: ncr5380_scsi_cmd: polled request, abort failed
Stopped at _Debugger+0x6: unlk a6
db> t
_Debugger(19668,7ac0,8d2c58,1013,8d2c70) + 6
_panic(7ac0,0,7,2,3) + 34
_ncr5380_scsi_cmd(6c06f80) + 80
_scsi_done(6c06f80,79d8,6c048ac,6c04800,6c048ac) + 6a
_ncr5380_scsi_cmd(6c04800) + 352
_ncr5380_scsi_cmd(6c04800) + 1950
_ncr5380_scsi_cmd(6c04800) + 650
_ncr5380_scsi_cmd(6c06f80) + 110
_scsi_execute_xs(6c06f80) + 28
_scsi_scsi_cmd(6c05de0,8d2d8c,6,6ee9000,2000,4,2710,761c88,1001) + 8a
_sdstart(6c02200,6c022a6,761c88) + 1ca
_sdstrategy(761c88,8d2df0,2844a,8d2de8,0) + b4
_spec_strategy(8d2de8) + 24
_bwrite(761c88,6c54c00,1,1,0) + d8
_ffs_update(8d2e44) + 182
_ufs_setattr(8d2e7c) + 234
_sys_utimes(6c54c00,8d2f88,8d2f80) + 162
_syscall(8a) + 10a
_trap0() + e
db>
This leads to...
===========
SCSI VERIFY REPORTED THE FOLLOWING SENSE CONDITIONS:
Sense #  Key  Code  Info    Sector  Description of problem
------------------------------------------------------------------
      1    3    17  131462  131000  0x03, MEDIUM ERROR. 0x11, UNRECOVERED DATA ERROR.
      2    3    17  131463  131463  0x03, MEDIUM ERROR. 0x11, UNRECOVERED DATA ERROR.
      3    3    17  131464  131464  0x03, MEDIUM ERROR. 0x11, UNRECOVERED DATA ERROR.
      4    3    17  131465  131465  0x03, MEDIUM ERROR. 0x11, UNRECOVERED DATA ERROR.
      5    3    17  131466  131466  0x03, MEDIUM ERROR. 0x11, UNRECOVERED DATA ERROR.
      6    3    17  131467  131467  0x03, MEDIUM ERROR. 0x11, UNRECOVERED DATA ERROR.
      7    3    17  131468  131468  0x03, MEDIUM ERROR. 0x11, UNRECOVERED DATA ERROR.
      8    3    17  131469  131469  0x03, MEDIUM ERROR. 0x11, UNRECOVERED DATA ERROR.
      9    3    17  131470  131470  0x03, MEDIUM ERROR. 0x11, UNRECOVERED DATA ERROR.
     10    3    17  131471  131471  0x03, MEDIUM ERROR. 0x11, UNRECOVERED DATA ERROR.
     11    3    17  131472  131472  0x03, MEDIUM ERROR. 0x11, UNRECOVERED DATA ERROR.
     12    3    17  131473  131473  0x03, MEDIUM ERROR. 0x11, UNRECOVERED DATA ERROR.
     13    3    17  131474  131474  0x03, MEDIUM ERROR. 0x11, UNRECOVERED DATA ERROR.
     14    3    17  131475  131475  0x03, MEDIUM ERROR. 0x11, UNRECOVERED DATA ERROR.
     15    3    17  131476  131476  0x03, MEDIUM ERROR. 0x11, UNRECOVERED DATA ERROR.
NOTE: Problems found during the SCSI Verify.
-- There are always 15 sectors reported as media defects.
>How-To-Repeat:
Intensive write accesses to MO disk (sup, untar to disk).
>Fix:
Both the ncrscsi and the sbc MD drivers pull the reset line of the
scsi bus by default.
From the Fujitsu M2512 "SCSI Logical Specifications" manual:
---------------------------------------------------------------------------
> (1.6.6 Reset processing)
>
> The INIT [initiator, hf] can execute the following reset processing
> methods on the SCSI bus:
> o RESET condition
> o BUS DEVICE RESET message
> o ABORT message.
[...]
> Command type: WRITE, WRITE AND VERIFY, WRITELONG
>
> Stop processing of command execution:
> The data block where data is being written is not always successfully
> processed, including the ECC field. Not all data items transferred
> from the INIT to ODD [optical disk drive, hf] may be
> written to the disk.
---------------------------------------------------------------------------
That is: if you send a hard reset to the drive after an error, and the
reset arrives while a write is still in progress, you are likely to
generate what shows up as a media defect on the next access but is
really just an incompletely written block.
I see two alternatives:
a) try a 'TERMINATE I/O PROCESS (11h)' first, as it is guaranteed to
maintain data integrity on the target, or
b) wait a sufficiently long time (say, half a second) to let the
target complete whatever it's doing *before* you issue a reset.
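The two alternatives can also be combined into one escalation path. The
following is only a sketch of that idea, not a patch against the sbc or
ncrscsi code: struct recovery_ops, recover_target() and the fake_* helpers
are hypothetical names made up for illustration, and the hooks stand in
for whatever the real driver uses to send a message, delay, and assert
the reset line. The only value taken from the text above is the
TERMINATE I/O PROCESS message code (11h).

```c
#include <stdbool.h>

#define MSG_TERMINATE_IO_PROCESS 0x11	/* message code quoted in a) above */

/* Hypothetical driver hooks; NOT the real sbc/ncr5380 interface. */
struct recovery_ops {
	/* Returns true if the target accepted the message. */
	bool (*send_message)(void *ctx, int target, unsigned char msg);
	void (*delay_ms)(void *ctx, unsigned int ms);
	void (*reset_bus)(void *ctx);
};

enum recovery_result {
	RECOVERED_BY_TERMINATE,
	RECOVERED_BY_BUS_RESET
};

/*
 * Try the gentle message first (alternative a).  Only if the target
 * refuses it, give the target settle_ms to finish the block it is
 * writing (alternative b) and then pull the reset line.
 */
enum recovery_result
recover_target(const struct recovery_ops *ops, void *ctx, int target,
    unsigned int settle_ms)
{
	if (ops->send_message(ctx, target, MSG_TERMINATE_IO_PROCESS))
		return RECOVERED_BY_TERMINATE;

	ops->delay_ms(ctx, settle_ms);
	ops->reset_bus(ctx);
	return RECOVERED_BY_BUS_RESET;
}

/* Stand-in bus that only records what was asked of it. */
struct fake_bus {
	bool accepts_terminate;		/* does the target honour 11h? */
	unsigned int waited_ms;		/* total delay requested so far */
	int resets;			/* number of bus resets issued */
};

static bool
fake_send(void *ctx, int target, unsigned char msg)
{
	struct fake_bus *bus = ctx;

	(void)target;
	return msg == MSG_TERMINATE_IO_PROCESS && bus->accepts_terminate;
}

static void
fake_delay(void *ctx, unsigned int ms)
{
	((struct fake_bus *)ctx)->waited_ms += ms;
}

static void
fake_reset(void *ctx)
{
	((struct fake_bus *)ctx)->resets++;
}

static const struct recovery_ops fake_ops = {
	fake_send, fake_delay, fake_reset
};
```

With a target that honours the message, recover_target(&fake_ops, &bus,
2, 500) never touches the reset line; with one that refuses it, the bus
is reset exactly once, and only after the 500 ms settle time. In a real
driver the delay would have to be safe in interrupt context, which this
sketch does not address.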
BTW, there was a related issue with the sleep/wakeup of an MO drive
and the ncrscsi driver. The driver wouldn't give the target enough
time to spin up and respond, but instead hit it with a hard reset
again and again, causing it to spin down and up, and...
>Audit-Trail:
>Unformatted: