Subject: port-mac68k/2828: sbc scsi driver crashes cause "media damage" on MO drive
To: None <gnats-bugs@gnats.netbsd.org>
From: None <saw@sun0.urz.uni-heidelberg.de>
List: netbsd-bugs
Date: 10/10/1996 12:42:38
>Number:         2828
>Category:       port-mac68k
>Synopsis:       sbc scsi driver crashes cause "media damage" on MO drive
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    gnats-admin (GNATS administrator)
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Oct 10 10:35:01 1996
>Last-Modified:
>Originator:     Hauke Fath
>Organization:
<< Why does send-pr.el fill in my .sig here? >>
>Release:        Sep 6 1996
>Environment:
	
System: NetBSD se30 1.2_BETA NetBSD 1.2_BETA (BONSAI) #46: Tue Oct 8 08:19:20 GMT 1996 hauke@se30:/u/hauke/sys/arch/mac68k/compile/BONSAI mac68k

>Description:

Frequent write accesses to a Fujitsu M2512A 230MB magneto-optical
drive, especially metadata updates (sup, rm -r, tar -zxf onto the MO),
reproducibly cause the following SCSI driver crash:

root:~ # sbc0: can not transfer more data
sbc0: aborting, but phase=DATA_OUT (reset)
sbc0: reset SCSI bus for TID=2 LUN=0
panic: ncr5380_scsi_cmd: polled request, abort failed
Stopped at      _Debugger+0x6:  unlk    a6
db> t
_Debugger(19668,7ac0,8d2c58,1013,8d2c70) + 6
_panic(7ac0,0,7,2,3) + 34
_ncr5380_scsi_cmd(6c06f80) + 80
_scsi_done(6c06f80,79d8,6c048ac,6c04800,6c048ac) + 6a
_ncr5380_scsi_cmd(6c04800) + 352
_ncr5380_scsi_cmd(6c04800) + 1950
_ncr5380_scsi_cmd(6c04800) + 650
_ncr5380_scsi_cmd(6c06f80) + 110
_scsi_execute_xs(6c06f80) + 28
_scsi_scsi_cmd(6c05de0,8d2d8c,6,6ee9000,2000,4,2710,761c88,1001) + 8a
_sdstart(6c02200,6c022a6,761c88) + 1ca
_sdstrategy(761c88,8d2df0,2844a,8d2de8,0) + b4
_spec_strategy(8d2de8) + 24
_bwrite(761c88,6c54c00,1,1,0) + d8
_ffs_update(8d2e44) + 182
_ufs_setattr(8d2e7c) + 234
_sys_utimes(6c54c00,8d2f88,8d2f80) + 162
_syscall(8a) + 10a
_trap0() + e
db>


This leads to the following result on the next SCSI verify of the medium:

===========
SCSI VERIFY REPORTED THE FOLLOWING SENSE CONDITIONS:

 Sense #  Key Code        Info      Sector  Description of problem
------------------------------------------------------------------
       1    3   17      131462      131462  0x03, MEDIUM ERROR.  0x11, UNRECOVERED DATA ERROR.
       2    3   17      131463      131463  0x03, MEDIUM ERROR.  0x11, UNRECOVERED DATA ERROR.
       3    3   17      131464      131464  0x03, MEDIUM ERROR.  0x11, UNRECOVERED DATA ERROR.
       4    3   17      131465      131465  0x03, MEDIUM ERROR.  0x11, UNRECOVERED DATA ERROR.
       5    3   17      131466      131466  0x03, MEDIUM ERROR.  0x11, UNRECOVERED DATA ERROR.
       6    3   17      131467      131467  0x03, MEDIUM ERROR.  0x11, UNRECOVERED DATA ERROR.
       7    3   17      131468      131468  0x03, MEDIUM ERROR.  0x11, UNRECOVERED DATA ERROR.
       8    3   17      131469      131469  0x03, MEDIUM ERROR.  0x11, UNRECOVERED DATA ERROR.
       9    3   17      131470      131470  0x03, MEDIUM ERROR.  0x11, UNRECOVERED DATA ERROR.
      10    3   17      131471      131471  0x03, MEDIUM ERROR.  0x11, UNRECOVERED DATA ERROR.
      11    3   17      131472      131472  0x03, MEDIUM ERROR.  0x11, UNRECOVERED DATA ERROR.
      12    3   17      131473      131473  0x03, MEDIUM ERROR.  0x11, UNRECOVERED DATA ERROR.
      13    3   17      131474      131474  0x03, MEDIUM ERROR.  0x11, UNRECOVERED DATA ERROR.
      14    3   17      131475      131475  0x03, MEDIUM ERROR.  0x11, UNRECOVERED DATA ERROR.
      15    3   17      131476      131476  0x03, MEDIUM ERROR.  0x11, UNRECOVERED DATA ERROR.
NOTE:  Problems found during the SCSI Verify.


-- There are always 15 sectors reported as media defects.
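For reference, the "Key", "Code" and "Info" columns above come straight from fixed-format SCSI sense data: sense key 3 is MEDIUM ERROR, additional sense code 17 (0x11) is UNRECOVERED READ ERROR/UNRECOVERED DATA ERROR, and the information field carries the failing block address. A minimal C sketch of pulling those fields out of a fixed-format (0x70/0x71) sense buffer; `struct sense_info` and `decode_fixed_sense()` are hypothetical names, not part of the NetBSD SCSI layer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of interest from fixed-format (0x70/0x71) SCSI sense data. */
struct sense_info {
	int      valid;	/* VALID bit: information field is meaningful */
	uint8_t  key;	/* sense key, e.g. 0x03 = MEDIUM ERROR */
	uint32_t info;	/* information field (failing LBA), bytes 3..6 */
	uint8_t  asc;	/* additional sense code, e.g. 0x11 */
	uint8_t  ascq;	/* additional sense code qualifier */
};

/* Decode a sense buffer of at least 14 bytes (hypothetical helper). */
static struct sense_info
decode_fixed_sense(const uint8_t *s)
{
	struct sense_info r;

	assert(s != NULL);
	r.valid = (s[0] & 0x80) != 0;
	r.key = s[2] & 0x0f;
	/* The information field is stored big-endian. */
	r.info = ((uint32_t)s[3] << 24) | ((uint32_t)s[4] << 16) |
	    ((uint32_t)s[5] << 8) | (uint32_t)s[6];
	r.asc = s[12];
	r.ascq = s[13];
	return r;
}
```

A sense buffer matching row 2 of the table would have byte 2 = 0x03, bytes 3..6 = 00 02 01 87 (131463) and byte 12 = 0x11.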


>How-To-Repeat:
Perform intensive write accesses to the MO disk (e.g. sup, untarring onto the disk).

>Fix:
Both the ncrscsi and the sbc MD drivers assert the RESET line of the
SCSI bus by default.

From the Fujitsu M2512 "SCSI Logical Specifications" manual:

---------------------------------------------------------------------------
> (1.6.6 Reset processing)
>
> The INIT [initiator, hf] can execute the following reset processing 
> methods on the SCSI bus:
> o  RESET condition
> o  BUS DEVICE RESET message
> o  ABORT message.

[...]

> Command type: WRITE, WRITE AND VERIFY, WRITELONG
>
> Stop processing of command execution:
> The date block where data is being written is not always successfully
> processed, including the ECC field. Not all data items transferred 
> from the INIT to ODD [optical disk drive, hf] may be 
> written to the disk.
---------------------------------------------------------------------------

That is: send a hard reset to your drive in response to an error while
it is still writing, and you are likely to leave behind an incompletely
written block. That block shows up as a media defect on the next access,
but it is not a real defect of the medium.

I see two alternatives:

a) try a TERMINATE I/O PROCESS (11h) message first, as it is guaranteed
to maintain data integrity on the target, or

b) wait a sufficiently long time (say, half a second) to let the 
target complete whatever it's doing *before* you issue a reset.
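The two alternatives can be combined into one escalation sequence: send TERMINATE I/O PROCESS first, and only if the target refuses it, wait a settle time before asserting RESET. ABORT and BUS DEVICE RESET are deliberately not used as intermediate steps, since the quoted manual lists them as reset processing methods with the same risk for WRITE commands. A minimal sketch as a small state machine, assuming the driver can tell whether the message was accepted; all names (`rec_next()`, `REC_SETTLE_MS`, the enums) are hypothetical and not taken from sbc or ncrscsi, and whether the M2512A honors TERMINATE I/O PROCESS is not established here.

```c
#include <assert.h>
#include <stddef.h>

/* Message code from the report: TERMINATE I/O PROCESS (11h). */
#define MSG_TERMINATE_IO_PROCESS	0x11
/* "say, half a second" of settle time before asserting RESET. */
#define REC_SETTLE_MS			500u

enum rec_state  { REC_START, REC_SENT_TIP, REC_WAITED, REC_DONE };
enum rec_action { ACT_SEND_TIP, ACT_DELAY, ACT_BUS_RESET, ACT_NONE };

/*
 * Return the next recovery action and advance *st.  last_ok reports
 * whether the previous action succeeded (for ACT_SEND_TIP: whether
 * the target accepted the message and finished cleanly).
 */
static enum rec_action
rec_next(enum rec_state *st, int last_ok)
{
	assert(st != NULL);
	switch (*st) {
	case REC_START:
		*st = REC_SENT_TIP;
		return ACT_SEND_TIP;
	case REC_SENT_TIP:
		if (last_ok) {
			*st = REC_DONE;
			return ACT_NONE;
		}
		/* Refused: let the target finish the block (and ECC) first. */
		*st = REC_WAITED;
		return ACT_DELAY;
	case REC_WAITED:
		*st = REC_DONE;
		return ACT_BUS_RESET;
	default:
		return ACT_NONE;
	}
}
```

A caller would loop on rec_next(): send MSG_TERMINATE_IO_PROCESS for ACT_SEND_TIP, sleep REC_SETTLE_MS for ACT_DELAY, pulse RST for ACT_BUS_RESET, and stop at ACT_NONE. Keeping the decision logic apart from the hardware access makes the escalation order easy to review and test.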

BTW, there was a related issue with the sleep/wakeup of an MO drive 
and the ncrscsi driver. The driver wouldn't give the target enough 
time to spin up and respond, but instead hit it with a hard reset 
again and again, causing it to spin down and up, and...
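For the spin-up case, a driver could inspect the sense data returned by TEST UNIT READY and keep waiting while the drive reports that it is becoming ready, rather than resetting it. A minimal sketch; `classify_tur()` and the 30-second `SPINUP_TIMEOUT_MS` budget are hypothetical and not values from the Fujitsu manual. Sense key 2 with ASC/ASCQ 04h/01h is the standard NOT READY, LOGICAL UNIT IS IN PROCESS OF BECOMING READY condition.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical budget for an MO drive to spin up after wakeup. */
#define SPINUP_TIMEOUT_MS	30000u

enum tur_verdict { TUR_READY, TUR_WAIT, TUR_ESCALATE };

/* Decide what to do after a TEST UNIT READY (hypothetical helper). */
static enum tur_verdict
classify_tur(int status_good, uint8_t key, uint8_t asc, uint8_t ascq,
    unsigned elapsed_ms)
{
	assert(key <= 0x0f);	/* sense keys are 4 bits wide */
	if (status_good)
		return TUR_READY;
	if (elapsed_ms >= SPINUP_TIMEOUT_MS)
		return TUR_ESCALATE;	/* only now consider error recovery */
	/* NOT READY / LOGICAL UNIT IS IN PROCESS OF BECOMING READY */
	if (key == 0x02 && asc == 0x04 && ascq == 0x01)
		return TUR_WAIT;
	/* UNIT ATTENTION, e.g. reported once after power-on or reset */
	if (key == 0x06)
		return TUR_WAIT;
	return TUR_ESCALATE;
}
```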

>Audit-Trail:
>Unformatted: