netbsd-bugs: kern/9856: wd driver loses seriously in face of bad blocks

Subject: kern/9856: wd driver loses seriously in face of bad blocks
To: None <gnats-bugs@gnats.netbsd.org>
From: None <jhawk@MIT.EDU>
List: netbsd-bugs
Date: 04/10/2000 15:31:10
>Number:         9856
>Category:       kern
>Synopsis:       wd driver loses seriously in face of bad blocks
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Apr 10 14:32:01 PDT 2000
>Closed-Date:
>Last-Modified:
>Originator:     John Hawkinson
>Release:        NetBSD 1.4.2
>Organization:

>Environment:
	
System: NetBSD 1.4.2 (ZORKMID) #63: Sat Apr  8 12:19:54 EDT 2000
    jhawk@zorkmid.mit.edu:/usr/src/sys/arch/i386/compile/ZORKMID

>Description:
	The wd driver misbehaves rather spectacularly in the face of
bad blocks. There are a number of problems.

1)	The drive returns atapi error number 1 on some reads.
According to the ATAv4 draft I was able to find, this indicates
"obsolete". sys/dev/ata.c's atapi_errno() prints a null string
for this error, rather than anything useful. This results in:

   wd0e:   reading fsbn ...

which is confusing.  I've patched my kernel to report it as "(obsolete)"
so when you see that below, that is what is meant by it. I don't really know
what to conclude from this error. Perhaps my drive is really ATAv3
and happens to support some ATAv4 features like Ultra DMA? I don't know how
to debug this. Perhaps CFA REQUEST EXTENDED ERROR CODE could/should be used?
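
To illustrate the kind of reporting I mean, here is a minimal sketch in C.
It is not the NetBSD code: the function name ata4_describe_error() and the
table are hypothetical, the bit names are as I read them in the ATA/ATAPI-4
draft, and it assumes "error number 1" means bit 0 of the error register
is set.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical table of ATA-4 error register bits (assumed layout,
 * taken from the ATA/ATAPI-4 draft; not copied from the NetBSD tree).
 */
static const char *const ata4_err_names[8] = {
	"(obsolete)",			/* bit 0: was AMNF before ATA-4 */
	"no media",			/* bit 1: NM */
	"aborted command",		/* bit 2: ABRT */
	"media change requested",	/* bit 3: MCR */
	"id not found",			/* bit 4: IDNF */
	"media changed",		/* bit 5: MC */
	"uncorrectable data error",	/* bit 6: UNC */
	"interface CRC error",		/* bit 7: ICRC */
};

/*
 * Write a comma-separated description of every set bit in err into buf.
 * Returns the number of characters written, not counting the NUL.
 * Output is truncated (and still NUL-terminated) if buf is too small.
 */
static size_t
ata4_describe_error(uint8_t err, char *buf, size_t len)
{
	size_t used = 0;
	int bit, n;

	if (len == 0)
		return 0;
	buf[0] = '\0';
	for (bit = 0; bit < 8; bit++) {
		if ((err & (1u << bit)) == 0)
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
		    used ? ", " : "", ata4_err_names[bit]);
		if (n < 0)
			break;
		if ((size_t)n >= len - used) {
			used = len - 1;	/* truncated */
			break;
		}
		used += (size_t)n;
	}
	return used;
}
```

With a table like this, an error register value of 0x01 would print as
"(obsolete)" instead of an empty string, which is what my local patch does.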

2)	The wd driver can take *forever* to time out, where forever == minutes.
Please see below (How-To-Repeat) for an example of a 1-block read
that took about 10.5 minutes to fail. Note that the kernel printfs regarding
disk errors did not appear until 9 minutes into it. Specifically,
9m6s and 9m14s for the (obsolete) errors, 9m23s for the
uncorrectable+downgrade, and 9m35s for the next (obsolete).

This is pretty hokey.

3)	The driver seems to spend a lot of time on each read, then
downgrades the transfer mode, and then rereads the same blocks.
I don't have precise timings for this one, but I earlier had a case
where it spent ~30 seconds on each attempt to read a single
block, got 3 (obsolete) errors, downgraded to Ultra-DMA mode 1,
tried twice more with (obsolete) errors, and then reported:

pciide0:0:0: lost interrupt
        type: ata
        c_bcount: 37888
        c_skip: 19456
wd0e: device timeout reading fsbn 7455094 of 7455056-7455167 (wd0 bn 12899239; cn 13649 tn 14 sn 52)
wd0e:  uncorrectable data error reading fsbn 7455056 of 7455056-7455167 (wd0 bn 12899201; cn 13649 tn 14 sn 14), retrying
wd0e:  uncorrectable data error reading fsbn 7455056 of 7455056-7455167 (wd0 bn 12899201; cn 13649 tn 14 sn 14), retrying

then downgraded to PIO mode 4.

It doesn't seem like the downgrades were ever necessary or appropriate
(this is not a DMA problem, it is a physical problem, presumably), yet
they happened regardless and took a long time to take effect.
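
To make the distinction concrete, here is a rough sketch in C of the policy
I would expect. Everything in it is hypothetical (the enum, the function
names, and the ATA_ERR_* constants are mine, with bit values as I read them
in the ATA/ATAPI-4 draft); it is not what the wd driver currently does. The
idea is only that a media error should not count against the transfer mode,
while an interface CRC error plausibly should.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed ATA-4 error register bits; names are illustrative only. */
#define ATA_ERR_IDNF	0x10	/* id not found */
#define ATA_ERR_UNC	0x40	/* uncorrectable data error */
#define ATA_ERR_ICRC	0x80	/* interface CRC error (Ultra-DMA) */

enum fault_class {
	FAULT_OTHER,	/* anything not classified below */
	FAULT_MEDIA,	/* the platter/sector itself is bad */
	FAULT_LINK	/* the transfer over the cable went wrong */
};

/* Classify an error register value; link errors take precedence. */
static enum fault_class
classify_fault(uint8_t err)
{
	if (err & ATA_ERR_ICRC)
		return FAULT_LINK;
	if (err & (ATA_ERR_UNC | ATA_ERR_IDNF))
		return FAULT_MEDIA;
	return FAULT_OTHER;
}

/* Only a link-level fault is evidence that a slower mode might help. */
static bool
should_downgrade(uint8_t err)
{
	return classify_fault(err) == FAULT_LINK;
}
```

Under this sketch, the "uncorrectable data error" above (0x40) and the
(obsolete) error (0x01) would both leave the transfer mode alone. How lost
interrupts and device timeouts should be treated is a separate question
that this sketch does not try to answer.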


>How-To-Repeat:

Kernel probe:

NetBSD 1.4.2 (ZORKMID) #63: Sat Apr  8 12:19:54 EDT 2000
    jhawk@zorkmid.mit.edu:/usr/src/sys/arch/i386/compile/ZORKMID
cpu0: family 6 model 8 step 1
cpu0: Intel Pentium Pro, II or III (686-class)
real mem  = 66650112
avail mem = 57622528
using 839 buffers containing 3436544 bytes of memory
mainbus0 (root)
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o enabled, memory enabled
pchb0 at pci0 dev 0 function 0
pchb0: Intel 82443BX Host Bridge/Controller (rev. 0x03)
ppb0 at pci0 dev 1 function 0: Intel 82443BX AGP Interface (rev. 0x03)
pci1 at ppb0 bus 1
pci1: i/o enabled, memory enabled
vga1 at pci1 dev 0 function 0: Neomagic product 0x0005 (rev. 0x20)
wsdisplay0 at vga1: console (80x25, vt100 emulation)
pcib0 at pci0 dev 7 function 0
pcib0: Intel 82371AB PCI-to-ISA Bridge (PIIX4) (rev. 0x02)
pciide0 at pci0 dev 7 function 1: Intel 82371AB IDE controller (PIIX4)
pciide0: bus-master DMA support present
pciide0: primary channel wired to compatibility mode
wd0 at pciide0 channel 0 drive 0: <TOSHIBA MK8113MAT>
wd0: drive supports 16-sector pio transfers, lba addressing
wd0: 7815MB, 16938 cyl, 15 head, 63 sec, 512 bytes/sect x 16006410 sectors
wd0: 32-bits data port
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 2
pciide0: primary channel interrupting at irq 14
pciide0: secondary channel wired to compatibility mode
pciide0: disabling secondary channel (no drives)
wd0(pciide0:0:0): using PIO mode 4, Ultra-DMA mode 2 (using DMA data transfers)
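
For anyone cross-checking the driver messages in this report: the cn/tn/sn
values appear to follow directly from the "wd0 bn" block number and the
probed geometry above (15 heads, 63 sectors per track), with the sector
number counted from 0. The following C sketch shows the arithmetic; the
struct and function names are hypothetical, not taken from the kernel.

```c
#include <stdint.h>

/* Cylinder / track (head) / sector triple, as printed by the driver. */
struct chs {
	uint32_t cn;	/* cylinder number */
	uint32_t tn;	/* track (head) number */
	uint32_t sn;	/* sector number within the track, 0-based */
};

/*
 * Convert an absolute block number to cn/tn/sn for a given geometry.
 * heads and spt (sectors per track) must both be nonzero.
 */
static struct chs
bn_to_chs(uint32_t bn, uint32_t heads, uint32_t spt)
{
	struct chs c;
	uint32_t per_cyl = heads * spt;		/* sectors per cylinder */

	c.cn = bn / per_cyl;
	c.tn = (bn % per_cyl) / spt;
	c.sn = (bn % per_cyl) % spt;
	return c;
}
```

For example, bn_to_chs(12899201, 15, 63) gives cn 13649 tn 14 sn 14, and
bn_to_chs(12695486, 15, 63) gives cn 13434 tn 5 sn 41, matching the
messages in this report. In both cases the difference between "wd0 bn" and
"fsbn" is 5444145, which I take to be the starting sector of wd0e.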

Try reading from a previously-logged bad sector:
	
# date; time dd skip=7455056 count=1 if=/dev/rwd0e of=/dev/null; date
Mon Apr 10 16:48:33 EDT 2000
wd0e:  (obsolete) reading fsbn 7251341 (wd0 bn 12695486; cn 13434 tn 5 sn 41), retrying
wd0e:  (obsolete) reading fsbn 7251341 (wd0 bn 12695486; cn 13434 tn 5 sn 41), retrying
wd0e:  uncorrectable data error reading fsbn 7251341 (wd0 bn 12695486; cn 13434 tn 5 sn 41), retrying
wd0: transfer error, downgrading to Ultra-DMA mode 1
wd0(pciide0:0:0): using PIO mode 4, Ultra-DMA mode 1 (using DMA data transfers)
wd0e:  (obsolete) reading fsbn 7251341 (wd0 bn 12695486; cn 13434 tn 5 sn 41), retrying
wd0e:  (obsolete) reading fsbn 7251341 (wd0 bn 12695486; cn 13434 tn 5 sn 41), retrying
wd0e:  (obsolete) reading fsbn 7251341 (wd0 bn 12695486; cn 13434 tn 5 sn 41)
dd: /dev/rwd0e: Input/output error
      630.95 real         4.11 user            322.28 sys
Mon Apr 10 16:59:05 EDT 2000
>Fix:
	I have no idea, and it's very frustrating. There doesn't seem to be
any bad-block remapping or marking mechanism available.
>Release-Note:
>Audit-Trail:
>Unformatted: