port-i386: WD_SOFTBADSECT usage ?

Subject: WD_SOFTBADSECT usage ?
To: None <bouyer@antioche.eu.org>
From: None <davef1624@aol.com>
List: port-i386
Date: 09/28/2005 01:52:27
We're currently using a fairly old wd.c driver and NetBSD 1.6 kernel,
from Nov 1, 2002 to be exact.

I'm wondering if there have been any critical bug fixes since that date
(to the wd.c, ata*, or pciide* drivers) that might affect disk
driver/subsystem reliability or error recovery?

One fix that I noticed was the WD_SOFTBADSECT automatic bad-sector list
management, added on Apr 15, 2003 (revision 1.241 of wd.c).

This fix appears to improve the disk driver's error recovery by not
attempting *repeated* reads of failed (unrecoverable) disk blocks.
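
To make the idea concrete, here is a minimal sketch of a software
bad-sector list as I understand it. This is not the wd.c code: the names
(badsect_list, badsect_add, badsect_hit, badsect_flush) and the fixed
64-entry cap are assumptions made for illustration only. Ranges that
failed with an unrecoverable error are remembered, and a later request
that overlaps one of them can be failed immediately instead of being
sent to the disk again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BADSECT_MAX 64  /* hypothetical cap on remembered ranges */

/* One remembered range of bad blocks, both ends inclusive. */
struct badsect_range {
	uint64_t start;
	uint64_t end;
};

struct badsect_list {
	struct badsect_range r[BADSECT_MAX];
	size_t count;
};

/* Remember a range that failed with an unrecoverable error.
 * Returns false if the range is empty or the list is full. */
bool badsect_add(struct badsect_list *l, uint64_t start, uint64_t nblks)
{
	assert(l != NULL);
	if (nblks == 0 || l->count >= BADSECT_MAX)
		return false;
	l->r[l->count].start = start;
	l->r[l->count].end = start + nblks - 1;
	l->count++;
	return true;
}

/* True if the request [blkno, blkno + nblks) overlaps a remembered
 * range, i.e. the request could be failed without touching the disk. */
bool badsect_hit(const struct badsect_list *l, uint64_t blkno, uint64_t nblks)
{
	assert(l != NULL);
	if (nblks == 0)
		return false;
	uint64_t last = blkno + nblks - 1;
	for (size_t i = 0; i < l->count; i++) {
		if (blkno <= l->r[i].end && last >= l->r[i].start)
			return true;
	}
	return false;
}

/* Forget all remembered ranges, so that later requests reach the disk
 * again (for example, to give the drive a chance to remap a sector). */
void badsect_flush(struct badsect_list *l)
{
	assert(l != NULL);
	l->count = 0;
}
```

In this sketch, a block stays "bad" until the list is flushed, even if
the drive has since remapped it, which is the kind of tradeoff I'm
asking about.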

What are the tradeoffs here?  Can I safely turn on this feature?
I can see from the wd(4) man-page, under *BUGS*:

The optional software bad sector list does not interoperate well with
sector remapping features of modern disks.  To let the disk remap a
sector internally, the software bad sector list must be flushed or
disabled before.

Thanks for your help,
Dave F


-----Original Message-----
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: davef1624@aol.com
Cc: port-i386@NetBSD.org; tech-kern@NetBSD.org
Sent: Wed, 24 Aug 2005 00:02:00 +0200
Subject: Re: wd0 intermittent disk errors (correctable soft-errors, DMA 
error: missing interrupt, etc.)

  On Tue, Aug 23, 2005 at 05:25:22PM -0400, davef1624@aol.com wrote:
>
> I have a 2 GHz, Pentium-4 based system, using 40 GB Hitachi Travelstars
> IDE disks.
>
> We are seeing the following errors intermittently on the system:
>
> wd0a: error reading fsbn 512864 of 512864-512991 (wd0 bn 512864; cn 508
> tn 12 sn 44), retrying
> wd0: (aborted command, interface CRC error)
> wd0: soft error (corrected)

This is harmless as long as it doesn't occur often. It means that
the data got corrupted during a transfer on the IDE bus, and that this
was detected by the Ultra-DMA CRC check (in this case the driver just
redoes the transfer). Occasional CRC errors are expected on Ultra-DMA
IDE buses; the bus just can't do reliable data transmission at this speed
(PATA Ultra-DMA could be called a hardware hack :)
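
Roughly, the retry behaviour looks like the sketch below. This is an
illustration only, not the actual driver code: the names (xfer_status,
xfer_with_retry, flaky_bus) and the retry limit are made up, and
flaky_bus is just a stand-in for the hardware. A transfer that fails
with a CRC error is simply issued again; if a retry succeeds, you get
the "soft error (corrected)" case.

```c
#include <assert.h>
#include <stddef.h>

/* Possible outcomes of one transfer attempt (hypothetical names). */
enum xfer_status { XFER_OK, XFER_CRC_ERROR, XFER_FATAL };

/* One attempt at a transfer; ctx is opaque state for the callback. */
typedef enum xfer_status (*xfer_fn)(void *ctx);

struct xfer_result {
	enum xfer_status status;  /* status of the last attempt */
	int retries;              /* attempts made after the first one */
};

/* Issue a transfer and redo it while it fails with a CRC error, up to
 * max_retries extra attempts. Any other outcome is returned at once. */
struct xfer_result xfer_with_retry(xfer_fn do_xfer, void *ctx, int max_retries)
{
	struct xfer_result res = { XFER_FATAL, 0 };

	assert(do_xfer != NULL);
	for (;;) {
		res.status = do_xfer(ctx);
		if (res.status != XFER_CRC_ERROR)
			return res;   /* success, or an error we don't retry */
		if (res.retries >= max_retries)
			return res;   /* give up: too many CRC errors */
		res.retries++;
	}
}

/* Stand-in for the bus: reports a CRC error *ctx times, then succeeds. */
enum xfer_status flaky_bus(void *ctx)
{
	int *failures_left = ctx;

	if (*failures_left > 0) {
		(*failures_left)--;
		return XFER_CRC_ERROR;
	}
	return XFER_OK;
}
```

With one injected CRC error and a limit of 5, xfer_with_retry(flaky_bus,
...) returns XFER_OK with retries == 1, which corresponds to an
occasional, harmless corrected error.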

>
> In addition, we sometimes see the following disk/driver errors:
>
> pciide0:1:0: bus-master DMA error: missing interrupt, status=0x20
> pciide0:1:0: device timeout, c_bcount=8192, c_skip0
> pciide0 channel 1: reset failed for drive 0
> wd0a: device timeout writing fsbn 8236512 of 8236512-8236527 (wd0 bn
> 8236512; cn 8171 tn 2 sn 18), retrying
> pciide0:1:0: not ready, st=0x80, err=0x00
> pciide0 channel 1: reset failed for drive 0
> wd0a: device timeout writing fsbn 8236512 of 8236512-8236527 (wd0 bn
> 8236512; cn 8171 tn 2 sn 18), retrying
> pciide0:1:0: not ready, st=0x80, err=0x00
> wd0a: device timeout writing fsbn 8236512 of 8236512-8236527 (wd0 bn
> 8236512; cn 8171 tn 2 sn 18), retrying

This is more serious: it means the drive is stalled and doesn't
even honor the reset signal. I guess the drive doesn't recover from this?
Maybe it's a drive firmware issue, maybe it's just dying ...

I've seen this on occasion on sparc64 systems, where I suspect it's a
read/write reordering issue on that platform. But I've never seen it on PCs.

--
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 years of experience will always make the difference
--