Subject: Re: WD_SOFTBADSECT & WD_QUIRK_FORCE_LBA48 usage ...
To: None <bouyer@antioche.eu.org>
From: None <davef1624@aol.com>
List: port-i386
Date: 10/04/2005 21:06:49
Manuel, thanks for your previous answers to my WD_SOFTBADSECT questions.
I have a few more questions for you
(my previous email is attached at the very end).

My original question:
>> We're currently using a fairly 'old' wd.c driver & 1.6 NetBSD kernel;
>> Nov 1, 2002 to be exact.
>> I'm wondering if there are any critical bug fixes (to either wd.c,
>> ata*, pciide* drivers) that might impact disk driver/subsystem
>> reliability and/or error recovery since this date?

Your reply:
> Probably, but if you don't have problems, I'm not sure why you worry :)


Actually, we are seeing several apparent reliability issues with the
IDE drives we're using.
Some of the drives develop a bad sector/block after only ~5,000 to
10,000 hours of operation.
In addition, the drive sometimes cannot spare out the bad block.
When we run the 'smartmon' diagnostics on these disks, they usually pass
the Health Check fine but fail the extended diagnostics (usually because
of repeated read errors from the disk).

Also, fsck and other system processes will repeatedly retry reading
and/or writing these bad blocks:

>kernel: pciide0:1:0: device timeout, c_bcount=8192, c_skip0
>kernel: pciide0 channel 1: reset failed for drive 0
>kernel: wd0a: device timeout reading fsbn 8288336 of 8288336-8288351 (wd0 bn 8288336; cn 8222 tn 8 sn 56), retrying
>kernel: pciide0:1:0: not ready, st=0x80, err=0x00
>kernel: wd0a: device timeout reading fsbn 8288336 of 8288336-8288351 (wd0 bn 8288336; cn 8222 tn 8 sn 56), retrying
>kernel: wd0: soft error (corrected)
>kernel: pciide0:1:0: bus-master DMA error: missing interrupt, status=0x21
>kernel: pciide0:1:0: device timeout, c_bcount=65536, c_skip0
>kernel: wd0a: device timeout reading fsbn 8343104 of 8343104-8343231 (wd0 bn 8343104; cn 8276 tn 14 sn 14), retrying
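
For anyone decoding the messages above, here is a minimal sketch that
pulls the block numbers out of one of the wd0a lines and cross-checks
them. The geometry (16 heads, 63 sectors per track, 512-byte sectors)
is an assumption, not something the logs state; it is simply the classic
translated IDE geometry, and it happens to make the logged cn/tn/sn
agree with the logged bn when sn is read as zero-based. The function
names are made up for illustration.

```python
import re

# Assumed geometry (not stated in the logs): 16 heads, 63 sectors per
# track, 512-byte sectors.
HEADS = 16
SECTORS_PER_TRACK = 63
SECTOR_SIZE = 512

# One of the error lines quoted above, with the ">kernel: " prefix dropped.
LOG = (
    "wd0a: device timeout reading fsbn 8288336 of 8288336-8288351 "
    "(wd0 bn 8288336; cn 8222 tn 8 sn 56), retrying"
)

PATTERN = re.compile(
    r"fsbn (?P<fsbn>\d+) of (?P<first>\d+)-(?P<last>\d+) "
    r"\((?P<disk>\w+) bn (?P<bn>\d+); "
    r"cn (?P<cn>\d+) tn (?P<tn>\d+) sn (?P<sn>\d+)\)"
)


def parse_wd_error(line):
    """Return the numeric fields of a wd block-error line as a dict."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a wd block-error line")
    return {k: (v if k == "disk" else int(v)) for k, v in m.groupdict().items()}


def chs_to_block(cn, tn, sn, heads=HEADS, spt=SECTORS_PER_TRACK):
    """Convert cylinder/track/sector to a block number (sn zero-based)."""
    return (cn * heads + tn) * spt + sn


def transfer_bytes(fields, sector_size=SECTOR_SIZE):
    """Size in bytes of the failed transfer, from its block range."""
    return (fields["last"] - fields["first"] + 1) * sector_size


fields = parse_wd_error(LOG)
print(chs_to_block(fields["cn"], fields["tn"], fields["sn"]))  # 8288336
print(transfer_bytes(fields))                                  # 8192
```

Under that assumed geometry the computed block equals the logged "bn",
and the 16-block range is 8192 bytes, which matches the c_bcount=8192 in
the pciide0 timeout line that precedes it.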


Therefore, I'm looking into any critical fixes that would improve our
system's resiliency to these kinds of errors; our system needs to be as
robust as possible.

There appear to be several alternatives:

1)  Use the WD_SOFTBADSECT 'automatic bad-sector list' fix, introduced
     on Apr 15, 2003 (Revision 1.241 of wd.c).
     My question concerns the following (taken from the wd(4) man page):

      > This feature does not interoperate well with the sector remapping
      > features of modern disks.
      > To let the disk remap a sector internally, the software bad sector
      > list must be flushed or disabled before.

      Can you explain this further?  How would I remap a bad sector when
      using WD_SOFTBADSECT?
      I'd like to avoid having to reboot if possible.

2)  Use the WD_QUIRK_FORCE_LBA48 feature.  Can you briefly explain this
     feature to me as well?

3)  Use RAIDframe for data mirroring; we only have one physical drive in
     the system though.
     Is it possible to use RAID to perform data mirroring onto two
     separate file-system partitions on the same drive?
     This would help to protect us from bad disk blocks on an otherwise
     working drive.
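
To make the concern in item 1 concrete, here is a toy model of how a
driver-side bad-sector list could interact with the drive's own
remapping. It is only a sketch and not the wd.c code: the class and
function names are hypothetical, and it assumes (a) that the driver
fails any request overlapping a listed range without touching the disk,
and (b) that the drive remaps a bad sector when it is rewritten. Under
those assumptions the list has to be flushed before the rewrite can
reach the drive, which is how I read the man-page sentence.

```python
class SoftBadSectorList:
    """Toy driver-side list of known-bad block ranges (inclusive)."""

    def __init__(self):
        self._ranges = []

    def record(self, first, last):
        self._ranges.append((first, last))

    def overlaps(self, first, last):
        return any(f <= last and first <= l for f, l in self._ranges)

    def flush(self):
        self._ranges.clear()


class Disk:
    """Hypothetical disk: listed sectors fail on read until rewritten."""

    def __init__(self, bad):
        self.bad = set(bad)
        self.requests = 0  # how many requests actually reached the disk

    def read(self, first, last):
        self.requests += 1
        if any(first <= b <= last for b in self.bad):
            raise OSError("unrecoverable read error")

    def write(self, first, last):
        self.requests += 1
        # Assumption: rewriting a bad sector lets the drive remap it.
        self.bad -= set(range(first, last + 1))


def driver_io(disk, badlist, op, first, last):
    """Issue a read or write through the modelled driver."""
    if badlist.overlaps(first, last):
        raise OSError("rejected by soft bad-sector list")
    try:
        getattr(disk, op)(first, last)
    except OSError:
        badlist.record(first, last)  # remember the range, stop retrying it
        raise


def attempt(disk, badlist, op, first, last):
    """Return True if the request succeeded, False if it failed."""
    try:
        driver_io(disk, badlist, op, first, last)
        return True
    except OSError:
        return False


disk, badlist = Disk(bad={8288340}), SoftBadSectorList()
print(attempt(disk, badlist, "read", 8288336, 8288351))   # False, recorded
print(attempt(disk, badlist, "write", 8288336, 8288351))  # False, blocked
badlist.flush()
print(attempt(disk, badlist, "write", 8288336, 8288351))  # True, remapped
print(attempt(disk, badlist, "read", 8288336, 8288351))   # True
```

In this model the second request never reaches the disk, which is the
benefit (no repeated retries) and also the catch (the rewrite that would
trigger remapping is blocked until the list is flushed). On a real
system the list is, as far as I know, managed from userland with
dkctl(8)'s "badsector list", "badsector flush" and "badsector retry"
subcommands, so no reboot should be needed; please check the man page
for your release.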

Thanks again for your help,
Dave

-----Original Message-----
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: davef1624@aol.com
Cc: port-i386@NetBSD.org; tech-kern@NetBSD.org
Sent: Wed, 28 Sep 2005 19:37:43 +0200
Subject: Re: WD_SOFTBADSECT usage ?

On Wed, Sep 28, 2005 at 01:52:27AM -0400, davef1624@aol.com wrote:
>
> We're currently using a fairly 'old' wd.c driver & 1.6 NetBSD kernel --
> from Nov 1, 2002 to be exact.
>
> I'm wondering if there are any critical bug fixes (to either wd.c,
> ata*, pciide* drivers) that might impact
> disk driver/subsystem reliability and/or error recovery since this date?

Probably, but if you don't have problems, I'm not sure why you worry :)

>
> One fix that I noticed was the WD_SOFTBADSECT automatic bad-sector list
> management on Apr 15, 2003
> (Revision 1.241 of wd.c).
>
> This fix appears to improve the error recovery of the disk driver by
> not attempting *repeated* reads
> on failed (unrecoverable) disk blocks.
>
> What are the tradeoffs here?  Can I safely turn on this feature?

Probably, as long as you're aware of what you need to do to remap a bad
sector.

--
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 years of experience will always make the difference
--