Subject: Re: kern/9856: wd driver loses seriouslly in face of bad blocks
To: None <jhawk@MIT.EDU>
From: Manuel Bouyer <bouyer@antioche.lip6.fr>
List: netbsd-bugs
Date: 04/14/2000 01:31:14
On Mon, Apr 10, 2000 at 05:31:41PM -0400, jhawk@MIT.EDU wrote:
>
> The wd driver misbehaves rather spectacularly in the face of
> bad blocks. There are a number of problems.
>
> 1) The drive returns atapi error number 1 on some reads.
> According to the ATAv4 draft I was able to find, this indicates
> "obsolete". sys/dev/ata.c's atapi_errno() prints a null string
> for this error, rather than anything useful. This results in:
>
> wd0e: reading fsbn ...
>
> which is confusing. I've patched my kernel to report it as "(obsolete)"
OK, I changed it to "obsolete (was address mark not found)" to make
it clear.
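For illustration, here is a minimal sketch of decoding the error register bit by bit so that no bit ends up printing an empty string. It is not the driver's actual code: decode_ata_error() and the table are hypothetical, and apart from the "obsolete" and "uncorrectable data error" strings the labels are just the traditional ATA bit names in my own wording.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Labels for the 8 bits of the ATA error register, bit 0 first.
 * Bit 0 (address mark not found) is marked obsolete in ATA-4.
 * Hypothetical table, not the one in the driver.
 */
static const char *const err_names[8] = {
    "obsolete (was address mark not found)",
    "track 0 not found",
    "aborted command",
    "media change requested",
    "id not found",
    "media changed",
    "uncorrectable data error",
    "CRC error / bad block detected",
};

/*
 * Write a comma-separated list of the bits set in err into buf.
 * Returns the number of characters written (excluding the NUL).
 * A name that does not fit is dropped rather than printed partially.
 */
size_t decode_ata_error(unsigned char err, char *buf, size_t len)
{
    size_t used = 0;
    int bit, n;

    if (len == 0)
        return 0;
    buf[0] = '\0';
    for (bit = 0; bit < 8; bit++) {
        if (!(err & (1u << bit)))
            continue;
        n = snprintf(buf + used, len - used, "%s%s",
            used ? ", " : "", err_names[bit]);
        if (n < 0 || (size_t)n >= len - used) {
            buf[used] = '\0';   /* discard the truncated name */
            break;
        }
        used += (size_t)n;
    }
    return used;
}
```

With this, an error value of 1 decodes to "obsolete (was address mark not found)" instead of nothing.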
> so when you see that below, that is what is meant by it. I don't really know
> what to conclude by this error. Perhaps my drive is really ATAv3
> and happens to support some ATAv4 features like Ultra DMA? I don't know how
This is the case for most drives; manufacturers usually add new features and
then try to get them included in a standard. Ultra-DMA hardware was available
before ATA-4 was out (and now the same is true for Ultra/66 :)
The code guesses that if a drive supports Ultra-DMA, it supports the ATA-4 specs.
Maybe I should switch back to trusting what the disk says; it won't be right
in all cases anyway, as some drives support only parts of ATA-4 :(
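The two policies can be sketched as follows. This is only an illustration: struct drive_caps, enum version_policy and effective_ata_version() are hypothetical names, not anything in the driver.

```c
#include <stdbool.h>

/* Hypothetical summary of what a drive reports about itself. */
struct drive_caps {
    int  reported_ata_version;  /* highest ATA version claimed, 0 if none */
    bool supports_udma;         /* drive advertises an Ultra-DMA mode */
};

enum version_policy {
    GUESS_FROM_UDMA,    /* Ultra-DMA capable implies ATA-4 */
    TRUST_DRIVE         /* believe the reported version as-is */
};

/* Return the ATA version the error decoding should assume. */
int effective_ata_version(const struct drive_caps *c, enum version_policy p)
{
    if (p == GUESS_FROM_UDMA && c->supports_udma &&
        c->reported_ata_version < 4)
        return 4;
    return c->reported_ata_version;
}
```

A drive that claims ATA-3 but does Ultra-DMA is treated as ATA-4 under the first policy and as ATA-3 under the second; neither is right for a drive that implements only parts of ATA-4.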
> to debug this. Perhaps CFA REQUEST EXTENDED ERROR CODE could/should be used?
I'll look at this, but later (when I add disconnect/reselect and tagged
command queuing). The problem is that this command is optional, and
I don't have a good way to deal with this yet.
>
> 2) The wd driver can take *forever* to timeout. Where forever == minutes.
> Please see below (How-to-Repeat) for an example of a 1-block read
> that took 11 minutes to fail. Note that the kernel printfs regarding
> disk errors did not take place until 9 minutes into it. Specifically,
> 9m6s and 9m14s for the (obsolete) errors, 9m23s for the
> uncorrectable+downgrade, and 9m35s for the next (obsolete).
>
> This is pretty hokey.
It should have timed out after 10s. Can you reproduce this reliably?
In the trace below there doesn't appear to be a timeout.
Actually I think I've found a bug in kern_clock.c that could explain
this kind of behavior ...
>
> 3) The driver seems to spend a lot of time in each read, and then
> downgrading the transfer mode, and then rereading the same blocks.
> I don't have precise timings on this one, but I earlier had a case
> where it went through ~30 seconds each trying to read a single
> block and getting 3 (obsolete) errors, then downgraded to Ultra-DMA1,
> tried twice with (obsolete) errors, then saw:
>
> pciide0:0:0: lost interrupt
> type: ata
> c_bcount: 37888
> c_skip: 19456
> wd0e: device timeout reading fsbn 7455094 of 7455056-7455167 (wd0 bn 12899239; cn 13649 tn 14 sn 52)
> wd0e: uncorrectable data error reading fsbn 7455056 of 7455056-7455167 (wd0 bn 12899201; cn 13649 tn 14 sn 14), retrying
> wd0e: uncorrectable data error reading fsbn 7455056 of 7455056-7455167 (wd0 bn 12899201; cn 13649 tn 14 sn 14), retrying
>
> then downgraded to PIO mode 4.
>
> It doesn't seem like the downgrades were ever necessary or appropriate
> (this is not a DMA problem, it is a physical problem, presumably), yet
> they happened regardless and took a long time to effect themselves.
A "lost interrupt" is always handled as a DMA error. I think the drive should
have issued an IRQ anyway here. I'm not sure this is the right behavior, but
it's pretty easy to fix; please try this:
--- ata_wdc.c.old Fri Apr 14 01:27:50 2000
+++ ata_wdc.c Fri Apr 14 01:27:53 2000
@@ -498,7 +498,9 @@
}
if (drv_err != WDC_ATA_ERR)
goto end;
- ata_dmaerr(drvp);
+ if (drvp->drive_flags & DRIVE_UDMA &&
+ (ata_bio->r_error & WDCE_CRC))
+ ata_dmaerr(drvp);
}
/* if we had an error, end */
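The condition this patch introduces can be read in isolation as the sketch below: count an error against the DMA mode only when the drive is running Ultra-DMA and the error register has the CRC bit set. should_count_dma_error() is a hypothetical helper, and the SK_* values are stand-ins for the driver's DRIVE_UDMA and WDCE_CRC definitions (the flag value in particular is an assumption).

```c
#include <stdbool.h>

/* Stand-in values for illustration; the real ones are in the driver headers. */
#define SK_DRIVE_UDMA 0x0020    /* assumed flag bit: drive is using Ultra-DMA */
#define SK_ERR_CRC    0x80      /* error register bit 7: interface CRC error */

/*
 * Decide whether an ATA error should be charged to the DMA path.
 * A media error (e.g. uncorrectable data) on an Ultra-DMA drive does
 * not qualify; only a CRC error on the interface does.
 */
bool should_count_dma_error(unsigned int drive_flags, unsigned char r_error)
{
    return (drive_flags & SK_DRIVE_UDMA) != 0 &&
        (r_error & SK_ERR_CRC) != 0;
}
```

With this check, an uncorrectable data error or the "obsolete" error on an Ultra-DMA drive no longer triggers a transfer mode downgrade.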
> >Fix:
> I have no idea, and it's very frustrating. There doesn't seem to be
> any bad-block remapping or marking mechanism available.
No. Modern IDE disks should auto-remap bad blocks, just like SCSI.
I guess the bad block table of your disk is full (this is where supporting
SMART would be useful :)
--
Manuel Bouyer <bouyer@antioche.eu.org>
--