netbsd-bugs: Re: kern/9856: wd driver loses seriously in face of bad blocks

Subject: Re: kern/9856: wd driver loses seriously in face of bad blocks
To: Manuel Bouyer <bouyer@antioche.lip6.fr>
From: John Hawkinson <jhawk@MIT.EDU>
List: netbsd-bugs
Date: 04/13/2000 21:23:53

In message <20000414013114.A398@antioche.eu.org>, Manuel Bouyer writes:
>Ok, I changed it to : "obsolete (was address mark not found)", to make
>it clear.

Sounds good.

>> so when you see that below, that is what is meant by it. I don't really know
>> what to conclude by this error. Perhaps my drive is really ATAv3
>> and happens to support some ATAv4 features like Ultra DMA? I don't know how
>
>This is the case for most drives; manufacturers usually add new features and
>then try to include them in a standard. Ultra-DMA hardware was available
>before ATA-4 was out (and now the same is true for Ultra/66 :)
>The code guesses that if a drive supports Ultra-DMA it supports the ATA-4 spec.
>Maybe I should switch back to trusting what the disk says; it won't be right
>in all cases anyway, as some drives support only parts of ATA-4 :(

Well, perhaps it deserves a kernel printf() to observe when
they disagree? I guess I can change the #if 0 code in
wdc.c...
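
Something along these lines is what I have in mind. This is only a
sketch, not the actual wdc.c code: the function names are made up, and
it assumes the IDENTIFY data is available as an array of 16-bit words
in host byte order. It relies on word 80 (major version bitmask),
word 53 bit 2 (word 88 is valid), and the low byte of word 88
(supported Ultra DMA modes).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Highest ATA major version claimed in IDENTIFY word 80, or 0 if the
 * drive does not report one (0x0000 or 0xffff). Bit n set means ATA-n.
 */
static int ata_claimed_version(const uint16_t *id)
{
    uint16_t major = id[80];

    if (major == 0x0000 || major == 0xffff)
        return 0;
    for (int v = 14; v >= 1; v--)
        if (major & (1u << v))
            return v;
    return 0;
}

/* Nonzero if word 88 is marked valid and lists any Ultra DMA mode. */
static int ata_has_udma(const uint16_t *id)
{
    return (id[53] & 0x0004) && (id[88] & 0x00ff);
}

/*
 * Nonzero when the drive advertises Ultra DMA but claims a version
 * older than ATA-4. A drive that reports no version at all is not
 * counted as a mismatch. Prints a note when the two disagree.
 */
static int ata_version_mismatch(const uint16_t *id)
{
    assert(id != NULL);

    int claimed = ata_claimed_version(id);
    int mismatch = ata_has_udma(id) && claimed != 0 && claimed < 4;

    if (mismatch)
        printf("drive claims ATA-%d but supports Ultra DMA\n", claimed);
    return mismatch;
}
```

In the kernel the printf() would be the kernel one, of course, and
the check would go wherever the IDENTIFY data is parsed.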

>> to debug this. Perhaps CFA REQUEST EXTENDED ERROR CODE could/should be used?
>
>I'll look at this, but later (when I add disconnect/reselect and tagged
>command queuing). The problem is that this command is optional, and
>I don't have a good way to deal with this yet.

Ah, OK. 

>It should have timed out after 10s. Can you reproduce this reliably?

Yes, if I try to dd a single block from the drive in
a quiescent state (e.g. boot -s; dd if=/dev/rwd0e skip=NNNNNN count=1 of=/dev/null)

>In the trace below there doesn't appear to be a timeout.
>Actually I think I've found a bug in kern_clock.c that could explain
>this kind of behavior ...

Depends on which trace you look at ;-) See the one immediately
following "Try reading from a previously-logged bad sector" under
How-To-Repeat. No timeout is logged.


>A "lost interrupt" is always handled as a DMA error. I think the drive should
>have issued an IRQ anyway here. I'm not sure this is the right behavior but
>it's pretty easy to fix: please try this:
>
>--- ata_wdc.c.old	Fri Apr 14 01:27:50 2000
>+++ ata_wdc.c	Fri Apr 14 01:27:53 2000

OK, I'll give it a shot.

>> 	I have no idea, and it's very frustrating. There doesn't seem to be
>> any bad-block remapping or marking mechanism available.
>
>No. Modern IDE disks should auto-remap bad blocks, just like SCSI.
>I guess the bad block table of your disk is full (this is where supporting
>SMART would be useful :)

I looked at the SMART frobs in the ATA4 spec and it wasn't clear to me
that it gave you any useful data at all. Perhaps I was misunderstanding
it, but there were lots of commands to enable/disable SMART but
not anything that I saw that gave useful data on "how many error events"
or any such.
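
As far as I can tell, the only result the spec itself defines is the
pass/fail answer from SMART RETURN STATUS, which comes back as a
signature in the Cylinder Low/High registers rather than as any kind
of count. A rough sketch of decoding it, assuming the two register
values have already been read back after the command completes (the
enum and function name are mine, not from any driver):

```c
#include <stdint.h>

enum smart_status {
    SMART_OK,                  /* no threshold exceeded */
    SMART_THRESHOLD_EXCEEDED,  /* drive predicts failure */
    SMART_UNKNOWN              /* neither signature matched */
};

/*
 * Decode the Cylinder Low / Cylinder High values returned by
 * SMART RETURN STATUS: 4Fh/C2h means OK, F4h/2Ch means a threshold
 * was exceeded. Anything else is treated as unknown.
 */
static enum smart_status smart_decode_status(uint8_t cyl_lo, uint8_t cyl_hi)
{
    if (cyl_lo == 0x4f && cyl_hi == 0xc2)
        return SMART_OK;
    if (cyl_lo == 0xf4 && cyl_hi == 0x2c)
        return SMART_THRESHOLD_EXCEEDED;
    return SMART_UNKNOWN;
}
```

That gets you a single yes/no bit, which is not much help for
judging how full a remap table is.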

--jhawk