Subject: Re: Funny -> ATA drive read error
To: Charles M. Hannum <abuse@spamalicious.com>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: netbsd-users
Date: 06/04/2004 22:27:42
On Fri, Jun 04, 2004 at 06:37:42PM +0000, Charles M. Hannum wrote:
> On Friday 04 June 2004 16:37, Manuel Bouyer wrote:
> > On Fri, Jun 04, 2004 at 03:24:59PM +0000, Charles M. Hannum wrote:
> > > This is a relatively new "feature," which I am likely to remove soon,
> > > because it causes exactly the problem you mentioned.  (It also had
> > > another serious bug that I fixed a few days ago -- it caused I/O to
> > > *other* blocks to return EIO.)
> >
> > Note that I'm not the one who implemented this.
> > The fact that it prevents writes is not a feature, it's a bug in the way
> > it's implemented (the test for read is misplaced).
> 
> No, it's not anywhere near that simple.
> 
> 1) It would need to *remove* the bad block entry when a sector is rewritten.

Didn't I mention this in a previous mail? Maybe that was on another
list.

> 
> 2) Right now, it can mark a large range (up to MAXPHYS) as "bad," but in 
> reality it may only be one sector that's bad.  This is partly because the 
> code was changed a while back to only switch to single-sector I/O after 
> multiple errors.  It can also mark too *small* a range as "bad," thereby 
> causing it to miss the entry later.

Hmm, I think it did the right thing at one time. But a code path that is not
used often tends to degenerate into broken code.
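
To make points 1 and 2 concrete, here is a minimal sketch in C of how such a
bad-range list could behave: reads are tested by overlap rather than exact
match (so an entry that is larger or smaller than the request is still
found), and a write is never refused and clears any entry it touches. All
names (bad_range, bad_list_add, and so on) are hypothetical and chosen for
illustration; this is not the actual driver code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical structures for illustration, not the real driver's. */
struct bad_range {
	uint64_t start;			/* first sector of the range */
	uint64_t count;			/* number of sectors, > 0 */
	struct bad_range *next;
};

struct bad_list {
	struct bad_range *head;
};

/* True if [a, a+an) and [b, b+bn) share at least one sector. */
bool
ranges_overlap(uint64_t a, uint64_t an, uint64_t b, uint64_t bn)
{
	return a < b + bn && b < a + an;
}

/* Record a range whose read failed. Returns 0 or ENOMEM. */
int
bad_list_add(struct bad_list *bl, uint64_t start, uint64_t count)
{
	struct bad_range *br = malloc(sizeof(*br));

	if (br == NULL)
		return ENOMEM;
	br->start = start;
	br->count = count;
	br->next = bl->head;
	bl->head = br;
	return 0;
}

/*
 * Reads only: return EIO if any requested sector lies in a recorded
 * range. The overlap test is what keeps an entry from being missed
 * when the request size differs from the size that was recorded.
 */
int
bad_list_check_read(const struct bad_list *bl, uint64_t start, uint64_t count)
{
	const struct bad_range *br;

	for (br = bl->head; br != NULL; br = br->next)
		if (ranges_overlap(br->start, br->count, start, count))
			return EIO;
	return 0;
}

/*
 * Call after a successful write: drop every entry the write touched.
 * Dropping the whole entry is the simple choice; if part of the range
 * is still bad, the drive reports the error again on the next read.
 */
void
bad_list_note_write(struct bad_list *bl, uint64_t start, uint64_t count)
{
	struct bad_range **pp = &bl->head;

	while (*pp != NULL) {
		struct bad_range *br = *pp;

		if (ranges_overlap(br->start, br->count, start, count)) {
			*pp = br->next;
			free(br);
		} else {
			pp = &br->next;
		}
	}
}
```

The linear list is deliberate: it matches the assumption that only a handful
of entries ever exist. It does nothing about the over-wide ranges from
point 2; that would need the error path to narrow the range to the failing
sector before calling bad_list_add().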

> 
> 3) It doesn't scale at all.  It doesn't even *try* to scale.

No. Why should it try? Once you have more than a few tens of bad blocks on a
disk, you're going to do something about it, aren't you?

> 
> 4) As I've said before, the drive does this kind of defect management itself 
> -- and generally much better.  The only point I see here is to work around 
> the fact that the driver will wedge in busy-wait loops and cause the system 
> to freeze up when it's trying to access a bad block (that is, the drive is 
> doing defect management).  This is a bug and should be fixed, but not in this 
> way.

This should be much better now that these heavy busy-waits are done in a
kernel thread. The remaining problem I can see is that the bus would still
be hung while we're waiting for the disk.

> 
> > I don't remember the exact details that caused this to be implemented; you
> > should probably ask the author.
> >
> > > Also, I recently (a few days ago) eliminated the downgrading of transfer
> > > modes on most errors -- it's pointless, and there's also no way to
> > > recover from that without rebooting.
> >
> > I've seen on many occasions, with different hardware, that downgrading
> > would cause the errors to disappear (even "ID not found" or "uncorrectable
> > data error" types of errors). I agree the hardware was flaky,
> > but downgrading at least allowed the install to complete.
> 
> I sincerely doubt that downgrading the transfer mode is actually what "fixed" 

It is. I confirmed this in some cases with hardwired modes in the kernel.
One theory is that the disk draws less power when running slower, so if
the power supply is too weak it helps (I have fixed problems like this with
a stronger power supply).

> it.  I also did a web search and a PR search, and could find no evidence of 
> cases like this -- although that's not conclusive.
> 
> The point remains that downgrading on an actual bad block -- especially on 
> something like a DVD-ROM -- is just plain wrong.  Downgrading will not fix 

Hmm, would it downgrade on a DVD-ROM bad block? For ATAPI devices
the downgrade should only happen for CRC or protocol errors. If that's
not the case, there's a problem here.
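
As a sketch of the rule stated above for ATAPI devices (the enum and
function names are hypothetical, not taken from the driver), the decision
could look like this:

```c
#include <stdbool.h>

/* Hypothetical error classes for illustration only. */
enum xfer_error {
	XFER_ERR_CRC,		/* CRC error on the interface */
	XFER_ERR_PROTOCOL,	/* protocol error during the transfer */
	XFER_ERR_MEDIUM		/* medium error, e.g. a bad block on a disc */
};

/*
 * For ATAPI devices, downgrade the transfer mode only on CRC or
 * protocol errors. A medium error says nothing about the link, so a
 * slower mode would not help.
 */
bool
atapi_should_downgrade(enum xfer_error err)
{
	switch (err) {
	case XFER_ERR_CRC:
	case XFER_ERR_PROTOCOL:
		return true;
	default:
		return false;
	}
}
```

With this rule a bad block on a DVD-ROM falls into the default case and
leaves the transfer mode alone.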

> it, and now you've completely screwed performance until you reboot.

I have plans for a tool to control this from userland. I just haven't
had the time to look at it seriously yet.

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 years of experience will always make the difference
--