current-users: IDE bad sector remapping vs. write-cache

Subject: IDE bad sector remapping vs. write-cache
To: None <current-users@netbsd.org>
From: Daniel Carosone <dan@geek.com.au>
List: current-users
Date: 08/17/2003 17:29:46

Modern disks are supposed to automatically remap bad sectors when
an error occurs: either after a read error that is successfully
retried, or on write. Fully bad sectors that can't be read won't
get remapped, because the drive doesn't know what to put in the
replacement sector - these sectors have to wait until they're
overwritten with new data to be remapped.

However, it seems that (some?) IDE disks may not do bad sector
remapping on write if their write-cache is enabled.

I had a disk report a few bad (unreadable) sectors during backups
yesterday. Once I had made sure I had all the data I needed, I
decided to overwrite the disk to see if I could force the bad
sectors to be remapped.

I cleared the in-kernel badsector list with dkctl before each pass.

I dd'd /dev/zero over the partition (no write errors were reported),
and then tried to read it back to check that the sectors had been
remapped.  I still got errors for the same sectors.

I dd'd again, same result.

Because I'd been playing with dkctl, I decided to see if disabling
the write cache on the drive made any difference. I dd'd over the
disk again, and this time the entire disk read back perfectly!
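
For anyone wanting to repeat the passes described above, they can be
sketched roughly as the dry-run script below. It only prints the
commands for one overwrite/read-back pass; nothing is executed. The
names wd1 and /dev/rwd1d are placeholders, not the devices used in
the original test, and the dkctl subcommands shown (badsector flush,
badsector list, setcache) are assumed to match dkctl(8) on your
NetBSD release, so check the man page first. The function name
print_pass is made up for illustration.

```sh
#!/bin/sh
# Dry-run sketch: prints the commands for one overwrite/read-back pass.
# Nothing here touches a disk. If you run the printed commands by hand,
# the dd write is DESTRUCTIVE and erases everything on the raw device.

# print_pass DISK RAWDEV CACHE
#   DISK   - disk name as given to dkctl, e.g. wd1 (placeholder)
#   RAWDEV - raw device to overwrite, e.g. /dev/rwd1d (placeholder)
#   CACHE  - dkctl setcache mode; assumed: "rw" = write cache on,
#            "r" = read cache only (write cache off)
print_pass() {
    disk=$1
    rawdev=$2
    cache=$3
    echo "dkctl $disk badsector flush"        # clear in-kernel bad sector list
    echo "dkctl $disk setcache $cache"        # select the drive cache mode
    echo "dd if=/dev/zero of=$rawdev bs=64k"  # overwrite every sector
    echo "dkctl $disk badsector flush"        # clear the list again before reading
    echo "dd if=$rawdev of=/dev/null bs=64k"  # read back; watch for I/O errors
    echo "dkctl $disk badsector list"         # show sectors that failed on read
}

# Pass 1: write cache enabled. Pass 2: write cache disabled.
print_pass wd1 /dev/rwd1d rw
print_pass wd1 /dev/rwd1d r
```

If the behaviour described here holds for your drive, the read-back
in the first pass would still report the same bad sectors, and the
read-back in the second pass would complete without errors.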

I restored filesystems to it, compared them against the backups,
wrote and read large files, re-read the entire disk surface, and
generally did a bunch of testing looking for more bad sectors.
Couldn't provoke a single error.

So, just for giggles, I dug out an older disk that failed a year
or so ago.  At the time, I'd tried the same trick to trigger
remapping without success - the disk would write out fine, but not
read back. On this other disk, the read errors were really audible:
the drive would make horrid clacking and grinding noises while
retrying the bad sectors.

I re-did the first test, with write-cache on. Same result as a year
ago.  I turned off the write-cache, rewrote it, and this disk is
also now working perfectly!

The disks are from different manufacturers and of different types:

  Model: ST340016A, Rev: 3.19
  Model: IBM-DTLA-307030, Rev: TX4OA60A

Does anyone have any knowledge to add? Is this a deliberate or
known "feature"?

It's hard to test, unless you have a disk that's already failed. If
you do, please try this test. (You still may not want to trust the
disk, but I'd love to know if you can replicate this behaviour).

--
Dan.