Subject: Re: Bad sectors vs RAIDframe
To: Thor Lancelot Simon <tls@rek.tjls.com>
From: Daniel Carosone <dan@geek.com.au>
List: tech-kern
Date: 06/07/2005 07:29:56

On Mon, Jun 06, 2005 at 01:06:19PM -0400, Thor Lancelot Simon wrote:
> On Mon, Jun 06, 2005 at 11:44:51AM -0500, J Chapman Flack wrote:
> > Thor Lancelot Simon wrote:
> > > Most IDE drives only spare out sectors on *write* (one must ask: what,
> > > exactly, could they do to avoid presenting a read error on read -- and
> >
> > If the question wasn't rhetorical, I think the answer is, it's a "read error"
> > if the drive had to apply ECC to recover the correct data; then it reassigns
> > the block, writes the recovered data to the new block, and returns the
> > recovered data to the host.
>
> Right, so, there are two problems here.
>
> First, even if some errors are correctable with ECC, some aren't.  Is it
> correct for the drive to automatically spare out on an _uncorrectable_
> error?  If it does so, and the host retries the read, it will get back
> a block full of zeroes -- which will cause a particularly ugly kind of
> data corruption in a parity RAID setup.
>
> Given the limited error-reporting semantics available to IDE or SATA
> disks, it's probably actually correct for them to report error and not
> spare the sector in this case.

This is (from my observations) exactly what happens.  Drives can remap
on read only when the data is recoverable via ECC; if not, they will
report uncorrectable read errors - they will never "make up" data and
pretend it is correct.  It doesn't matter whether the drive is in a
RAID or not (except that the host/controller has a chance to find or
calculate another copy of the data, external to the drive, of course).

This is one good reason for doing "patrol reads": dd'ing each disk to
/dev/null regularly, for example.  Normally, taking regular backups is
also a good way to ensure that all (active) sectors are readable and
to trigger remaps on marginal sectors[*], but in a mirrored RAID this
may not achieve what's needed.  Some RAID controllers try to minimise
seek distances by splitting the set and biasing reads for the front
half of the disk to one mirror and the rear half to the other, which
risks leaving half of each disk unread or rarely read.  You need to
read the submirrors themselves.  With RAIDframe, doing a parity
check/rewrite also achieves this.
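
Roughly what I have in mind, as an (untested) Python sketch; the raw
device names and chunk size are just examples, and plain dd of each
raw device does the same job:

#!/usr/bin/env python
# Patrol-read sketch: read every sector of each raw disk so the drive
# gets a chance to ECC-correct (and remap) marginal sectors before
# they decay into unreadable ones.  Device names are examples only.
import sys

DISKS = ["/dev/rwd0d", "/dev/rwd1d"]   # raw devices of the submirrors
CHUNK = 1024 * 1024                    # read size; multiple of the sector size

def patrol(path):
    errors = 0
    offset = 0
    with open(path, "rb", buffering=0) as disk:
        while True:
            try:
                data = disk.read(CHUNK)
            except OSError as e:
                # Unreadable spot: log it, skip one chunk, carry on.
                sys.stderr.write("%s: read error near byte %d: %s\n"
                                 % (path, offset, e))
                errors += 1
                offset += CHUNK
                disk.seek(offset)
                continue
            if not data:
                break                  # end of the device
            offset += len(data)
    return errors

for d in DISKS:
    print("%s: %d unreadable chunk(s)" % (d, patrol(d)))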

On a write, nothing cares any longer about the original, unreadable
contents of the sector, so the drive is free to remap.  But, as I've
observed previously, I'm convinced that drives don't always do this
unless the write cache is off - so they may write something new to
the sector and *not notice* that it will be unreadable next time.

And yes, it would be nice if RF would implement 'sector repair', or
simply have the ability to fail and repair individual stripes/parity
units, rather than whole drives, since (as Thor knows only too well)
you often get sectors failing in different spots on multiple drives.

I've found that this tends to happen a lot on new disks, where the
drive hasn't yet had a chance to test each sector properly.  I run a
few passes over each new disk, writing random data with the write
cache off, to exercise/exorcise marginal sectors like this, and it
really seems to help.
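
For what it's worth, a sketch of that kind of burn-in pass, again in
(untested) Python; the device name and pass count are examples, it
assumes the write cache has already been turned off, and of course it
destroys whatever is on the disk:

#!/usr/bin/env python
# Burn-in sketch for a brand-new disk: write random data over the
# whole device a few times so marginal sectors get exercised (and
# remapped by the drive) before the disk holds real data.  WIPES THE DISK.
import os

DISK   = "/dev/rwd2d"     # raw device of the new, empty disk (example)
CHUNK  = 1024 * 1024      # write size; multiple of the sector size
PASSES = 3                # a few passes over the whole surface

for p in range(PASSES):
    written = 0
    with open(DISK, "wb", buffering=0) as disk:
        while True:
            try:
                n = disk.write(os.urandom(CHUNK))
            except OSError:
                break     # hit the end of the device (or a write error)
            if not n:
                break
            written += n
    print("pass %d: wrote %d MB" % (p + 1, written // (1024 * 1024)))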

> The other problem is that some IDE drive firmware is so cheap that it
> knows only two states for sectors: okay or error.  So on such drives
> it seems to be the case that you're guaranteed that sectors that go
> bad, ever at all, will stay that way until you force sparing by
> writing them back yourself.

NetBSD's wd driver also has this facility (a bad-sector list kept by
the driver), and it was enabled by default at one time (no longer),
in an effort to stop the host grinding to a halt retrying reads that
were never going to succeed.  Unfortunately, it wasn't smart enough
to let writes through and clear the bad-sector flag so the sector
could be tried again.

--
Dan.

[*] I gather that it's not really the 'image' on the platter surface
that degrades over time, so much as the ability of the GMR heads to
read it; either way, the net effect is the same: sectors that were
once readable are no longer readable some time later.  If enough time
passes between reads for a sector to go from 'fine' to 'unreadable'
without an intervening 'marginal, remap this sector while we still
can', you're in trouble.

Note also that 'marginal' is a threshold; many drives rely on ECC
reconstruction even for normal reads. You can see this in SMART stats.
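
If you have smartmontools installed, something like this (untested
Python, example device path) pulls out the counters I mean; the
attribute names vary between vendors:

#!/usr/bin/env python
# Peek at the SMART attributes relevant here: how often the drive has
# leaned on ECC, and how many sectors it has reallocated or is waiting
# to reallocate.  Requires smartmontools.
import subprocess

DEVICE = "/dev/rwd0d"     # example path; adjust for your system
WATCH  = ("Raw_Read_Error_Rate", "Hardware_ECC_Recovered",
          "Reallocated_Sector_Ct", "Current_Pending_Sector")

output = subprocess.check_output(["smartctl", "-A", DEVICE]).decode()
for line in output.splitlines():
    if any(name in line for name in WATCH):
        print(line)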
