port-sparc64: Re: more info about "wd0: (uncorrectable data error)"

Subject: Re: more info about "wd0: (uncorrectable data error)"
To: None <port-sparc64@netbsd.org>
From: Miles Nordin <carton@Ivy.NET>
List: port-sparc64
Date: 01/06/2007 19:57:45

>>>>> "jc" == Joel CARNAT <joel@carnat.net> writes:

    jc> where: 

    jc> - Windows just goes blue screen 

    jc> - Linux froze the shell where the access to the disk was
    jc> started (cd, cp, or such)

    jc> - NetBSD dropped thousand of messages

IMHO, if the error is in a data block the kernel should return EIO to
whatever was reading.  If the error is in metadata it should remount
the filesystem read-only and return EIO as needed.  If the error is on
an mmap'ed page, I don't know what it should do.  And if the error is
on the swap partition it should kill any process it was trying to swap
in, and blacklist those disk pages.
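
To make that policy concrete, here is a minimal sketch of it as a
lookup.  The names (Where, on_read_error) are made up for illustration
and are not any kernel's API; the mmap case is deliberately left
empty, since I don't have an answer for it.

```python
from enum import Enum, auto

class Where(Enum):
    """Where the unreadable block lives (hypothetical classification)."""
    DATA = auto()
    METADATA = auto()
    MMAP = auto()
    SWAP = auto()

def on_read_error(where):
    """Return the actions proposed above for an unreadable block."""
    if where is Where.DATA:
        return ["return EIO to the reader"]
    if where is Where.METADATA:
        return ["remount filesystem read-only", "return EIO as needed"]
    if where is Where.SWAP:
        return ["kill the process being swapped in",
                "blacklist the disk pages"]
    return []  # mmap'ed page: left open, as in the text
```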

These are exactly the sorts of things Solaris claims to do in Solaris 10
since they started their ``green line'' marketing campaign.  They
claim they can retire memory modules and CPUs as well.  But it's a
god damned steaming pile of lies.  Solaris panics or freezes forever,
sometimes when just one component of a supposedly-mirrored ZFS vdev
goes away.  On the mailing list they say ``ZFS is not integrated with
FMA yet.''  Any year now.  At least they know what they're _trying_ to
do, though!

I've found that in general NetBSD and (recent) Linux can usually get
through a 'dd conv=noerror,sync bs=512' on a failing disk, though it
sometimes takes a week.  Other than that capability, all bets are off.
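
For anyone unsure what that dd invocation buys you, here is a rough
sketch of the behaviour: read one 512-byte sector at a time, and when
a read fails, write zeros in its place and carry on.  This is only an
approximation of dd, not its implementation; rescue_copy, the
read_sector hook, and the fixed sector count are assumptions made for
the example.

```python
import os

SECTOR = 512  # matches bs=512 in the dd command above

def rescue_copy(src_fd, dst, total_sectors, read_sector=None):
    """Copy sector by sector; on a read error, zero-fill and keep going.

    Returns the list of sector numbers that could not be read.
    """
    if read_sector is None:
        # Default reader: positional read from an open file descriptor.
        read_sector = lambda fd, n: os.pread(fd, SECTOR, n * SECTOR)
    bad = []
    for n in range(total_sectors):
        try:
            data = read_sector(src_fd, n)
        except OSError:
            # Like conv=noerror: do not stop on a failed read.
            data = b""
            bad.append(n)
        # Like conv=sync: pad every block to full size with NULs, so
        # offsets in the copy still line up with the original disk.
        dst.write(data.ljust(SECTOR, b"\0"))
    return bad
```

The returned list of bad sectors is what you would want to log, since
the copy itself no longer tells you which blocks were zero-filled.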

As for why it freezes parts of your system totally unrelated to the
block that went illegible, it's probably because the disk takes tens of
seconds to return failure, and won't service other requests in the
meantime.

I could imagine a world where this is fixed.  There are ``mode pages''
on the disk that you could tweak on the pre-Jobs Mac OS using ``FWB
Hard Disk Toolkit'', to ask the disk to return failure sooner.  If the
disk could be convinced to continue servicing a tagged queue while
concurrently doing the retry, that might help through a different
mechanism, one that gives the disk more freedom in its recovery
protocol.  You would still need some combination of the two tricks to
avoid filling the short queue with read requests for illegible sectors.

But I've never heard of anyone using either approach.  Since most disk
drive customers don't do it, it seems like you would need to hold a
software support contract for your disk's firmware to pull it off
consistently.  The mode pages aren't always implemented, and a
marginal-disk-simulator would be valuable.  

I wonder what EMC and Hitachi do.  For writes, it's no problem because
they have their fancy NVRAMs.  For reads, maybe just indulge disks
that freeze up: if a disk doesn't answer within some subsecond
interval, don't wait for it to report failure.  Dispatch the same read
to another RAID component, on expiry of the OS's _own_ timer rather
than the disk's?  Update a disk-demerit counter, and retire the disk
if it freezes too often?
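
A minimal sketch of that guess, so it is clear what I mean.  I have no
idea what EMC or Hitachi actually do; the Mirror class, the 0.5 second
timeout, and the demerit threshold are all invented for illustration.
Each RAID component is modelled as a plain function from block number
to bytes, and a read that times out is simply abandoned in its worker
thread, much like a frozen disk.

```python
import concurrent.futures as cf

class Mirror:
    """Read from mirrored components using our own timer, not the disk's."""

    def __init__(self, components, timeout=0.5, max_demerits=3):
        self.components = list(components)  # callables: block -> bytes
        self.timeout = timeout              # seconds we are willing to wait
        self.max_demerits = max_demerits    # retire a disk at this count
        self.demerits = {i: 0 for i in range(len(self.components))}
        self.retired = set()
        self.pool = cf.ThreadPoolExecutor(max_workers=8)

    def read(self, block):
        for i, component in enumerate(self.components):
            if i in self.retired:
                continue
            future = self.pool.submit(component, block)
            try:
                # Wait on our own timer instead of the disk's retry loop.
                return future.result(timeout=self.timeout)
            except (cf.TimeoutError, OSError):
                # Slow or failed: count it against the disk and fall
                # through to the next component with the same read.
                self.demerits[i] += 1
                if self.demerits[i] >= self.max_demerits:
                    self.retired.add(i)
        raise OSError(5, "no component answered in time")
```

A real array would also have to decide what to do with the abandoned
request when the disk finally answers; the sketch just ignores it.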

Anyway, what I mean to say is, I think (good) RAID avoids more than
just data loss.  Without it, collecting logs of a slowly-failing disk
to find and fix the problem is harder.  Things like mirrored swap make
sense.
