Subject: graceful continuation and recovery on temporary disk failure
To: None <tech-kern@netbsd.org>
From: Matthias Buelow <mkb@mukappabeta.de>
List: tech-kern
Date: 09/04/2003 22:05:23
Hi folks,

I just had an experience on Solaris that I found nothing
short of amazing:

I've got an old SS5 here with Solaris 9 on it, with an internal disk
(system, swap, /usr etc.) and an external one with home dirs and some
other stuff.  As it seems, this morning the cleaning woman scrubbed
the floor a bit too violently and caught the power cable for the
external disk.  At any rate, the external disk was powered down.  I
didn't notice this for a while because the system was operating
normally (I could log in, run commands etc., even write to the
nonexistent disk)
but very slowly, as if it were swapping heavily.  I was a bit
doubtful about this, though, since I couldn't hear the grinding disk
noise that accompanies the machine when it is thrashing.  Still, at
first I thought I had left some amok-running mozilla firebird going
overnight and it had consumed a large chunk of swap (it wouldn't be
the first time), but I found none; and while a couple dozen megs more
swap than usual were in use, I still had ~600mb free.  I then looked
at what dmesg had to say, and it spewed a couple hundred instances of
the following:

Sep  4 20:42:10 xxx scsi: [ID 107833 kern.warning] WARNING: /iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd@0,0 (sd0):
Sep  4 20:42:10 xxx    disk not responding to selection

Only then did I look at the disk and see that the power LED was off and
I somehow became aware that the characteristic whirring sound of an
older scsi disk was missing.  A couple moments later, I checked the
socket where the power cable goes in, firmly reattached the cable and
the disk was spinning back up.  And the system's response time was back
to normal!

I'm not well acquainted with Solaris internals but from what I've
observed, I guess something like the following has happened:

* Power to the external disk went down, and the system noticed that
  it cannot talk to the disk anymore.

* Instead of panicking, or failing/hanging processes that could not
  write to the disk (even sync(1m) didn't hang), it rerouted writes for
  the failed device to some other data store -- probably swap, or the
  VM system in general (although I'm not entirely sure of that part.)

* On each write, or at least periodically, it retried the failed
  disk, got an error, and used the alternative path to store the data
  somewhere (that would explain the lagging performance.)

* User processes weren't affected at all by this, only the kernel
  logged the above warning.

* The machine has been up for 247 days, so I was probably lucky
  that the data in my homedir was (mostly) present in the buffer
  cache at that time and no read failures occurred.

* Once the disk was back up again, writes no longer failed
  and performance was back to normal.  Since less swap was in
  use afterwards than just before, perhaps it had written some
  of the data to swap, as speculated above, and then wrote it back
  to the disk, or at least into ordinary buffers, where it originally
  should have ended up.  Maybe it just swapped out block buffers for
  the affected device, which might be a reasonable thing to do
  in that special situation.

Now that's the kind of graceful recovery from error situations that I
haven't seen yet on workstation-style systems.  It would be extremely
nice if behaviour like this could be implemented on NetBSD as well.
AFAIK, so far NetBSD either reports write errors or hangs processes
that attempt to write to a failed device.  The system could instead
figure out that there is space left somewhere (for example in virtual
memory or swap) where it could temporarily store the data, or it could
swap out the affected parts of the buffer cache, in the hope that the
failed device might come back up again in the near future (of course
that requires that there is no swap on the failed device(s), or at
least none that is needed for the procedure).

Hubert Feyrer told me on IRC that he believes the system's actual
behaviour on such a failure can be configured administratively on
Solaris, for example through a mount option; yet the filesystem was
mounted with the default onerror=panic.  Maybe it wasn't the kind of
failure that "onerror" is triggered on.

Maybe we also have some Solaris wizards reading here that could further
shed some light on the issue. :)

-- 
  Matthias Buelow;  mkb@{mukappabeta.de,informatik.uni-wuerzburg.de}