Current-Users archive


Known issue with raidframe, or should I file a bug?



        hello.  I've got a NetBSD-4 box which is running raidframe with a bunch
of disks raided together in RAID 5 fashion.  Right now, we're experiencing a
rash of disk failures, which is causing me to exercise the recovery
procedures of the raidframe subsystem quite heavily.  This morning, I ran
into what looks like a bug, and I wonder if anyone has seen it before and
whether it's a known issue, or if I should file a PR against it.
        the issue is that if the drive you're reconstructing to fails, you get
an error message telling you that the reconstruction failed, but the
raid_recon process doesn't exit.  Instead, it just sits there in raidframe
wait until you reboot.  What this means is that if you have another spare
disk to use, you can't try reconstructing to it until you reboot.
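        For anyone who wants the concrete picture, the sequence I'm talking
about looks roughly like this (the raid set and component names here are
just illustrative, not the actual devices on this box):

    # add a hot spare, then fail the dead component so raidframe
    # starts reconstructing onto the spare
    raidctl -a /dev/wd3e raid0
    raidctl -F /dev/wd2e raid0

    # watch reconstruction progress
    raidctl -S raid0

    # if the spare itself dies partway through, the reconstruction
    # reports failure but raid_recon never exits, so adding a second
    # spare and repeating the steps above doesn't work until a reboot
    raidctl -a /dev/wd4e raid0
    raidctl -F /dev/wd2e raid0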
        I believe this worked in NetBSD 2.x and 3.x; at least, I think I've had
this happen before and was able to recover without rebooting.  It
admittedly doesn't happen that frequently, but given the nature of disk
failures nowadays, it seems like it would be good to minimize the
situations where you have to reboot before you can move on to the next
step of recovering your ailing raid subsystem.  At least for us, we're
finding that Google's research is correct: drives purchased at the same
time from the same manufacturer tend to fail at the same time, leading to
very heavy use of the raidframe software's recovery features.  We saw it
with the IBM "Death Star" drives, and we're now seeing it with Western
Digital disks purchased in mid-2006.
        If this is not a known issue, I'll file a PR on it.

-thanks
-Brian

