Source-Changes-HG archive


[src/trunk]: src/sys/dev/raidframe Yesterday's fix to rf_disks.c (rev 1.51) w...



details:   https://anonhg.NetBSD.org/src/rev/4423f37a02c9
branches:  trunk
changeset: 559678:4423f37a02c9
user:      oster <oster%NetBSD.org@localhost>
date:      Sun Mar 21 21:08:08 2004 +0000

description:
Yesterday's fix to rf_disks.c (rev 1.51) was necessary, but not
sufficient to clobber this nasty little bug.  The behaviour observed
was a panic when doing a 'raidctl -f' on a component when DAGs were
in flight for the given RAID set.  Unfortunately, the faulty behaviour
was very intermittent, which made it difficult to reproduce the bug
reliably, to determine when it was fixed, and even to figure out what
might be the cause of the problem.

The real issue was that ci_vp for the failed component was being
set to NULL in rf_FailDisk(), but with DAGs still in flight, some
of them were still expecting to use ci_vp to determine where to
read from/write to!

The fix is to call rf_SuspendNewRequestsAndWait() from rf_FailDisk()
to make sure the RAID set is quiet and all IOs have completed before
mucking with ci_vp and other data structures.  rf_ResumeNewRequests()
is then used to continue on as usual.
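
The suspend/modify/resume ordering described above can be sketched in
user space as follows.  This is only an illustrative model, not the
RAIDframe code: struct toy_raid and every toy_* function are
hypothetical names, the component pointer merely stands in for ci_vp,
and POSIX threads primitives replace the kernel's own locking.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a RAID set; not the real RF_Raid_t. */
struct toy_raid {
	pthread_mutex_t lock;
	pthread_cond_t  cv;
	int   suspended;   /* nonzero: new requests must wait */
	int   in_flight;   /* requests started but not yet finished */
	void *component;   /* plays the role of ci_vp */
};

void toy_init(struct toy_raid *r, void *component)
{
	pthread_mutex_init(&r->lock, NULL);
	pthread_cond_init(&r->cv, NULL);
	r->suspended = 0;
	r->in_flight = 0;
	r->component = component;
}

/* Start a request: wait while suspended, then count it as in flight.
   Returns the component handle (NULL once the component has failed). */
void *toy_io_start(struct toy_raid *r)
{
	void *c;

	pthread_mutex_lock(&r->lock);
	while (r->suspended)
		pthread_cond_wait(&r->cv, &r->lock);
	r->in_flight++;
	c = r->component;
	pthread_mutex_unlock(&r->lock);
	return c;
}

/* Finish a request and wake anyone waiting for the set to drain. */
void toy_io_done(struct toy_raid *r)
{
	pthread_mutex_lock(&r->lock);
	assert(r->in_flight > 0);
	r->in_flight--;
	pthread_cond_broadcast(&r->cv);
	pthread_mutex_unlock(&r->lock);
}

/* Block new requests, then wait until nothing is in flight. */
void toy_suspend_and_wait(struct toy_raid *r)
{
	pthread_mutex_lock(&r->lock);
	r->suspended = 1;
	while (r->in_flight > 0)
		pthread_cond_wait(&r->cv, &r->lock);
	pthread_mutex_unlock(&r->lock);
}

/* Let waiting and future requests proceed again. */
void toy_resume(struct toy_raid *r)
{
	pthread_mutex_lock(&r->lock);
	r->suspended = 0;
	pthread_cond_broadcast(&r->cv);
	pthread_mutex_unlock(&r->lock);
}

/* Fail the component only while the set is quiet. */
void toy_fail_component(struct toy_raid *r)
{
	toy_suspend_and_wait(r);
	pthread_mutex_lock(&r->lock);
	r->component = NULL;   /* safe: no request still holds the old handle */
	pthread_mutex_unlock(&r->lock);
	toy_resume(r);
}
```

In this model a request that starts after toy_fail_component() gets a
NULL handle and has to cope with that itself; the point of the sketch
is only that no request which already fetched the handle can still be
running when it is cleared.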

diffstat:

 sys/dev/raidframe/rf_driver.c |  15 +++++++++++++--
 1 files changed, 13 insertions(+), 2 deletions(-)

diffs (43 lines):

diff -r 887b2715d4ea -r 4423f37a02c9 sys/dev/raidframe/rf_driver.c
--- a/sys/dev/raidframe/rf_driver.c     Sun Mar 21 21:02:01 2004 +0000
+++ b/sys/dev/raidframe/rf_driver.c     Sun Mar 21 21:08:08 2004 +0000
@@ -1,4 +1,4 @@
-/*     $NetBSD: rf_driver.c,v 1.97 2004/03/20 04:22:05 oster Exp $     */
+/*     $NetBSD: rf_driver.c,v 1.98 2004/03/21 21:08:08 oster Exp $     */
 /*-
  * Copyright (c) 1999 The NetBSD Foundation, Inc.
  * All rights reserved.
@@ -73,7 +73,7 @@
 
 
 #include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: rf_driver.c,v 1.97 2004/03/20 04:22:05 oster Exp $");
+__KERNEL_RCSID(0, "$NetBSD: rf_driver.c,v 1.98 2004/03/21 21:08:08 oster Exp $");
 
 #include "opt_raid_diagnostic.h"
 
@@ -619,6 +619,13 @@
 int 
 rf_FailDisk(RF_Raid_t *raidPtr, int fcol, int initRecon)
 {
+
+       /* need to suspend IO's here -- if there are DAGs in flight
+          and we pull the rug out from under ci_vp, Bad Things 
+          can happen.  */
+
+       rf_SuspendNewRequestsAndWait(raidPtr);
+
        RF_LOCK_MUTEX(raidPtr->mutex);
        if (raidPtr->Disks[fcol].status != rf_ds_failed) {
                /* must be failing something that is valid, or else it's
@@ -646,6 +653,10 @@
 
        raidPtr->Disks[fcol].auto_configured = 0;
        RF_UNLOCK_MUTEX(raidPtr->mutex);
+       /* now we can allow IO to continue -- we'll be suspending it
+          again in rf_ReconstructFailedDisk() if we have to.. */
+
+       rf_ResumeNewRequests(raidPtr);
 
        if (initRecon)
                rf_ReconstructFailedDisk(raidPtr, fcol);


