Re: vget() patch

To: tech-kern%NetBSD.org@localhost
Subject: Re: vget() patch
From: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
Date: Fri, 27 Nov 2009 20:03:58 +0100

On Wed, Oct 28, 2009 at 06:57:36PM +0100, Manuel Bouyer wrote:
> Hi,
> unless someone objects I'll commit the attached patch and request
> pullup to netbsd-5 next week-end. This is he same as the one posted in the
> "panic due to lost vnode flag" thread. 
> Althoug it doesn't fix this special issue, it seems to fix
> kern/41147 in a better way, and I also hope kern/42205.
> I've been running with this patch on several servers (and the previous
> fix for kern/41147 commented out as in the patch below) without
> problems; this is also being tested by others.

This patch cause a deadlock with layered fs:
the vrele kernel thread can sleep in vrelel() getting the vnode lock.
While it does this another thread may sleep in vget() waiting for
VI_INACTNOW to clear (well, not exactly, but it's looping on vget(NOWAIT)
in layer_node_find() which leads to the same result) and if this
is a vnode in a upper layer of a stacked FS, this thread may hold
the vnode lock because it's the same as the vnode lock from the
corresponding vnode in the lower layer (i.e. in layer_node_find(),
lowervp and the vp we call vget with share the same lock).
So the vrele thread will never get the vnode lock to VOP_INACTIVE
it and the loop in always restart because vget() fails because vp
is in VI_INACTNOW state.

A similar deadlock could happen in layer_node_find() with VI_XLOCK,
it's avoided by skipping vnode with VI_XLOCK set. We can't do the
same for VI_INACTNOW because when VI_INACTNOW is set we don't know
if the vnode will really be cleaned or not. Skipping VI_INACTNOW
vnode would lead to duplicate entries in the layer mount hash.

I think a better way to handle this is to not sleep with VI_INACTNOW
set in vrelel(), this is what the attached patch does.
Before trying to get the vnode lock (when allowed to sleep, in the
LK_NOWAIT case it's not an issue), we drop VI_INACTNOW.
Once we got the lock, we set VI_INACTNOW again and test that
we did not gain another reference while we were slepping. If we
did, we drop VI_INACTNOW, drop the vnode lock and exit from vrelel().
If we didn't gain a reference all is well, we set VI_INACTNOW again and
go on VOP_INACTIVEing the vnode.

Comments ?

-- 
Manuel Bouyer <bouyer%antioche.eu.org@localhost>
     NetBSD: 26 ans d'experience feront toujours la difference
--

Index: vfs_subr.c
===================================================================
RCS file: /cvsroot/src/sys/kern/vfs_subr.c,v
retrieving revision 1.357.4.7
diff -u -p -u -r1.357.4.7 vfs_subr.c
--- vfs_subr.c  21 Nov 2009 20:07:49 -0000      1.357.4.7
+++ vfs_subr.c  27 Nov 2009 18:57:41 -0000
@@ -1411,14 +1411,30 @@ vrelel(vnode_t *vp, int flags)
                        /* The pagedaemon can't wait around; defer. */
                        defer = true;
                } else if (curlwp == vrele_lwp) {
-                       /* We have to try harder. */
-                       vp->v_iflag &= ~VI_INACTREDO;
+                       /*
+                        * We have to try harder. But we can't sleep
+                        * with VI_INACTNOW as vget() may be waiting on it.
+                        */
+                       vp->v_iflag &= ~(VI_INACTREDO|VI_INACTNOW);
+                       cv_broadcast(&vp->v_cv);
                        error = vn_lock(vp, LK_EXCLUSIVE | LK_INTERLOCK |
                            LK_RETRY);
                        if (error != 0) {
                                /* XXX */
                                vpanic(vp, "vrele: unable to lock %p");
                        }
+                       mutex_enter(&vp->v_interlock);
+                       /*
+                        * if we did get another reference while
+                        * sleeping, don't try to inactivate it yet.
+                        */
+                       if (__predict_false(vtryrele(vp))) {
+                               VOP_UNLOCK(vp, 0);
+                               mutex_exit(&vp->v_interlock);
+                               return;
+                       }
+                       vp->v_iflag |= VI_INACTNOW;
+                       mutex_exit(&vp->v_interlock);
                        defer = false;
                } else if ((vp->v_iflag & VI_LAYER) != 0) {
                        /*

Follow-Ups:
- Re: vget() patch
  - From: Joachim König

References:
- vget() patch
  - From: Manuel Bouyer

Prev by Date: Re: 1024bytes/sect microSD support
Next by Date: Re: 1024bytes/sect microSD support
Previous by Thread: Re: vget() patch
Next by Thread: Re: vget() patch
Indexes:

Home | Main Index | Thread Index | Old Index