tech-kern archive

Re: vnode lock and v_numoutput



On Sat, Jan 24, 2015 at 05:32:41PM +0000, Taylor R Campbell wrote:
>    Date: Sat, 24 Jan 2015 14:34:24 +0100
>    From: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
> 
>    I guess what happens is:
>    vndthread is blocked trying to get the vn_lock because the thread
>    doing the sys_fsync() is holding it.
>    The thread in sys_fsync() waits for v_numoutput to go down to 0,
>    but this won't happen because vndthread has increased v_numoutput but
>    has not queued the I/O yet.
> 
> Whatever v_numoutput is accounting needs to be done alongside the
> accounting if anything is going to cv_wait on it.  It is clearly wrong
> API design (no surprise) that there are many places in the tree that
> read and modify v_numoutput directly.  For example, why does
> genfs_do_io increment v_numoutput by 2??
> 
> Step #1 is to identify what v_numoutput is supposed to be accounting,
> document it, and move all uses of it to a single abstraction in
> vfs_vnode.c...

Yes, I agree it's not clear what this is supposed to do. This could probably
be improved (but it's not trivial).
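
For the sake of discussion, such an abstraction could look roughly like
this (a sketch only: the names vnode_output_begin()/vnode_output_done()
don't exist in the tree, and the use of v_interlock and v_cv is an
assumption based on how vwakeup() does its accounting today):

void
vnode_output_begin(struct vnode *vp)
{

	/* count one pending write buf against the vnode */
	mutex_enter(vp->v_interlock);
	vp->v_numoutput++;
	mutex_exit(vp->v_interlock);
}

void
vnode_output_done(struct vnode *vp)
{

	/* retire one pending write buf; wake up anyone waiting for 0 */
	mutex_enter(vp->v_interlock);
	KASSERT(vp->v_numoutput > 0);
	if (--vp->v_numoutput == 0)
		cv_broadcast(&vp->v_cv);
	mutex_exit(vp->v_interlock);
}

Callers like genfs_do_io() or vnd would then never touch v_numoutput
directly, which would make it easier to see (and assert) where the count
can be observed by a waiter.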

> 
>    Should a thread hold the vnode lock (vn_lock()) before increasing v_numoutput?
> 
> No, I don't think so.  It seems to be the number of pending bufs that
> have been submitted to the disk device to satisfy writes to the vnode,
> so there's no invariant related to the per-file-system state.
> 
> It looks to me as though it is wrong to
> 
> (a) increase v_numoutput,
> (b) wait for the vnode lock, and then
> (c) submit the buf you just counted in v_numoutput,
> 
> because we can't lock the vnode until the buf is completed if fsync
> began waiting between (a) and (b).  (`Submit' in (c) means sending to
> a disk, e.g. with VOP_STRATEGY, or, in this case, just calling
> nestiobuf_done.)
> 
> On the other hand, it is also wrong to
> 
> (a) submit a buf, and then
> (b) increase v_numoutput,
> 
> because there may be outstanding bufs which an fsync initiated between
> (a) and (b) won't notice.
> 
> But is that wrong for the parent buf of a nestiobuf transaction?
> These don't actually correspond to data themselves, and it seems that
> the purpose of the v_numoutput accounting is to notify fsync when all
> data submitted to disk have been written.  So based on this analysis I
> suspect it would be safe to move the v_numoutput++ to just before the
> call to nestiobuf_done in handle_with_strategy.

No, it's not: the call to nestiobuf_done() may be a no-op (it probably is
in most cases) when skipped is 0. The real biodone() of the parent buf may
happen before we get there, so we need to do the v_numoutput++ before that.
The attached patch increases v_numoutput for the parent buf just before
queuing the last nested buf. I think this is the correct fix (or should
we call it a workaround?). With this I couldn't reproduce the problem
on my test machine.
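
To make the ordering constraint explicit: the fsync side holds the vnode
lock and then waits for the count to drain, roughly like this (a
simplified sketch of the wait described above, not a literal quote of
the tree):

	mutex_enter(vp->v_interlock);
	/* sleep until every buf counted in v_numoutput has been biodone()'d */
	while (vp->v_numoutput > 0)
		cv_wait(&vp->v_cv, vp->v_interlock);
	mutex_exit(vp->v_interlock);

So the parent buf has to be counted before the last nested buf is handed
to VOP_STRATEGY() (once it is, its completion can biodone() the parent at
any time), but not while vndthread may still have to sleep on the vnode
lock, which is what the patch below tries to arrange.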

-- 
Manuel Bouyer <bouyer%antioche.eu.org@localhost>
     NetBSD: 26 years of experience will always make the difference
--
Index: vnd.c
===================================================================
RCS file: /cvsroot/src/sys/dev/vnd.c,v
retrieving revision 1.219.8.2
diff -u -p -u -r1.219.8.2 vnd.c
--- vnd.c	5 Jul 2012 18:12:46 -0000	1.219.8.2
+++ vnd.c	24 Jan 2015 21:41:11 -0000
@@ -785,12 +785,6 @@ handle_with_strategy(struct vnd_softc *v
 
 	flags = obp->b_flags;
 
-	if (!(flags & B_READ)) {
-		vp = bp->b_vp;
-		mutex_enter(vp->v_interlock);
-		vp->v_numoutput++;
-		mutex_exit(vp->v_interlock);
-	}
 
 	/* convert to a byte offset within the file. */
 	bn = obp->b_rawblkno * vnd->sc_dkdev.dk_label->d_secsize;
@@ -862,6 +856,21 @@ handle_with_strategy(struct vnd_softc *v
 			    nbp->vb_buf.b_flags, nbp->vb_buf.b_data,
 			    nbp->vb_buf.b_bcount);
 #endif
+		if (!(flags & B_READ) && resid == sz) {
+			struct vnode *w_vp;
+			/*
+			 * This is the last nested buf; account for
+			 * the parent buf's write too.
+			 * This has to be done last, so that fsync
+			 * won't wait for this write, which has no
+			 * chance to complete before all nested bufs
+			 * have been queued.
+			 */
+			w_vp = bp->b_vp;
+			mutex_enter(w_vp->v_interlock);
+			w_vp->v_numoutput++;
+			mutex_exit(w_vp->v_interlock);
+		}
 		VOP_STRATEGY(vp, nbp);
 		bn += sz;
 	}

