Subject: [solution] vnode refcount panic, perhaps due to kern/vfs_lookup.c:lookup()
To: None <tech-kern@netbsd.org>
From: Greg Troxel <gdt@ir.bbn.com>
List: tech-kern
Date: 03/16/2003 20:23:01
I think I have found the problem, but I'm not 100% sure since the
crashes were not repeatable by a fixed number of operations.  It used
to crash after 3-5 'modify buffer, ^X u' operations, and now I've gone
20 or so.  (duh this is emacs so off goes a keyboard macro to do '
^X^S' 200 times, causing massive coda lock printouts
(coda_lockdebug=1).  After, repeating with ' ^Xu' 200 times, I'm
pretty convinced that at the very least the code is closer to correct

In coda/coda_vnops.c:coda_symlink, vput is called before lookup on the
parent vnode.  This seemed wrong to me - dropping a reference and then
using the vnode.  I took out the vput and got 'locking against
myself'.  After reading the code more, I came to the conclusion:

  lookup expects a referenced vnode, but not one that's locked

So, I replaced vput with VOP_UNLOCK, so that the vn_lock in lookup
would be happy, but without dropping the reference.  I think this lost
reference was casuing the vnode to get reclaimed at a surprising time,
causing the VBAD symptom.

I'd appreciate comments from those who are more vfs-clueful than I am
- is my interpretation of lookup's calling conventions and the
referencing rules correct?

Here is a patch against netbsd-1-6 as of 20030206 (from my own repo).
The first and third hunks are attempts to notice the VBAD problem
earlier rather than later; the second hunk is the fix - just dropping
vput in favor of VOP_UNLOCK.

I'm happy to send a clean patch and file a PR if that's helpful.

An aside: I suppose that since lookup(9) and the comments in the
sources don't say that the vnode is to be locked on entry, therefore
the caller should hold a reference and the vnode should be unlocked.
This is arguably implied by vnode(9), but it took me a long time to
figure this out.  The notion of holding a reference is pretty
straightforward, but the notion that one passes that reference into a
function for functions that don't "return" their directory arguments
was a bit confusing - some explicit language about when functions
consume references and that returned values are ref'd might be
helpful.  Now that I understand it it seems obvious, so it could just
be me being confused.

Thanks to Bill and Jaromir for helpful hints in understanding vnodes!

--- coda_vnops.c.~1.1.1.2.~	Wed Dec  5 23:27:40 2001
+++ coda_vnops.c	Sun Mar 16 19:50:58 2003
@@ -1640,8 +1640,15 @@
     error = venus_symlink(vtomi(tdvp), &tdcp->c_fid, path, plen, nm, len, tva, cred, p);
 
     /* Invalidate the parent's attr cache, the modification time has changed */
+    /* XXX what does this do to the vnode? */
     tdcp->c_flags &= ~C_VATTR;
 
+    if (tdvp->v_type == VBAD) {
+	    error = ENODEV;	/* not really right, but want unique */
+	    vprint("venus_symlink VBAD: ", tdvp);
+	    return error;	/* XXX CODADEBUG */
+    }
+
     if (!error)
     {
 	struct nameidata nd;
@@ -1652,9 +1659,16 @@
 	nd.ni_cnd.cn_pnbuf = (char *)nm;
 	nd.ni_cnd.cn_nameptr = nd.ni_cnd.cn_pnbuf;
 	nd.ni_pathlen = len;
-	vput(tdvp);
+
+	/*
+	 * tdvp is refed and VOP_LOCKed.  lookup seems to require it to
+	 * be referenced (as do all calls ?) but not locked - it locks it.
+	 * So, drop the lock but not the reference.
+	 */
+	VOP_UNLOCK(tdvp, 0);
 	error = lookup(&nd);
 	*ap->a_vpp = nd.ni_vp;
+
     }
 
     /* 
@@ -1980,6 +1994,10 @@
 {
     struct cnode *cp;
     int          err;
+
+    if (type == VBAD) {
+	    panic("make_coda_node: VBAD");
+    }
 
     if ((cp = coda_find(fid)) == NULL) {
 	struct vnode *vp;