Subject: pr 32535 followup
To: None <>
From: Bill Studenmund <>
List: tech-kern
Date: 11/28/2006 10:48:18
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

PR 32535 describes a layered file system issue we have been seeing. The=20
problem is that when we traverse a path, we have the parent dir locked, we=
get the child, the VOP_LOOKUP() call unlocks the parent, then we vrele()=20
the parent. If the use count is zero, vrele() will lock the node and call=

Strictly speaking, we are locking a vnode's parent while holding=20
it locked which is a violation of the locking protocol. Normally that's=20
fine as the use count won't be zero if anything else is using the vnode,=20
and someone holding a lock on the parent and waiting for the vnode would=20
have a use reference.

The problem is when we have file system layers involved. We lock the whole=
stack, not just one node. So if we have simultaneous access through either=
multiple layers or through a layer and the underlying file system, then we=
can hit a situation where a process has a lock on the parent stack but not=
a reference to our explicit parent vnode. Thus vrele() will try to lock=20
the parent stack and deadlock against the other access.

In the PR, I outline a few options. One is a routine that handles the=20
locking explicitly. Another is to have the VOP_LOOKUP() routines not=20
automatically unlock the parent. If the parent stays locked, we can just=20
vput() the parent which avoids the re-lock issue.

I've proposed one patch, which adds an explicit lock-handling routine,
vrele2(). We pass it the to-be-released vnode and another, locked, vnode.
It behaves the same as vrele, except that if a trylock fails, it releases
the other vnode's lock, locks the to-be-released one, and relocks the
other one.

I think it's kinda gross, but it's a simple, direct fix.

We have tried this patch on, and it has fired a few times.=
Each time it has fired is a time when we would have deadlocked the system=
and had to reboot.

Chuck Silvers has also proposed a fix which changes the locking protocol.=
Now, VOP_LOOKUP() won't unlock the parent. The patch is at=20 and there's=
a patch for 3.X in that directory.

I think this fix is the right way to go in the long run. I'm uncomfortable
though pulling this into 3.x as it's a big change.  Actually, it's a big
change with a number of other, little changes. On the flip side, though,=20
not pulling this change into 3.X means pulling in vrele2() then retiring=20
it in 4.X. :-(

I do like getting rid of WANTPARENT, and I also like lookup(), relookup(),
and VOP_LOOKUP() ignoring LOCKPARENT. All of that made my head hurt.

I think the fh<->vp vfs changes are good and probably should just go into
the tree now.

I don't think we can get rid of PDIRUNLOCK, as the reason we have it still
exists (unless I missed something). When we look up "..", we have to
unlock the parent directory, lock ".." (which is really its grandparent,
the parent of the "parent" directory), then re-lock the parent. That
re-locking can fail, yet we have no real way to communicate that back up
the call graph. Think of it as an extra return value. Actually, with this
change, we won't even notice.

The fix for layer_node_find() touches on another PR hitting layered file
systems. dbj@ and ad@ looked into it, and reported details which I filed
as PR 35143. I completely missed the fact that if LK_TYPE_MASK & flags is
zero, then vget() will skip lockmgr(). That's exactly what this code
needed. The problem with the change, though, is that it still has the
deadlock issue.

I think we should rip out all of the LAYERFS_UPPERLOCK() stuff. All of our
file systems now support real locking. I think all we need to do is call
VOP_LOCK() on the lower layer (after turning LK_DRAIN into LK_EXCLUSIVE
and fixing up the interlock). As long as the lower layer locking routine
behaves as if it is a lockmgr() call (if you pass in LK_EXCLUSIVE, when
you return, no one else gets to return until you call VOP_UNLOCK(), taking
recursion into account), then everything is fine.


Take care,


Content-Type: application/pgp-signature
Content-Disposition: inline

Version: GnuPG v1.4.3 (NetBSD)