Subject: kern/36150: vnlock deadlock on nbftp due to layerfs and LFS interaction
To: kern-bug-people@netbsd.org, gnats-admin@netbsd.org
From: Chuck Silvers <chuq@chuq.com>
List: netbsd-bugs
Date: 04/16/2007 04:25:00
>Number:         36150
>Category:       kern
>Synopsis:       vnlock deadlock on nbftp due to layerfs and LFS interaction
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Apr 16 04:25:00 +0000 2007
>Originator:     Chuck Silvers
>Release:        NetBSD 4.0_BETA2
>Organization:
>Environment:
System: NetBSD babylon5.netbsd.org 4.0_BETA2 NetBSD 4.0_BETA2 (NBFTP) #3: Thu Mar  8 07:35:12 GMT 2007  spz@ADMIN:/usr/obj/sys/arch/amd64/compile.amd64/NBFTP amd64
Architecture: amd64
Machine: amd64
>Description:

in one of ftp.netbsd.org's recent vnlock crash dumps,
I found these two deadlocked threads:


thread 1
29129 ffff800048d1b000 vnlock   (bozohttpd)

(gdb) pcb 0xffff800048d1b000
(gdb) bt
#0  0xffffffff80259cfe in mi_switch ()
#1  0xffffffff8025a5fd in ltsleep ()
#2  0xffffffff802431dd in acquire ()
#3  0xffffffff802441c8 in _lockmgr ()
#4  0xffffffff802af978 in VOP_LOCK ()
#5  0xffffffff801f9531 in lfs_putpages ()
#6  0xffffffff802afbe0 in VOP_PUTPAGES ()
#7  0xffffffff802a2fc9 in vinvalbuf ()
#8  0xffffffff802a3359 in vclean ()
#9  0xffffffff802a38f2 in vgonel ()
#10 0xffffffff802a3d6c in getcleanvnode ()
#11 0xffffffff802a3eb9 in getnewvnode ()
#12 0xffffffff802b4361 in layer_node_alloc ()
#13 0xffffffff802b42e0 in layer_node_create ()
#14 0xffffffff802b5512 in layer_lookup ()
#15 0xffffffff802af37a in VOP_LOOKUP ()
#16 0xffffffff8029ee86 in lookup ()
#17 0xffffffff8029f462 in namei ()
#18 0xffffffff802a7acc in sys___lstat30 ()
#19 0xffffffff802e5ed5 in syscall_plain ()


thread 2
 1565 ffff800049bd0000 vget   (cvs)

(gdb) pcb 0xffff800049bd0000
(gdb) bt
#0  0xffffffff80259cfe in mi_switch ()
#1  0xffffffff8025a5fd in ltsleep ()
#2  0xffffffff802a2bfc in vget ()
#3  0xffffffff802b4226 in layer_node_find ()
#4  0xffffffff802b429d in layer_node_create ()
#5  0xffffffff802b5512 in layer_lookup ()
#6  0xffffffff802af37a in VOP_LOOKUP ()
#7  0xffffffff8029ee86 in lookup ()
#8  0xffffffff8029f462 in namei ()
#9  0xffffffff802ad1e9 in vn_open ()
#10 0xffffffff802a9ec0 in sys_open ()
#11 0xffffffff802e5ed5 in syscall_plain ()


here's how we got here:

thread 1 calls getnewvnode(), which chooses a nullfs REG vnode to recycle.
    vclean() sets XLOCK on the nullfs vnode and acquires the VOP_LOCK on
    the nullfs vnode (and thus on the underlying LFS vnode as well, since
    a layerfs vnode shares its lock with the vnode below it).
    next it calls vinvalbuf() on the nullfs vp, which calls layer_putpages(),
    which calls the underlying lfs_putpages(), which ends up sleeping
    (still holding the VOP_LOCK).
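
for reference, the thread 1 path condenses to roughly this.  this is a
sketch from memory of the 4.0 sources, with error handling and most
details elided:

	/* in vclean(), cleaning the nullfs vnode vp: */
	vp->v_flag |= VXLOCK;			/* the "XLOCK" above */
	VOP_LOCK(vp, LK_DRAIN | LK_INTERLOCK);	/* locks the nullfs vp and,
						   via layerfs, the LFS vp */
	vinvalbuf(vp, V_SAVE, NOCRED, l, 0, 0);	/* -> VOP_PUTPAGES, passed
						   through to lfs_putpages(),
						   which sleeps and meanwhile
						   drops and retakes the
						   vnode lock */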

thread 2 calls layer_lookup() looking for the vnode being recycled above.
    layer_lookup() tries to look up the lower vnode, but finds it locked.
    this thread sets LK_WANT_EXCL on the LFS vnode's VOP_LOCK and sleeps.

thread 1 wakes up and reaches the point in lfs_putpages() where the VOP_LOCK
    is released.  eventually this thread sleeps again.

thread 2 wakes up and finishes getting the VOP_LOCK on the LFS vnode.
    then this thread looks up the nullfs vp, calls vget() on it,
    sees that XLOCK is set (by thread 1) and sleeps waiting for
    the XLOCK to be released.

thread 1 wakes up, eventually tries to reacquire the VOP_LOCK,
    sees that the VOP_LOCK is held (by thread 2) and sleeps
    waiting for the VOP_LOCK to be released.

deadlock: thread 1 holds the nullfs vnode's XLOCK and is waiting for the
LFS vnode's VOP_LOCK, while thread 2 holds that VOP_LOCK and is waiting
for the XLOCK to be cleared.


there are two problems here:

 - layer_putpages() shouldn't pass the request through to the underlying vnode
   if the request is part of vnode recycling.  to fix this I'll add a flag
   PGO_RECLAIM, similar to FSYNC_RECLAIM, and have the layered file systems
   just return if they see this new flag in their putpages routines
   (sketched below).  this change alone is enough to avoid this particular
   deadlock.
 - lfs_putpages() really shouldn't be releasing the vnode lock.
   this breaks the POSIX read/write atomicity rule, and it seems
   generally like a bad idea.  I haven't looked into the LFS code to
   see how it might be reworked.
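
here's roughly what I have in mind for the layerfs side.  this is only a
sketch: PGO_RECLAIM doesn't exist yet, and the pass-through half is just
the usual layerfs boilerplate.  it assumes the normal VOP_PUTPAGES calling
convention, where the caller holds the vnode's v_interlock and the callee
releases it.

int
layer_putpages(void *v)
{
	struct vop_putpages_args /* {
		struct vnode *a_vp;
		voff_t a_offlo;
		voff_t a_offhi;
		int a_flags;
	} */ *ap = v;
	struct vnode *vp = ap->a_vp;

	if (ap->a_flags & PGO_RECLAIM) {
		/*
		 * the layer vnode is being recycled: don't call down
		 * to the fs below, since that's exactly the call that
		 * wedged thread 1 above.  the lower vnode will flush
		 * its own pages when it is cleaned itself.
		 */
		simple_unlock(&vp->v_interlock);
		return 0;
	}

	/* otherwise pass the request through to the lower vnode. */
	ap->a_vp = LAYERVPTOLOWERVP(vp);
	simple_unlock(&vp->v_interlock);
	simple_lock(&ap->a_vp->v_interlock);
	return VCALL(ap->a_vp, VOFFSET(vop_putpages), ap);
}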


>How-To-Repeat:

run nullfs over LFS, and put enough load on it to force vnode recycling.
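
something along these lines, pointed at the null mount, should eventually
wedge the same way.  this is an untested sketch: the /null path and file
names are made up, the tree needs to be large relative to kern.maxvnodes
so that lookups actually recycle vnodes, and something else needs to be
dirtying pages through the mount so that lfs_putpages() has work to do.

#include <sys/stat.h>

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
	struct stat st;
	char path[64];
	int fd;

	if (fork() == 0) {
		/* child: open/close random files, like the cvs thread */
		for (;;) {
			snprintf(path, sizeof(path), "/null/f%u",
			    arc4random() % 10000);
			if ((fd = open(path, O_RDONLY)) != -1)
				close(fd);
		}
	}

	/* parent: lstat random files, like the bozohttpd thread */
	for (;;) {
		snprintf(path, sizeof(path), "/null/f%u",
		    arc4random() % 10000);
		(void)lstat(path, &st);
	}
	/* NOTREACHED */
}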

>Fix:

see above for the layerfs part.
the LFS part is not provided.