Subject: Re: kern/32535: processes stuck on vnlock
To: None <gnats-bugs@netbsd.org>
From: Chuck Silvers <chuq@chuq.com>
List: netbsd-bugs
Date: 01/15/2006 10:26:01
these two processes are each waiting for a lock that the other is holding:

 9684 cd250000 vnlock   ?      RW   0:30.00 (cvs)
21343 cd358000 vnlock   ?      RW   0:33.00 (rsync)


their stack traces are:

cvs
(gdb) bt
#0  0xcd3f829c in ?? ()
#1  0xc024b1fe in bpendtsleep ()
#2  0xc0239a4e in acquire ()
#3  0xc023a217 in lockmgr ()
#4  0xc0291b50 in layer_lock ()
#5  0xc028c25b in vn_lock ()
#6  0xc027fb5c in vrele ()
#7  0xc027d77f in lookup ()
#8  0xc027cf7c in namei ()
#9  0xc0286e38 in sys___stat13 ()
#10 0xc02de469 in syscall_plain ()


rsync
(gdb) bt
#0  0xcd3f8efc in ?? ()
#1  0xc024b1fe in bpendtsleep ()
#2  0xc0239a4e in acquire ()
#3  0xc023a217 in lockmgr ()
#4  0xc028d9fb in genfs_lock ()
#5  0xc028c25b in vn_lock ()
#6  0xc027ad53 in cache_lookup ()
#7  0xc02140d4 in ufs_lookup ()
#8  0xc027d58b in lookup ()
#9  0xc027cf7c in namei ()
#10 0xc0286ed8 in sys___lstat13 ()
#11 0xc02de469 in syscall_plain ()


it looks like the cvs process is taking locks in the wrong order:
it has the child directory locked and is trying to lock the parent
directory in order to call VOP_INACTIVE().  if no null mounts had been
involved, this hang wouldn't have happened: the rsync process would
have held another reference on the parent directory, so vrele() in the
cvs process wouldn't have needed to call VOP_INACTIVE() on the parent
vnode, and thus wouldn't have needed to lock it either.
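
to make the reference-count argument concrete, here is a rough userspace
sketch.  it is not kernel code: struct toy_vnode and
toy_vrele_needs_lock() are made-up names, and the model assumes only
that dropping the last reference is what forces the lock to be taken
for the inactive call.

```c
#include <assert.h>
#include <stdbool.h>

/* hypothetical toy vnode: just a use count, no real lock. */
struct toy_vnode {
	int usecount;
};

/*
 * drop one reference.  returns true when this was the last reference,
 * i.e. the case where the caller would have to lock the vnode in order
 * to run the inactive hook.
 */
static bool
toy_vrele_needs_lock(struct toy_vnode *vp)
{
	assert(vp->usecount > 0);
	vp->usecount--;
	return vp->usecount == 0;
}
```

with a count of 2 (both processes referencing the same vnode) the
release returns false and no lock is needed.  with a count of 1 (the
other process holds its reference on a different vnode, as seems to be
the case with the null mount here) it returns true, and the lock has to
be taken.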

yamt's upcoming redesign of the locking for vnode identities might
prevent this particular problem, but it still seems fundamentally wrong
that dropping a reference to one vnode would require taking a lock on
another vnode in the same context.  it would be safer to defer the work
of releasing the reference on the lower vnode to a worker thread,
so there won't be any other locks held to cause problems like this.
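
a minimal userspace sketch of that deferral idea, using pthread mutexes
in place of vnode locks.  everything here (struct dvnode,
dvnode_rele_deferred(), dvnode_drain(), dvnode_worker()) is a
hypothetical name for illustration, not an existing kernel interface,
and it assumes a vnode is queued at most once at a time.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* hypothetical stand-in for a vnode. */
struct dvnode {
	int usecount;
	int inactive_calls;		/* times the "inactive" hook ran */
	pthread_mutex_t lock;		/* stands in for the vnode lock */
	struct dvnode *next_deferred;	/* link in the deferred list */
};

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static struct dvnode *deferred_head;

/*
 * may be called with other vnode locks held: it only takes the short
 * queue lock and never touches vp->lock.
 */
static void
dvnode_rele_deferred(struct dvnode *vp)
{
	pthread_mutex_lock(&queue_lock);
	vp->next_deferred = deferred_head;
	deferred_head = vp;
	pthread_mutex_unlock(&queue_lock);
}

/*
 * runs in the worker, which holds no vnode locks of its own, so taking
 * vp->lock here cannot be part of a lock-order cycle.
 * returns the number of deferred releases processed.
 */
static int
dvnode_drain(void)
{
	int n = 0;

	for (;;) {
		pthread_mutex_lock(&queue_lock);
		struct dvnode *vp = deferred_head;
		if (vp != NULL)
			deferred_head = vp->next_deferred;
		pthread_mutex_unlock(&queue_lock);
		if (vp == NULL)
			break;

		pthread_mutex_lock(&vp->lock);
		assert(vp->usecount > 0);
		if (--vp->usecount == 0)
			vp->inactive_calls++;	/* the "inactive" work */
		pthread_mutex_unlock(&vp->lock);
		n++;
	}
	return n;
}

/* thread entry point: one pass over the deferred list. */
static void *
dvnode_worker(void *arg)
{
	(void)arg;
	dvnode_drain();
	return NULL;
}
```

the releasing side would call dvnode_rele_deferred() and return without
blocking, and a thread started with
pthread_create(&t, NULL, dvnode_worker, NULL) would do the locking
later.  a real worker would sleep and wake up when work is queued
rather than make a single pass.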

-Chuck