current-users: Processes stuck waiting on vnlock

Subject: Processes stuck waiting on vnlock
To: None <current-users@netbsd.org>
From: Duncan McEwan <duncan@MCS.VUW.AC.NZ>
List: current-users
Date: 02/21/2001 14:23:59

We're running a current kernel compiled from sources dated 19th/20th Feb
(depending on your timezone) and incorporating Chuck's UBC balancing patch.  We
are seeing problems with processes getting stuck in the kernel waiting on
"vnlock".  If it matters, our userland is from around the beginning of Feb, and
this is on Pentium III machines.

It seems any process that tries to access a file via NFS (from various flavours
of server, including a slightly older NetBSD-current and a Solaris 8 system)
can be affected, and the problem can start a few minutes after rebooting.

We broke into DDB while we had several processes stuck like this, and got
the following stack trace (in this case, the process was an slogin).

bpendtsleep (c024f0cc, 14, c034b685, 0, cb24f0cc) at bpendtsleep
lockmgr (cb24f0cc, 30002, cb24f048, cb38fd80, c019fe7b) at lockmgr+0x656
genfs_lock (cb38fd74, c034cb40, cb24f048, 30002, cb38fd98) at genfs_lock+0x18
VOP_LOCK (cb24f048, 30002) at VOP_LOCK+0x2b
vn_lock (cb24f048, 30002, cb24f048, 0, cb38fdc4) at vn_lock+0x46
vget (cb24f048, 20002, cb24b7dc, cb38fe08, c01983e3) at vget+0xbe
nfs_root (c07fbe00, cb38fe04, cb24b7dc, 0, caf10510) at nfs_root+0x1b
lookup (cb38ff00, bdbdd3c4, cb38ff00, 1, 3426) at lookup+0x38f
namei (cb38ff00, bfbfd3c4, 1, cb38ff88, c0811b80) at namei+0x301
vn_open (cb38ff00, 1, 1a4, cb38ff80, cb2f11e4) at vn_open+0x138
sys_open (cb2f11e4, cb38ff88, cb38ff80) at sys_open+0xac
syscall_plain (1f, 1f, 482063a0, 4, bfbfcea4) at syscall_plain+0x98

(The usual caveat regarding the accuracy of the above trace due to it being
transcribed by hand applies - must get that serial console working!)
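
For anyone reading the trace, the second argument in the vget and vn_lock
frames (20002 and 30002) looks like a lockmgr flag word.  The following is a
minimal sketch for decoding it.  The LK_* values are assumptions taken from
the 4.4BSD-derived sys/lock.h of this era and should be checked against the
header in the source tree actually in use; decode_lock_flags is a hypothetical
helper name, not anything in the kernel.

```python
# Sketch only: the flag values below are assumed, not read from the
# kernel sources this trace came from.  Verify against sys/lock.h.
LK_TYPE_MASK = 0x0000000F

# Lock request types (low four bits of the flag word).
LOCK_TYPES = {
    0x1: "LK_SHARED",
    0x2: "LK_EXCLUSIVE",
}

# Control flags that are OR'd into the request.
LK_FLAGS = {
    0x00010000: "LK_INTERLOCK",
    0x00020000: "LK_RETRY",
}


def decode_lock_flags(flags):
    """Return the symbolic names found in a lockmgr-style flag word."""
    names = []
    lock_type = flags & LK_TYPE_MASK
    if lock_type:
        names.append(LOCK_TYPES.get(lock_type, f"type {lock_type:#x}"))
    rest = flags & ~LK_TYPE_MASK
    for bit, name in sorted(LK_FLAGS.items()):
        if rest & bit:
            names.append(name)
            rest &= ~bit
    if rest:
        # Bits this sketch does not know about are reported, not guessed.
        names.append(f"unknown {rest:#x}")
    return names


if __name__ == "__main__":
    for frame, flags in (("vget", 0x20002), ("vn_lock", 0x30002)):
        print(frame, " | ".join(decode_lock_flags(flags)))
```

If those values hold, the vget frame is asking for an exclusive lock with
retry, and the vn_lock frame is the same request with the interlock flag
added, which would be consistent with the processes sleeping while waiting for
an exclusive vnode lock held elsewhere.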

We looked at several other processes in the same state.  The trace from vn_lock
to bpendtsleep was pretty much identical (some numbers differed), but each took
a different path before vn_lock (e.g. one was: sys_stat13, namei, lookup,
VOP_LOOKUP, nfs_lookup, cache_lookup, vget, vn_lock, ...)

We have also seen some "locking against myself" panics with this kernel but
aren't sure whether they are related.

Does anyone have any suggestions for what (recent?) change might have caused
this?  We assume that it is recent, since the problem occurs so frequently for
us that otherwise we would have expected lots of others to have encountered it
by now.

Thanks,

Duncan