Subject: lockinit / vnlock deadlocking
To: None <tech-kern@netbsd.org, port-alpha@netbsd.org>
From: Stephen M. Jones <smj@cirr.com>
List: tech-kern
Date: 11/08/2003 02:40:15
I hope that this can help.  The problem I'm seeing, on nearly a daily
basis, is that one of our NFS clients will hang in a vnlock deadlock.

To help track this problem, I changed the "vnlock" text in lockinit() on
line 538 of kern/vfs_subr.c to "vnlockone".  Just this past week I also
changed the one on line 1101 to "vnlocktwo".

We've had about 5 vnlock deadlocks, and each time it has been "vnlockone",
the one on line 538.
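
To illustrate why the renaming helps, here is a minimal user-space sketch
(not the kernel code).  It assumes only that the string passed to lockinit()
is kept with the lock and reported as the wait message for a process
sleeping on it.  The names demo_lock, demo_lockinit, demo_wait_message and
DEMO_PRIO are hypothetical stand-ins, not NetBSD identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical placeholder priority; not the real value of PVFS. */
#define DEMO_PRIO 0

/* Hypothetical stand-in for the kernel lock structure. */
struct demo_lock {
    int prio;           /* sleep priority given at init time */
    const char *wmesg;  /* wait message given at init time */
};

/* Hypothetical analogue of lockinit(): remember the wait message. */
static void demo_lockinit(struct demo_lock *lk, int prio, const char *wmesg)
{
    assert(lk != NULL && wmesg != NULL);
    lk->prio = prio;
    lk->wmesg = wmesg;
}

/* What a status report would show for a process blocked on this lock. */
static const char *demo_wait_message(const struct demo_lock *lk)
{
    return lk->wmesg;
}
```

With two locks initialised as demo_lockinit(&a, DEMO_PRIO, "vnlockone") and
demo_lockinit(&b, DEMO_PRIO, "vnlocktwo"), the reported wait message
identifies which call site set up the lock a process is stuck on.  With
"vnlock" at both sites the two cases would be indistinguishable.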

If the client is in use when the deadlock occurs, local file access is
usually fine, but remote access (ls, df, et cetera) will hang and a ^T
will report "vnlockone".  Once the deadlock occurs, the machine becomes
swamped by user requests, cron jobs and such, so that the only solution
is a hard reboot.

I don't believe this is an ethernet driver issue (though maybe I'm wrong).
The reason is that the DS10Ls use the tlp driver while the CS20s use the
fxp driver, and both can get deadlocked with the same behaviour.

Depending on the situation, the kernel can panic, but most of the time
it will just hang until the process table has filled.

Some things I'd like to note:

* nfsd options are -tun 12
* there are 7 NFS clients (mounting home directories, webspace & mail)
* typically 80 to 100 users per client at any given time
* NFS is run on a separate network from normal traffic using independent
  ethernet interfaces
* occasionally, though rarely, messages will be seen regarding the server
  not responding, then responding again.
* bufcache on the fileserver is 12% of 1024 MB of RAM
* only two or three nfsd processes actually seem to be busy

For reference, here is the relevant portion of the vfs_subr.c code:

    535         }
    536         vp->v_type = VNON;
    537         vp->v_vnlock = &vp->v_lock;
    538         lockinit(vp->v_vnlock, PVFS, "vnlockone", 0, 0);
    539         cache_purge(vp);
    540         vp->v_tag = tag;
    541         vp->v_op = vops;  
    542         insmntque(vp, mp);
    543         *vpp = vp;
    544         vp->v_usecount = 1;
    545         vp->v_data = 0;
    546         simple_lock_init(&vp->v_uobj.vmobjlock);

If there is any other information I can provide, please let me know.  Also,
since this is happening on a daily basis, I'd be happy to work closely 
with a kernel guru to see if we can sort this out.