Subject: Re: How can I help with hung vnlock()'ed clients?
To: Stephen M. Jones <smj@cirr.com>
From: Artem Belevich <art@riverstonenet.com>
List: tech-kern
Date: 11/04/2003 18:06:55
I occasionally see this problem on my NetBSD-1.6.1/i386 boxes.  In my
case NetBSD is the client, with a NetApp file server and a bunch of
Solaris boxes as NFS servers. Everything is heavily automounted (and
unmounted, too).

The problem usually happens on the box with Intel's i82559 NIC
(if_fxp). The NIC in question occasionally gets stuck for about a
minute, prints "fxp0: device timeout" on the console and starts
working again. This may or may not have something to do with the
problem.

The box with a 3Com card (if_ex) gets stuck occasionally too, but way
less often (once every few months vs. once every few weeks).

The good news is that sometimes everything recovers if you do 'ls' on a
filesystem mounted from the same place the locked file comes from.
Chances are about 50/50. The bad news is that if you're unlucky,
you'll get another stuck process. :-(
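
For what it's worth, the 'ls' poke can be scripted so that the shell
doing the poking doesn't wedge as well. Below is a rough sketch in
Python, assuming a reasonably modern Python 3 on the client. The
helper name finishes_in_time and the 10-second default timeout are
made up for illustration, and nothing in it is NetBSD-specific. It
starts the command as a child, polls it until a deadline, and
deliberately never blocks waiting for a child that may be stuck in
the kernel.

```python
import subprocess
import sys
import time


def finishes_in_time(argv, timeout=10.0, poll_interval=0.1):
    """Run argv as a child process.

    Return True if it exits within `timeout` seconds, False otherwise.
    (Hypothetical helper; the name and defaults are illustrative only.)
    """
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return True  # child exited (any exit status counts as "not hung")
        time.sleep(poll_interval)
    # Best-effort kill only. We do not wait() here: a child blocked
    # inside the kernel may ignore the signal and never be reaped.
    proc.kill()
    return False


if __name__ == "__main__":
    # Usage: python poke.py /path/to/mount1 /path/to/mount2 ...
    for mount_point in sys.argv[1:]:
        ok = finishes_in_time(["ls", mount_point])
        print(mount_point, "ok" if ok else "no answer within timeout")
```

Keep in mind the caveat above: each poke is itself a process touching
the mount, so on an unlucky day the child started here simply becomes
the next stuck process.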

--Artem

On Thu, Oct 30, 2003 at 08:50:56PM -0600, "Stephen M. Jones" <smj@cirr.com> wrote:
> Hi there!
> 
> I'm wondering what is the best way I can help with a problem I've experienced
> and seen others experience for a couple of years now.  The problem is
> where an NFS client will get 'hung' in a vnlock() and will most likely not
> recover.  Since I have the opportunity to run NetBSD in a high-usage
> environment, I thought I might be able to help out .. but I want to know
> what is the best way to.
> 
> Typically the NFS server and clients run the same kernel version which is
> built from the 1.6.1 release tree.  I currently save a trace and ps when
> in the debugger and then dump core to preserve the memory image.  Unfortunately
> for performance reasons I cannot run a kernel with symbols on the server,
> but I can on the (a) client.  Note, the server never seems to have the vnlock
> problem, only the clients .. so the server will typically run for weeks
> without an issue .. it's the clients that can get in a vnlock() hang anywhere
> from 2 hours to 2 weeks after booting.
> 
> If you have any tips or suggestions on how I can get evidence together so 
> we can nail this one, please let me know .. I'll do my best.  I'm also aware
> that the saved cores should only be looked at by trusted eyes.
> 
> Thanks
> Stephen
>