tech-net: NFS hangs on 2.0 client

Subject: NFS hangs on 2.0 client
To: None <tech-net@netbsd.org>
From: Jeff Rizzo <riz@tastylime.net>
List: tech-net
Date: 01/07/2005 09:45:28
I had a brief discussion about this on netbsd-help, but the problem just 
happened again, so I thought I'd solicit wider opinions.

I have an NFS client and NFS server both running NetBSD/i386 2.0, with 
MP kernels.  The network interfaces are both fxp (Intel 82559), and they 
are connected by a LAN switch which appears to be operating normally;  
non-NFS traffic appears to go correctly between the hosts, and a second 
NetBSD/i386 box (running 2.0_BETA) is accessing (albeit read-only) a 
share from the server without problems.  The mounts are all UDP.  (The 
hanging mounts are rw, and have the "soft" and "intr" flags.)

The symptoms are these:  upon a fresh boot, the client can access NFS 
shares on the server just fine;  I'm doing pkgsrc bulk builds on the 
client, and storing the built packages on one of the NFS volumes.  After 
some period of time (this time it was ~36 hours), all NFS accesses from 
this client hang.  From what I can tell using tcpdump, an 'ls' on the 
NFS share generates NO traffic between client and server.  Following 
suggestions from the mount_nfs man page, I looked at the output of 
'netstat -s' for info on fragments and UDP.

On the client, I see:

        1598454 fragments received
        1 fragment dropped (dup or out of space)
        0 fragments dropped (out of ipqent)
        0 malformed fragments dropped
        44 fragments dropped after timeout
<snip>
udp:
        6704322 datagrams received
        0 with incomplete header
        0 with bad data length field
        2 with bad checksum
        76 dropped due to no socket
        0 broadcast/multicast datagrams dropped due to no socket
        26 dropped due to full socket buffers
        6704218 delivered
        6700262 PCB hash misses
        6842098 datagrams output

On the server, I see:

        6085968 fragments received
        1 fragment dropped (dup or out of space)
        0 fragments dropped (out of ipqent)
        0 malformed fragments dropped
        277 fragments dropped after timeout
<snip>
udp:
        8236329 datagrams received
        0 with incomplete header
        0 with bad data length field
        0 with bad checksum
        862 dropped due to no socket
        0 broadcast/multicast datagrams dropped due to no socket
        0 dropped due to full socket buffers
        8235467 delivered
        8025761 PCB hash misses
        8247713 datagrams output
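
As a quick sanity check on the UDP figures above, here is a minimal Python
sketch that tests whether "datagrams received" equals "delivered" plus the
listed drop counters on each host.  The dictionary keys and the function
name are made up for illustration, and the accounting identity itself is
an assumption about how the counters relate, not something taken from the
netstat documentation.

```python
# UDP counters copied from the two 'netstat -s' listings above.
# The key names are illustrative; they are not kernel or netstat identifiers.
client = {
    "received": 6704322,
    "incomplete_header": 0,
    "bad_length": 0,
    "bad_checksum": 2,
    "no_socket": 76,
    "bcast_no_socket": 0,
    "full_sockbuf": 26,
    "delivered": 6704218,
}
server = {
    "received": 8236329,
    "incomplete_header": 0,
    "bad_length": 0,
    "bad_checksum": 0,
    "no_socket": 862,
    "bcast_no_socket": 0,
    "full_sockbuf": 0,
    "delivered": 8235467,
}

DROP_KEYS = (
    "incomplete_header",
    "bad_length",
    "bad_checksum",
    "no_socket",
    "bcast_no_socket",
    "full_sockbuf",
)


def unaccounted(udp):
    """Return received minus delivered minus every listed drop counter."""
    dropped = sum(udp[key] for key in DROP_KEYS)
    return udp["received"] - udp["delivered"] - dropped


for name, udp in (("client", client), ("server", server)):
    print(name, "unaccounted datagrams:", unaccounted(udp))
    print(name, "full-socket-buffer drop fraction:",
          udp["full_sockbuf"] / udp["received"])
```

For these numbers the unaccounted count comes out to zero on both hosts,
so the UDP counters at least look self-consistent.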


The count of fragments dropped after timeout does not appear to be 
incrementing, and also does not seem overly large to me, given the total 
number of fragments.  'nfsstat -w 1' on the client shows _no_ activity.
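
To put a number on "not overly large", the following Python sketch pulls
the two fragment counters out of 'netstat -s' text and computes the
fraction dropped after timeout.  It also compares two snapshots to see
whether a counter has moved.  The helper names (counter, timeout_fraction,
is_incrementing) are hypothetical, and the parsing assumes the exact line
wording shown in the listings above.

```python
import re

# Excerpts of the 'netstat -s' output quoted earlier in this message.
CLIENT = """\
        1598454 fragments received
        1 fragment dropped (dup or out of space)
        44 fragments dropped after timeout
"""
SERVER = """\
        6085968 fragments received
        1 fragment dropped (dup or out of space)
        277 fragments dropped after timeout
"""

TIMEOUT = "fragments dropped after timeout"
RECEIVED = "fragments received"


def counter(text, label):
    """Return the integer that precedes 'label' on its own line."""
    pattern = r"^\s*(\d+) " + re.escape(label) + r"\s*$"
    match = re.search(pattern, text, re.MULTILINE)
    if match is None:
        raise ValueError("counter not found: " + label)
    return int(match.group(1))


def timeout_fraction(text):
    """Fraction of received fragments that were dropped after timeout."""
    return counter(text, TIMEOUT) / counter(text, RECEIVED)


def is_incrementing(before, after, label):
    """True if the counter grew between two netstat -s snapshots."""
    return counter(after, label) > counter(before, label)


for name, text in (("client", CLIENT), ("server", SERVER)):
    print(name, "timeout drop fraction:", timeout_fraction(text))
```

With the figures above, the fraction is well under 0.01% on both hosts.
Feeding is_incrementing() two captures taken a minute or so apart would be
one way to confirm that the timeout counter really is standing still.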

A reboot of the client clears this up, so I'm going to leave the system 
in this state for a little while, in case anyone has suggestions for 
what I might check.  Does anyone have any thoughts?

Thanks,
+j