Subject: kern/32318: NFS client or server hang
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: netbsd-bugs
Date: 12/16/2005 19:00:01
>Number:         32318
>Category:       kern
>Synopsis:       NFS client or server hang
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Dec 16 19:00:01 +0000 2005
>Originator:     Manuel Bouyer
>Release:        NetBSD 3.0_RC3
>Organization:
>Environment:
System: NetBSD chassiron.antioche.eu.org 3.0_RC3 NetBSD 3.0_RC3 (CHASSIRON) #0: Sat Nov 26 15:11:16 CET 2005 bouyer@pop.lip6.fr:/local/pop1/bouyer/tmp/sparc/obj/local/pop1/bouyer/netbsd-3/src/sys/arch/sparc/compile/CHASSIRON sparc
Architecture: sparc
Machine: sparc
>Description:
	Setup: I fetch mail from various POP3 servers via fetchmail and
	deliver it to local folders (mbox format) via procmail; the folders
	are stored on an NFS server.
	fetchmail/procmail run on an x86 box (Celeron 500) running a
	months-old current:
NetBSD rochebonne.antioche.eu.org 3.99.7 NetBSD 3.99.7 (ROCHEBONNE) #1: Tue Aug  9 23:54:57 CEST 2005  bouyer@pop.lip6.fr:/local/pop1/bouyer/tmp/i386/obj/local/pop1/bouyer/current/src/sys/arch/i386/compile/ROCHEBONNE i386
	The NFS server is a sparc IPX (40 MHz sparcv7).

	Problem: from time to time, the processes accessing files on
	the NFS server hang. This usually happens when the client makes
	2 concurrent accesses to a mailbox (e.g. reading a mailbox
	with mutt while procmail tries to deliver a mail to it).
	I've also seen this before the 3.0 branch was cut, with the NFS server
	running 2.0 or 2.1. I never noticed it when the server was running
	1.6.2 (it started happening when the server got upgraded).
	Running /etc/rc.d/nfsd restart on the server unwedges the processes
	on the client box.

	Today I managed to reproduce this with a tcpdump running.
	The full trace is at:
	ftp://chassiron.antioche.eu.org/pub/private/nfs.hang.gz
	(the hang begins at 19:19:35, I ran the /etc/rc.d/nfsd restart at
	19:23:03).
	When the processes are stuck, the only traffic between
	the client and server is:
19:19:35.106216 IP rochebonne.antioche.eu.org.82 > chassiron.localhost.nfs: 40 null
19:19:35.108362 IP chassiron.localhost.nfs > rochebonne.antioche.eu.org.82: reply ok 24 null
	
	Before that the server sent a stream of
19:19:24.927421 IP chassiron.localhost.nfs > rochebonne.antioche.eu.org.1098072401: reply ERR 1460
	I'm not sure whether this is normal (is it an error, or a normal
	reply to a read?)
	It also looks like the client opened a second TCP connection at
	19:19:26.792149, maybe for the concurrent accesses?

	To me it looks like this request:
19:19:26.845210 IP rochebonne.antioche.eu.org.809670347 > chassiron.localhost.nfs: 148 lookup fh 25,15/13347 "_bX.uUwoDB.rochebonne.antioch"
	got no reply, and that this is what caused the hang. After the nfsd
	restart, the same request was sent twice; the second one got the reply
	"no such file or directory".
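
	To check the trace for other requests that never got a reply, the
	matching can be automated. The following is only a rough sketch: it
	assumes the older tcpdump text format shown above, where the number
	after the client host name is the RPC transaction id, and the
	function name unanswered() is made up for illustration.

```python
import re

# Call:  "<time> IP <client>.<xid> > <server>.nfs: <len> <op> ..."
CALL = re.compile(r"^(\S+) IP (\S+)\.(\d+) > \S+\.nfs: (.*)$")
# Reply: "<time> IP <server>.nfs > <client>.<xid>: reply ..."
REPLY = re.compile(r"^\S+ IP \S+\.nfs > (\S+)\.(\d+): reply\b")


def unanswered(lines):
    """Return (time, client, xid, request) for each call with no later reply.

    Retransmissions reuse the xid, so only the first timestamp is kept.
    """
    pending = {}
    for line in lines:
        line = line.rstrip("\n")
        m = REPLY.match(line)
        if m:
            # A reply closes the pending call with the same client and xid.
            pending.pop((m.group(1), m.group(2)), None)
            continue
        m = CALL.match(line)
        if m:
            when, client, xid, request = m.groups()
            pending.setdefault((client, xid), (when, client, xid, request))
    return list(pending.values())
```

	It expects one packet per line, e.g. the text printed by tcpdump -r
	on a saved capture; lines it does not recognize are ignored.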

	Now I don't know whether this is a client-side or server-side issue.
	The server seems to lose requests, but is the client supposed to
	retry with NFS over TCP?

>How-To-Repeat:
	Try concurrent accesses to the same file or directory against
	a slow NFS server?
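
	A possible way to generate that kind of load is sketched below. It is
	an untested guess at the access pattern (one writer appending to an
	mbox and creating/removing a temporary file next to it, one reader
	rescanning the mbox); it is not known to trigger the hang, and the
	function names and the "_repro." prefix are made up for illustration.

```python
import os
import sys
import tempfile
import threading


def deliver(mbox, count):
    """Append `count` small messages, creating and removing a temporary
    file in the same directory each time, loosely like a delivery agent."""
    folder = os.path.dirname(mbox) or "."
    for i in range(count):
        fd, tmp = tempfile.mkstemp(prefix="_repro.", dir=folder)
        os.close(fd)
        with open(mbox, "a") as f:
            f.write("From repro@localhost\n\nmessage %d\n\n" % i)
            f.flush()
            os.fsync(f.fileno())  # push the data to the server now
        os.unlink(tmp)


def scan(mbox, count):
    """Re-read the whole mailbox `count` times, as a mail reader would."""
    for _ in range(count):
        if os.path.exists(mbox):
            with open(mbox, "rb") as f:
                f.read()


def run(mbox, count=200):
    """Run one writer and one reader concurrently and return the number
    of messages found in the mailbox afterwards."""
    workers = [
        threading.Thread(target=deliver, args=(mbox, count)),
        threading.Thread(target=scan, args=(mbox, count)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    with open(mbox) as f:
        return sum(1 for line in f if line.startswith("From "))


if __name__ == "__main__":
    # Usage: pass the path of a scratch mbox on the NFS mount.
    print(run(sys.argv[1]))
```

	If the mount wedges, the script simply never returns, like the real
	processes.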
>Fix:
	yes, please