Subject: kern/36610: nfs mutex bug causes crash
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: None <dholland@eecs.harvard.edu>
List: netbsd-bugs
Date: 07/06/2007 04:20:01
>Number:         36610
>Category:       kern
>Synopsis:       nfs mutex bug causes crash
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Jul 06 04:20:00 +0000 2007
>Originator:     David A. Holland <dholland@eecs.harvard.edu>
>Release:        NetBSD 4.99.20 (20070615)
>Organization:
   Harvard EECS
>Environment:
System: NetBSD tanaqui 4.99.20 NetBSD 4.99.20 (TANAQUI) #16: Thu Jul 5 22:39:28 EDT 2007 root@tanaqui:/usr/src/sys/arch/i386/compile/TANAQUI i386
Architecture: i386
Machine: i386
>Description:

For a while I've been getting occasional hangs on my desktop machine.
Building with LOCKDEBUG converted them to panics that deadlock trying
to sync, which results in no crash dump and thus not a lot of useful
information. Tonight I accidentally found a way to reliably provoke
the problem (downloading binutils-2.17.tar.gz from ftp.gnu.org) so I
was able to get a backtrace from ddb.

And it turns out that in nfs_asyncio() there's a path through the loop
on the condition variable that loses the mutex: nm_lock is released
before the nfs_sigintr() check and never retaken afterward. I don't
know if this is *the* problem I've been having, but it seems fairly
likely; and in any event it's definitely *a* problem.
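
To illustrate the pattern (not the kernel code itself), here is a
user-space sketch using POSIX threads. Everything in it is hypothetical:
struct mount_state, wait_for_worker() and polls_until_ready are stand-ins
invented for the example, and an error-checking pthread mutex takes the
place of LOCKDEBUG, assuming the platform reports an unlock of an unheld
mutex as EPERM. The loop drops the lock after each wait for work that
should not run under it; the reacquire flag selects whether the lock is
taken back before the loop continues.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for the per-mount state guarded by nm_lock. */
struct mount_state {
	pthread_mutex_t lock;
	pthread_cond_t cv;
	int polls_until_ready;	/* simulates a worker becoming free later */
};

static int
mount_state_init(struct mount_state *ms, int polls)
{
	pthread_mutexattr_t attr;
	int error;

	error = pthread_mutexattr_init(&attr);
	if (error != 0)
		return error;
	/* Error-checking mutexes report unlock-without-lock as EPERM. */
	error = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	if (error == 0)
		error = pthread_mutex_init(&ms->lock, &attr);
	pthread_mutexattr_destroy(&attr);
	if (error != 0)
		return error;
	ms->polls_until_ready = polls;
	return pthread_cond_init(&ms->cv, NULL);
}

/*
 * Wait until the simulated worker is free.  Returns 0 if the lock was
 * still held at the end, or the first locking error seen otherwise.
 */
static int
wait_for_worker(struct mount_state *ms, bool reacquire)
{
	struct timespec deadline;
	int error;

	error = pthread_mutex_lock(&ms->lock);
	assert(error == 0);

	while (ms->polls_until_ready > 0) {
		/* Sleep on the condition variable for about 1 ms. */
		clock_gettime(CLOCK_REALTIME, &deadline);
		deadline.tv_nsec += 1000000L;
		if (deadline.tv_nsec >= 1000000000L) {
			deadline.tv_sec++;
			deadline.tv_nsec -= 1000000000L;
		}
		error = pthread_cond_timedwait(&ms->cv, &ms->lock, &deadline);
		if (error != 0 && error != ETIMEDOUT)
			return error;	/* e.g. waiting without the lock */
		ms->polls_until_ready--;	/* still under the lock here */

		/* Drop the lock for work that must not run under it. */
		pthread_mutex_unlock(&ms->lock);
		/* ... an interruption check would go here ... */

		/* The step the patch adds: retake the lock afterward. */
		if (reacquire)
			pthread_mutex_lock(&ms->lock);
	}

	return pthread_mutex_unlock(&ms->lock);
}
```

Under those assumptions, wait_for_worker(&ms, true) finishes with the
lock held and returns 0, while wait_for_worker(&ms, false) reaches the
final unlock without owning the mutex and gets EPERM back.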

I've been working against the last version I happened to update to,
because I didn't want to disturb anything, but the problem is still in
today's current.

Patch follows. With it, I can download binutils. Dunno if it's going to 
help my long-term stability yet, obviously...

>How-To-Repeat:
Good luck. But at least you can inspect the code and see that it's wrong...

>Fix:

Index: nfs_bio.c
===================================================================
RCS file: /cvsroot/src/sys/nfs/nfs_bio.c,v
retrieving revision 1.156
diff -U7 -r1.156 nfs_bio.c
--- nfs_bio.c	12 Jun 2007 09:42:27 -0000	1.156
+++ nfs_bio.c	6 Jul 2007 02:53:24 -0000
@@ -824,14 +824,15 @@
 				mutex_exit(&nmp->nm_lock);
 				if (nfs_sigintr(nmp, NULL, curlwp))
 					return (EINTR);
 				if (catch) {
 					catch = false;
 					slptimeo = 2 * hz;
 				}
+				mutex_enter(&nmp->nm_lock);
 			}
 
 			/*
 			 * We might have lost our iod while sleeping,
 			 * so check and loop if nescessary.
 			 */