Subject: kern/15900: Kernel trap caused by interruptible NFS mounts
To: None <gnats-bugs@gnats.netbsd.org>
From: Artem Belevich <art@riverstonenet.com>
List: netbsd-bugs
Date: 03/13/2002 14:23:33
>Number:         15900
>Category:       kern
>Synopsis:       Kernel trap caused by interruptible NFS mounts
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Mar 13 14:25:02 PST 2002
>Closed-Date:
>Last-Modified:
>Originator:     Artem Belevich
>Release:        NetBSD 1.5.3_ALPHA 2002-02-04
>Organization:
>Environment:
System: NetBSD odin 1.5.3_ALPHA NetBSD 1.5.3_ALPHA (ART) #15: Tue Feb 5 12:51:12 PST 2002 root@odin:/usr/src/sys/arch/i386/compile/ART i386
Architecture: i386
Machine: i386

# Athlon 1.3G, 512M of RAM 
Config: GENERIC + following options
maxusers        64              # estimated number of users
options         NMBCLUSTERS=8192
options         BUFCACHE=50     # %
options         HZ=1000
options         NVNODE=65536

>Description:

I think I've run into yet another NFS glitch in 1.5.3_ALPHA.  It
happens only every other week or so, but it always crashes the kernel
in the same place. And the box is pretty busy compiling stuff most of
the time.
 
Here's the stack trace (it's been copied off the screen by hand) plus
source line location.
 
sigpending1()+0xf ; kern/kern_sig.c:451 
nfs_sigintr()+0x2c ; nfs/nfs_socket.c:1472 
nfs_timer()+0x51 ; nfs/nfs_socket.c:1344 
softclock()+0x121 
hardclock() 
clockintr() 
Xintr0() 
 
sigpending1 causes a kernel trap while trying to read p->p_siglist.
"p" is taken from one of the elements on the nfs_timer_ch list and
seems to point to a proc structure that no longer exists, in a page
that is no longer mapped.
 
nfs_sigintr doesn't bother to check for pending signals unless the
mount point has the "interruptible" flag, so for now I've switched
the mount points to be non-interruptible.
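
To make the gating described above concrete, here is a minimal
user-space sketch in C. It is not the kernel source: every name in it
(sketch_proc, sketch_mount, sketch_req, SKETCH_MNT_INT,
sketch_sigintr) is a hypothetical stand-in, and it only assumes that
the check is skipped entirely when the mount is not interruptible.

```c
#include <stddef.h>

/* Hypothetical stand-in for the "interruptible" mount flag. */
#define SKETCH_MNT_INT 0x1

/* Hypothetical stand-ins for the proc, mount and request structures. */
struct sketch_proc  { unsigned int siglist; };
struct sketch_mount { int flags; };
struct sketch_req   { struct sketch_mount *mnt; struct sketch_proc *proc; };

/*
 * Returns 1 if the request should be interrupted, 0 otherwise.
 * The proc pointer is dereferenced only on interruptible mounts,
 * so a non-interruptible mount never touches it.
 */
static int
sketch_sigintr(const struct sketch_req *rep)
{
	if (!(rep->mnt->flags & SKETCH_MNT_INT))
		return 0;	/* non-interruptible: proc is never read */
	if (rep->proc == NULL)
		return 0;	/* no process attached to the request */
	/* A stale, non-NULL pointer would pass the test above and fault here. */
	return rep->proc->siglist != 0;
}
```

Note that a NULL test cannot catch a pointer to a proc structure that
has already been freed, which would be consistent with the trap
showing up only on interruptible mounts.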

>How-To-Repeat:
	It's fairly hard to reproduce.
	In my case there are about 10-20 active users, 20-40 NFS-mounted
	directories (via amd), and the box is pretty busy most of the
	time doing compilation(s) in those NFS-mounted directories. Amd
	map entries all have opts:=soft,intr. Swap usage is moderate
	(0-130MB) with some idle processes swapped out.

	The crash usually seems to happen when particularly heavy NFS
	activity coincides with some kind of temporary network issue.
	A couple of times I saw an "fxp0: device timeout" message
	right before the crash. Other times there were "nfs server
	not responding/nfs server is alive again" messages.
 
>Fix:
	Here's what I think can be a workaround:
	The code that causes the crash seems to be executed only if
	the NFS mount point is interruptible. Mounting NFS filesystems
	in non-interruptible mode should prevent the kernel panic. This
	is yet to be confirmed, as my boxes have been running with the
	workaround enabled for only two days.
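
	As an illustration of the workaround, an amd map entry would
	change roughly as shown here. The key, server name and export
	path are made-up placeholders; only the opts:= value comes
	from the setup described above, and the sketch assumes that
	leaving "intr" out gives a non-interruptible mount.

```
# before: interruptible soft mount
src   type:=nfs;rhost:=fileserver;rfs:=/export/src;opts:=soft,intr

# after: "intr" dropped
src   type:=nfs;rhost:=fileserver;rfs:=/export/src;opts:=soft
```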
>Release-Note:
>Audit-Trail:
>Unformatted: