Subject: kern/8624: NFS servers crash when clients run dev_mkdb
To: None <>
From: Miles Nordin <carton@Ivy.NET>
List: netbsd-bugs
Date: 10/14/1999 04:33:10
>Number:         8624
>Category:       kern
>Synopsis:       nfs client activity can crash an nfs server
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people (Kernel Bug People)
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Oct 14 03:00:01 1999
>Originator:     Miles Nordin
>Organization:
Miles Nordin / v:1-888-857-2723 fax:+1 530 579-8680
555 Bryant Street PMB 182 / Palo Alto, CA 94301-1700 / US
>Release:        current-19991012
>Environment:
NetBSD/sparc-current NFS server with a NetBSD/i386 snapshot netbooted client
System: NetBSD casey 1.4L NetBSD 1.4L (CASEY) #4: Wed Oct 13 23:44:58 MDT 1999     carton@casey:/scratch/src/sys/arch/sparc/compile/CASEY sparc

>Description:
When a NetBSD machine netboots off casey (above), casey crashes while the
netbooted client is running dev_mkdb.  The problem happens with at least
NetBSD/i386 and NetBSD/mac68k clients, and with a server running
current-19990909 or current-19991012, possibly older.  Strangely, it does
not happen right away: I was able to boot my nfsrooted client a few times
after I first installed it.  But once it starts happening, it happens every
time from then on.  For example, it started happening with my mac68k, so I
gave up on the mac68k for a while.  Then I installed the i386, which worked
for a while before it started crashing the server too.  So the crash is
repeatable once it has started, but it does not happen on absolutely every
netboot of a newly installed client.  Perhaps clients only work until I've
gotten them configured right?  Not sure.

Commenting out dev_mkdb in /etc/rc on the client seems to eliminate
the problem.  So far the netbooted client works fine under light load, and
casey doesn't crash.

casey (the NFS server) runs a kernel compiled with DIAGNOSTIC.  I built
another kernel with -g, but I couldn't get a core dump and don't know
how to make use of netbsd.gdb without one.  Help with this?

Here is the output of ddb, typed by hand so there might be an error or two.
I can recrash the machine and double-check it if this becomes a problem.

panic: nfsd: locking botch in op 3
Stopped in nfsd at	Debugger+0x4:	jumpl		[%o7 + 0x8], %g0
db> t
nfssvc_nfsd(0x0, 0x2, 0xf0345d20, 0xf01e2028, 0xf01e9720, 0xf1ae9dc0) at nfssvc_nfsd+0x6a4
sys_nfssvc(0x0, 0xf1ae9f28, 0xf1ae9f20, 0xf00e6d70, 0xeffffa18, 0xf1ae9fb0) at sys_nfssvc+0x5b8
syscall(0x9b, 0xf1ae9fb0, 0x0, 0x1, 0x0, 0xf1ae9fb0) at syscall+0x1fc
_syscall(0x4, 0x21b28, 0x18, 0x10c60, 0x217c8, 0x10108) at _syscall+0x120
db> c
dumping to dev 7,1 offset 143327
dump error 19
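
For reference, the "op 3" in the panic message can be decoded with a small
table.  This is only a sketch.  It assumes the number is an NFS version 3
procedure number as assigned in RFC 1813; I have not checked against the
kernel source that nfsd reports the op in that numbering.  The names
NFSV3_PROCS and op_name are made up for illustration.

```python
# NFS version 3 procedure numbers, as listed in RFC 1813.
# Assumption: the "op" in the panic message uses this numbering.
NFSV3_PROCS = [
    "NULL", "GETATTR", "SETATTR", "LOOKUP", "ACCESS", "READLINK",
    "READ", "WRITE", "CREATE", "MKDIR", "SYMLINK", "MKNOD",
    "REMOVE", "RMDIR", "RENAME", "LINK", "READDIR", "READDIRPLUS",
    "FSSTAT", "FSINFO", "PATHCONF", "COMMIT",
]


def op_name(op):
    """Return the procedure name for an op number, or a placeholder."""
    if 0 <= op < len(NFSV3_PROCS):
        return NFSV3_PROCS[op]
    return "unknown(%d)" % op


print(op_name(3))
```

Under that assumption op 3 is LOOKUP, which would be consistent with the
last request seen on the wire before the server stops answering.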

Here is a tcpdump from around the time of the crash:

02:32:49.728082 > 108 lookup fh 9,2/1162674 "rwd3f"
02:32:49.730557 > reply ok 236 lookup fh 9,2/1163515
02:32:49.731693 > 108 lookup fh 9,2/1162674 "rwd3g"
02:32:49.734141 > reply ok 236 lookup fh 9,2/1163516
02:32:49.735403 > 108 lookup fh 9,2/1162674 "rwd3h"
02:32:49.737874 > reply ok 236 lookup fh 9,2/1163517
02:32:49.739021 > 104 lookup fh 9,2/1162674 "sd0a"
02:32:49.780410 > 104 lookup fh 9,2/1162674 "sd0a"
02:32:49.870335 > 104 lookup fh 9,2/1162674 "sd0a"
02:32:50.040236 > 104 lookup fh 9,2/1162674 "sd0a"
[...more retransmits...]
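
The request that never gets answered can be picked out of a trace like this
mechanically.  The following is a rough sketch, not a general tcpdump parser:
it assumes the abbreviated line format shown above, and, since the trace
shows no RPC transaction ids, it assumes each reply answers the most recent
lookup.  The helper names (to_seconds, unanswered_lookups) are hypothetical.

```python
import re

# Trace lines copied from the report above.
TRACE = """\
02:32:49.728082 > 108 lookup fh 9,2/1162674 "rwd3f"
02:32:49.730557 > reply ok 236 lookup fh 9,2/1163515
02:32:49.731693 > 108 lookup fh 9,2/1162674 "rwd3g"
02:32:49.734141 > reply ok 236 lookup fh 9,2/1163516
02:32:49.735403 > 108 lookup fh 9,2/1162674 "rwd3h"
02:32:49.737874 > reply ok 236 lookup fh 9,2/1163517
02:32:49.739021 > 104 lookup fh 9,2/1162674 "sd0a"
02:32:49.780410 > 104 lookup fh 9,2/1162674 "sd0a"
02:32:49.870335 > 104 lookup fh 9,2/1162674 "sd0a"
02:32:50.040236 > 104 lookup fh 9,2/1162674 "sd0a"
"""

LOOKUP = re.compile(r'^(\S+) > \d+ lookup fh \S+ "([^"]+)"$')
REPLY = re.compile(r'^\S+ > reply ok \d+ lookup fh \S+$')


def to_seconds(stamp):
    """Convert an hh:mm:ss.usec timestamp to seconds since midnight."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def unanswered_lookups(trace):
    """Map each lookup name that got no reply to the times it was sent."""
    sent = {}     # name -> list of send times in seconds
    last = None   # name in the most recent lookup request
    for line in trace.splitlines():
        line = line.strip()
        request = LOOKUP.match(line)
        if request:
            stamp, last = request.groups()
            sent.setdefault(last, []).append(to_seconds(stamp))
        elif REPLY.match(line) and last is not None:
            # Assumption: the reply belongs to the latest request.
            sent.pop(last, None)
            last = None
    return sent


for name, times in unanswered_lookups(TRACE).items():
    gaps = [round(b - a, 3) for a, b in zip(times, times[1:])]
    print(name, "sent", len(times), "times, gaps (s):", gaps)
```

On the lines above this reports only "sd0a", sent four times with the gap
roughly doubling each time, which is what a client retransmitting to a dead
server would look like.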

I would really like to get this problem fixed, so if someone is interested
in working on it I can devote a significant amount of my time to doing
what you tell me, and can crash this box as often as necessary.  Not that
there's anything out of the ordinary about this offer; I'm just confirming
that it's no problem for me if you want more info.

>How-To-Repeat:
Boot a NetBSD diskless client from a NetBSD NFS server.
>Fix:
Don't run dev_mkdb on diskless clients.