Subject: kern/8624: NFS servers crash when clients run dev_mkdb
To: None <email@example.com>
From: Miles Nordin <carton@Ivy.NET>
Date: 10/14/1999 04:33:10
>Synopsis: nfs client activity can crash an nfs server
>Responsible: kern-bug-people (Kernel Bug People)
>Arrival-Date: Thu Oct 14 03:00:01 1999
>Originator: Miles Nordin
Miles Nordin / v:1-888-857-2723 fax:+1 530 579-8680
555 Bryant Street PMB 182 / Palo Alto, CA 94301-1700 / US
NetBSD/sparc-current NFS server with a NetBSD/i386 snapshot netbooted client
System: NetBSD casey 1.4L NetBSD 1.4L (CASEY) #4: Wed Oct 13 23:44:58 MDT 1999 carton@casey:/scratch/src/sys/arch/sparc/compile/CASEY sparc
When a NetBSD machine netboots off casey (above), casey crashes while the
netbooted client is running dev_mkdb. The problem happens with at least
NetBSD/i386 and NetBSD/mac68k clients. It happens with a server running
current-19990909 and current-19991012, and possibly older versions.
Strangely, it does not happen every time: I was able to boot my nfsrooted
client a few times after I first installed it. But once it starts
happening, it happens every time from then on. For example, it started
happening with my mac68k, so I gave up on the mac68k for a while. Then I
installed the i386 and it worked for a while, until it started happening
on the i386, too. In that sense it is repeatable, but unfortunately it
doesn't happen on absolutely every netboot of a client. Perhaps clients
only work until I've finished configuring them? I'm not sure.
Commenting out dev_mkdb in /etc/rc on the client seems to eliminate the
problem. So far the netbooted client works fine under light load, and
casey doesn't crash.
The kernel on casey (the NFS server) is compiled with DIAGNOSTIC. I built
another kernel with -g, but I couldn't get a core dump, and I don't know
how to make use of netbsd.gdb without one. Can anyone help with this?
Here is the ddb output. I typed it by hand, so there may be an error or
two. I can recrash the machine and double-check it if needed.
panic: nfsd: locking botch in op 3
Stopped in nfsd at Debugger+0x4: jumpl [%o7 + 0x8], %g0
nfssvc_nfsd(0x0, 0x2, 0xf0345d20, 0xf01e2028, 0xf01e9720, 0xf1ae9dc0) at nfssvc_nfsd+0x6a4
sys_nfssvc(0x0, 0xf1ae9f28, 0xf1ae9f20, 0xf00e6d70, 0xeffffa18, 0xf1ae9fb0) at sys_nfssvc+0x5b8
syscall(0x9b, 0xf1ae9fb0, 0x0, 0x1, 0x0, 0xf1ae9fb0) at syscall+0x1fc
_syscall(0x4, 0x21b28, 0x18, 0x10c60, 0x217c8, 0x10108) at _syscall+0x120
dumping to dev 7,1 offset 143327
dump error 19
Here is a tcpdump from around the time of the crash:
02:32:49.728082 18.104.22.168.1498621590 > 22.214.171.124.2049: 108 lookup fh 9,2/1162674 "rwd3f"
02:32:49.730557 126.96.36.199.2049 > 188.8.131.52.1498621590: reply ok 236 lookup fh 9,2/1163515
02:32:49.731693 184.108.40.206.1498621591 > 220.127.116.11.2049: 108 lookup fh 9,2/1162674 "rwd3g"
02:32:49.734141 18.104.22.168.2049 > 22.214.171.124.1498621591: reply ok 236 lookup fh 9,2/1163516
02:32:49.735403 126.96.36.199.1498621592 > 188.8.131.52.2049: 108 lookup fh 9,2/1162674 "rwd3h"
02:32:49.737874 184.108.40.206.2049 > 220.127.116.11.1498621592: reply ok 236 lookup fh 9,2/1163517
02:32:49.739021 18.104.22.168.1498621593 > 22.214.171.124.2049: 104 lookup fh 9,2/1162674 "sd0a"
02:32:49.780410 126.96.36.199.1498621593 > 188.8.131.52.2049: 104 lookup fh 9,2/1162674 "sd0a"
02:32:49.870335 184.108.40.206.1498621593 > 220.127.116.11.2049: 104 lookup fh 9,2/1162674 "sd0a"
02:32:50.040236 18.104.22.168.1498621593 > 22.214.171.124.2049: 104 lookup fh 9,2/1162674 "sd0a"
I would really like to get this problem fixed. If someone is interested
in working on it, I can devote a significant amount of time to doing
whatever you suggest, and I can crash this box as often as necessary.
There's nothing unusual about this offer; I just want to confirm that
it's no trouble for me if you need more information.
To repeat: boot a NetBSD diskless client from a NetBSD NFS server.
Workaround: don't run dev_mkdb on diskless clients.