Subject: unkillable nfsd processes
To: None <netbsd-help@netbsd.org>
From: theo borm <theo_nbsdhelp@borm.org>
List: netbsd-help
Date: 11/23/2004 22:45:52
Hi,

The situation is this: I have one server with a few big HDs and a number
of diskless clients mounting root over NFS. Under normal circumstances
these clients crunch their numbers happily, but once in a while I run
more filesystem-intensive tasks, such as compiling programs, and then
clients start to (reliably) die with "nfs server not responding" errors.
What is worse, after this has happened to a few clients, the NFS server
itself stops responding to NFS requests. (Both server and clients keep
responding to ICMP requests.)

Now, normally I would expect such a problem to cure itself after
a while (timeout), or after root sends a few signals to the relevant
processes. Not in this case, however: the problem persists (for at least
2 hours) and all "nfsd: server" processes are *unkillable*. If I read the
man page correctly ("9    KILL (non-catchable, non-ignorable kill)"),
signal 9 cannot be ignored, and yet nfsd seems to be doing just that.
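
To illustrate what I mean by "non-catchable, non-ignorable", here is a
minimal Python sketch. The helper name can_ignore is made up for this
example. It only shows that an ordinary user-space process is not allowed
to opt out of signal 9; it says nothing about what nfsd is doing inside
the kernel.

```python
import signal


def can_ignore(signum):
    """Return True if this process is allowed to ignore `signum`.

    Hypothetical helper, for illustration only. It tries to set the
    disposition to SIG_IGN and restores the previous one afterwards.
    """
    try:
        previous = signal.signal(signum, signal.SIG_IGN)
    except OSError:
        # The kernel refuses to change the disposition (as for SIGKILL).
        return False
    if previous is not None:
        # Put the original disposition back so the check has no side effects.
        signal.signal(signum, previous)
    return True


if __name__ == "__main__":
    for name in ("SIGTERM", "SIGKILL"):
        print(name, "can be ignored:", can_ignore(getattr(signal, name)))
```

So a process that shrugs off signal 9 is not simply ignoring it the way
it could ignore, say, SIGTERM.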

Under "normal" (non-stressed) circumstances all these processes /can/ be
killed; after NFS stress a (last resort) "shutdown -h" on the server hangs
indefinitely after "synching disks 1 1", and a (hard) reboot results in a
lengthy fsck and a rebuild of parity on the raid set.

The problem is a.f.a.i.c.t. absent from kernel versions prior to 1.6.1, and
I can confirm that I see the problem in 1.6.2 STABLE and 2.0BETA
kernels. I noticed that I can prevent the problem by reducing the
receive and transmit block sizes (mount options), but /client side/
configuration tweaks to prevent /server/ crashes are not my favorite, and
besides, I have yet to figure out how to do this on (PXE) netbooted
clients which mount root over NFS.

Does all this ring a bell? Does anyone know what makes nfsd unkillable,
and how to kill rogue nfsd processes anyway?

with kind regards,

Theo Borm