tech-kern: Re: NFS wedging.

Subject: Re: NFS wedging.
To: Todd Whitesel <toddpw@best.com>
From: Jim Reid <jim@rfc1035.com>
List: tech-kern
Date: 01/26/2000 12:28:34
>>>>> "Todd" == Todd Whitesel <toddpw@best.com> writes:

    Todd> Something I have gotten in a habit of recently, is running
    Todd> lots of rm jobs in parallel because it seems to keep the
    Todd> kernel busy while the disk is seeking. I have a cheap script
    Todd> method that tries to remove stuff in roughly 'find' order so
    Todd> it's not totally random across the disk and it does seem to
    Todd> speed things up on most of my machines.

    Todd> However, on the diskless machines, it's a lose, and when I
    Todd> forget, it causes real trouble because it appears to wedge
    Todd> NFS hard.

    Todd> Specifically, a 333mhz iMac mounting root/swap from a 125mhz
    Todd> HPUX 10.20 box which holds up quite well under load.

    Todd> When I start running, say, -j16 rm's, very shortly the iMac
    Todd> complains that the server is not responding, exactly once. I
    Todd> believe NFS is hung from that point on, because further
    Todd> attempts to use the iMac result in more and more processes
    Todd> getting blocked on something over NFS, until the whole iMac
    Todd> is effectively wedged.

Yeah. It looks like sending 16 simultaneous NFS requests to the HPUX
server - plus whatever others are doing NFS to that box - gives it a
very hard time. As a general rule of thumb, NFS clients and servers
have to be balanced. It's bad news if an NFS client is faster at
sending requests than the server is at processing them. This should
not be a surprise to anyone.

    Todd> nfsiod is at the default of 4, but I don't understand why
    Todd> overloading them would have such a nasty effect.

I doubt your nfsiod processes are the problem. They only handle
read-ahead and write-behind, so they're probably not doing anything
while your highly parallel remove is in progress. It's the NFS code on
the server that'll be hurting. And you're stressing that very heavily.

Each NFS unlink request forces the server to write the directory file
back to disk, so there may well be nasty synchronisation issues for the
HPUX kernel in such extreme circumstances. Directory operations are
among the most expensive operations in NFS. And if the server is too
slow at replying, your client rm processes time out and retransmit the
same unlink request, which piles yet more work on the server and
produces a vicious circle. Experimenting with the NFS mount parameters
- timeouts, etc - and the number of *server* side NFS daemons might
alleviate the problem. So might NFS-over-TCP because this'll introduce
some flow-control and backoff when the other end is busy.
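
To make the retransmission feedback a bit more concrete, here is a toy
model in Python. It is only a sketch under made-up assumptions, not a
description of how any real NFS client or server behaves: the server
handles exactly one request per tick, it has no duplicate request cache,
every client keeps a single unlink outstanding, and the function name
simulate() and all the numbers are invented for illustration. It
compares a timeout that is comfortably long, a short fixed timeout, and
a short timeout that doubles on every retransmission.

```python
from collections import deque


def simulate(clients, ops_per_client, timeout, backoff=False, max_ticks=10_000):
    """Toy model: the server completes one queued request per tick.

    Each client keeps one unlink outstanding and resends it whenever
    `timeout` ticks pass without a reply. With backoff=True the timeout
    doubles after every resend and resets once the reply arrives.
    """
    queue = deque()                     # server's FIFO of (client, op_number)
    done = [0] * clients                # operations completed per client
    next_send = [0] * clients           # tick at which each client (re)sends
    cur_timeout = [timeout] * clients
    transmissions = wasted = 0

    for tick in range(max_ticks):
        # Clients: send a new request, or resend one that has timed out.
        for c in range(clients):
            if done[c] < ops_per_client and tick >= next_send[c]:
                queue.append((c, done[c]))
                transmissions += 1
                next_send[c] = tick + cur_timeout[c]
                if backoff:
                    cur_timeout[c] *= 2

        # Server: handle exactly one request per tick.
        if queue:
            c, op = queue.popleft()
            if op == done[c]:            # first copy of the current request
                done[c] += 1
                cur_timeout[c] = timeout
                next_send[c] = tick + 1  # next unlink goes out on the next tick
            else:                        # stale duplicate: work for nothing
                wasted += 1

        if all(d == ops_per_client for d in done):
            return {"finished": True, "ticks": tick + 1,
                    "transmissions": transmissions, "wasted": wasted}

    return {"finished": False, "ticks": max_ticks,
            "transmissions": transmissions, "wasted": wasted}


if __name__ == "__main__":
    for label, kwargs in [("long timeout ", {"timeout": 100}),
                          ("fixed timeout", {"timeout": 8}),
                          ("with backoff ", {"timeout": 8, "backoff": True})]:
        print(label, simulate(clients=16, ops_per_client=5, **kwargs))
```

With the long timeout nothing is ever resent. With the short fixed
timeout the 16 clients between them generate requests faster than the
server can retire them, so the queue fills with duplicates and each new
unlink waits longer than the one before it.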

The default number of NFS processes is usually sub-optimal too. A good
heuristic there is to have one process for each disk spindle and network
interface plus a couple of "spares", i.e. every one of those processes
can be blocked in an I/O wait and there are still some left to be
scheduled for further NFS requests. [The number 4 seems to be a legacy from the
earliest days of NFS on a Sun3.] Other considerations for the number
of NFS processes are the thundering herd problem (waking every
nfs(io)d for one incoming request) and CPU/MMU context thrashing. IIRC
this was why some early SPARCs didn't run more than 8 nfsds: Sun
wanted to be sure that some of the MMU contexts could be kept free
so that they'd be available for applications.
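
The sizing heuristic above is simple enough to write down as
arithmetic. The Python below is just that rule of thumb, not a tuning
recommendation for any particular system; the function name, the
default of two spares and the optional cap (standing in for limits like
the 8 nfsds mentioned for early SPARCs) are assumptions made for the
example.

```python
def suggested_nfsd_count(disk_spindles, network_interfaces, spares=2, cap=None):
    """One server process per spindle and per interface, plus a few spares."""
    if disk_spindles < 0 or network_interfaces < 0 or spares < 0:
        raise ValueError("counts must be non-negative")
    count = disk_spindles + network_interfaces + spares
    if cap is not None:
        # Hypothetical upper limit, e.g. to leave MMU contexts free.
        count = min(count, cap)
    return count


if __name__ == "__main__":
    # A server with 4 disks and 1 network interface.
    print(suggested_nfsd_count(4, 1))
    # A bigger box, but capped at 8 processes.
    print(suggested_nfsd_count(8, 2, cap=8))
```

For a server with 4 disks and 1 interface this gives 7 processes rather
than the default 4.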

Hope this helps.