Subject: Severe problem with mountd
To: None <netbsd-help@NetBSD.ORG>
From: Tim Rightnour <root@garbled.net>
List: netbsd-help
Date: 04/19/1998 04:51:51
I seem to be having severe problems with mountd on a heavily loaded local NFS
network where every machine runs NetBSD 1.3.

I have 7 machines.  Four of them have their own root partition and NFS mount
/usr.  One is the server to them all, and the other two (sparcs) are mostly
self-sufficient, but get a few NFS mounts from the server (src and whatnot).

I *should* be able to take down the NFS server and bring it back up, allowing all
the other machines to reconnect to it and be happy.  Instead what happens is
the following:

Server goes down.. not surprisingly, all the other machines immediately go into
an NFS-hang state.  Server comes back up.  Ideally the other machines all see
the server again, happily reconnect, and resume normal life.  Once in a while
this happens.. but more often than not, one or two machines will come back to
life fully, while the other machines either don't come back at all, or have a
few mounts restored while the rest just hang.

Once the server is back up.. nothing will bring the others back to life.
Killing mountd and nfsd and restarting them, or hupping mountd, does nothing to
save the other machines.  If I attempt to reboot them all at once, they
immediately re-hang on the /usr mount.  If I bring all the hung machines down,
and bring them back up one at a time.. they will be happy again.

This is very abnormal behavior.  I'm not familiar enough with mountd to tell
exactly what is going on here.. and I'm certainly not very excited about the
prospect of doing tests over and over with my server to take shots in the dark
at it.

I launch my server with nfsd -tun 8

I also have other problems with mountd..  such as hanging on boot while reading
the exports line.. as can be seen on one of my slaves, which exports its own
small filesystem:

pollux# mountd -d
Getting export list.
Got line /distrib -alldirs -network 192.168.10 -maproot=811:100
Making new ep fs=0x7,0x78b
doing opt -alldirs -network 192.168.10 -maproot=811:100
doing opt -network 192.168.10 -maproot=811:100

It will just hang here.  If I let the machine sit for about a day, and then run
it again.. it goes the rest of the way through, and works just fine.  Sometimes
it hangs, other times it doesn't.  It's very unpredictable and transient.
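
To make the debug output above easier to read, here is a rough Python sketch of
how that exports line breaks down into a directory and its options.  It is only
an illustration, not how mountd itself parses the file: the function name
split_exports_line is made up, and it assumes a line like the one above, with
options only and no trailing host list.  The "steps" it records mirror the
"doing opt" lines, which at least shows that the last option mountd started on
before hanging was -network.

```python
def split_exports_line(line):
    """Split an exports-style line into directories, options, and parse steps.

    Hypothetical helper for illustration only; assumes the line holds
    directories followed by options, with no host names at the end.
    """
    tokens = line.split()

    # Leading tokens that look like paths are the exported directories.
    dirs = []
    while tokens and tokens[0].startswith("/"):
        dirs.append(tokens.pop(0))

    opts = []
    steps = []
    while tokens:
        # Record what is left to process, like mountd's "doing opt" lines.
        steps.append(" ".join(tokens))
        tok = tokens.pop(0)
        if not tok.startswith("-"):
            raise ValueError("unexpected token: " + tok)
        name, sep, value = tok[1:].partition("=")
        if not sep:
            # No "=": the value, if any, is the next non-option token.
            value = None
            if tokens and not tokens[0].startswith("-"):
                value = tokens.pop(0)
        opts.append((name, value))
    return dirs, opts, steps


if __name__ == "__main__":
    line = "/distrib -alldirs -network 192.168.10 -maproot=811:100"
    dirs, opts, steps = split_exports_line(line)
    print(dirs)
    print(opts)
    for step in steps:
        print("doing opt", step)
```

The second "doing opt" line this prints is the same one mountd stopped at.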

Any help would be appreciated.. It's making maintenance on the server a
nightmare.  The slave machines run without monitors and keyboards, so I have
to plug them in to see what is going on with them.  Rebooting all my machines
can take an hour or two.. beyond the initial downtime of the server itself.

---
Tim Rightnour    -  root@garbled.net
http://www.zynetwc.com/~garbled/garbled.html