Subject: NFS timeo and retrans parameters via amd
To: None <netbsd-users@netbsd.org>
From: Aaron J. Grier <agrier@poofygoof.com>
List: netbsd-users
Date: 04/25/2005 23:51:42
I left for vacation early last week, and when I returned to my dungeon I
was greeted by hung NFS mounts on my mailserver, and 999 messages in the
postfix queue, even though neither client nor server had rebooted.

NFS client: dual 50MHz sparcstation 20, NetBSD 2.0.1
NFS server: alphaserver 1000A 5/400, NetBSD 2.0.2_STABLE

linked via 10mbit half duplex during the original misbehavior.

the only rational explanation I can think of is that some critical RPC
packets got lost on the wire, and there's something wrong in the guts of
the NFS client timer code such that it doesn't properly send retries.

I haven't seen misbehavior like this since the 1.4.x days, but since the
server was still up and rpcinfo -p looked normal, I figured that
something had simply gotten lost on the client.  so I rebooted it.  (it
would sync the disks and then hang, but that may or may not be related.)

when it came back up, it started working through the backlog, delivered
about a dozen or so messages (thanks bogofilter) and then the dreaded
"nfs server alpha:/usr/home: not responding" appeared.  that's when I
figured there was some timeout knob that needed to be tweaked, so I
started hitting the mount_nfs(8) and am-utils.info docs.

I've got everything running through amd(8), so I figured I could just
add the relevant knob tweakage to /defaults and things would be happy.
so I did just that:

$ ypmatch /defaults amd.net
opts:=intr,retrans=500,timeo=10

but after a reboot it didn't seem to affect anything:

$ mount -vv
[...snip...]
alpha:/usr/home on /amd/alpha/usr/home type nfs (writes: sync 0 async 0, [nfs: addr=10.0.0.28, port=2049, addrlen=16, sotype=1, proto=6, fhsize=0, flags=0x8258<timeo,retrans,intr,nfsv3,resvport>, wsize=8192, rsize=8192, readdirsize=8192, timeo=100, retrans=100, maxgrouplist=16, readahead=2, leaseterm=30, deadthresh=9])

and I was back in the same situation: mounts hung right before my eyes
after delivering a dozen or so mails.

so I fiddled around some more, and added a couple more tweaks to the
/defaults field in amd.net:

$ ypmatch /defaults amd.net
opts:=dumbtimr,retry,rsize=4096,wsize=4096,retrans=500,intr
$ mount -vv
[...snip...]
alpha:/usr/home on /amd/alpha/usr/home type nfs (writes: sync 0 async 0, [nfs: addr=10.0.0.28, port=2049, addrlen=16, sotype=1, proto=6, fhsize=0, flags=0x8256<wsize,rsize,retrans,intr,nfsv3,resvport>, wsize=4096, rsize=4096, readdirsize=4096, timeo=300, retrans=100, maxgrouplist=16, readahead=2, leaseterm=30, deadthresh=9])

it figured out rsize and wsize, but retrans isn't affected.  why did
timeo change?  I didn't touch it.
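
for comparing what was requested against what actually took effect, a small helper like the following can pull a single parameter out of the mount -vv output.  this is only a sketch: nfs_param is a made-up name, and it assumes the "name=value" layout shown in the output above.

```sh
# nfs_param NAME: read mount -vv output on stdin and print the numeric
# value of NAME (e.g. timeo, retrans, rsize) for each line that has one.
# hypothetical helper; assumes the "name=value" layout seen in this message.
nfs_param() {
    # require a space or comma before NAME so that "rsize" does not match
    # inside another word; the names listed in flags=<...> carry no "=",
    # so they are skipped.
    sed -n "s/.*[ ,]$1=\([0-9][0-9]*\).*/\1/p"
}

# usage (not run here):
#   mount -vv | grep '^alpha:/usr/home ' | nfs_param retrans
```

fed the alpha:/usr/home line above, this prints 100 for retrans and 300 for timeo, against the retrans=500 that was requested.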

so I decided to dig a little further and see if I could add the flags
manually to mount_nfs:

# mount_nfs -o rsize=8192,wsize=8192 -d -T -x 500 -i alpha:/usr/home /mnt
# mount -vv
[...snip...]
arwen:/usr/home on /mnt type nfs (writes: sync 0 async 0, [nfs: addr=10.0.0.28, port=2049, addrlen=16, sotype=1, proto=0, fhsize=0, flags=0x8a56<wsize,rsize,retrans,intr,nfsv3,dumbtimr,resvport>, wsize=8192, rsize=8192, readdirsize=8192, timeo=300, retrans=100, maxgrouplist=16, readahead=2, leaseterm=30, deadthresh=9])

it's my understanding that "-x 500" should change retrans to 500, but
mount -vv still reports retrans=100.
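
as a sanity check on the flags word, here is a rough decoder.  the bit values are an assumption: they are inferred only from the three flags=0x... values and their <...> name lists in the mount -vv outputs in this message, not copied from the kernel headers, so nfsmount.h remains the authority.  decode_nfs_flags is a made-up name.

```sh
# decode_nfs_flags HEX: print the names of the bits set in an NFS mount
# flags word.  bit values are inferred from the mount -vv outputs in this
# message (assumption, not taken from nfsmount.h); bits not listed here
# are silently ignored.
decode_nfs_flags() {
    flags=$(($1))
    out=""
    for pair in 0x2:wsize 0x4:rsize 0x8:timeo 0x10:retrans 0x40:intr \
                0x200:nfsv3 0x800:dumbtimr 0x8000:resvport; do
        bit=${pair%%:*}     # text before the colon: the bit mask
        name=${pair#*:}     # text after the colon: the flag name
        if [ $((flags & $bit)) -ne 0 ]; then
            out="${out:+$out,}$name"
        fi
    done
    echo "$out"
}

decode_nfs_flags 0x8a56
```

for 0x8a56 this lists retrans among the set bits, which matches what mount -vv printed: the retrans flag is being passed along, but the value shown is still 100 rather than 500.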

WTF is going on?  do I have my fingers on the right knobs?  what I'm
doing should be working, correct?

right now it looks like mount_nfs isn't properly setting mount
parameters, so I'm not surprised that amd can't either.  I'm trying to
avoid debugging the NFS client code if possible, but there seems to be a
paucity of NetBSD NFS experts, and while I have many more desirable
things to do with my time than diddle with the bowels of NFS, I'm not
beyond it if someone can point me in the right direction.

-- 
  Aaron J. Grier | "Not your ordinary poofy goof." | agrier@poofygoof.com