Subject: Re: Nfs clients get frozen when NFS server crashes...
To: None <netbsd-users@netbsd.org, tech-kern@netbsd.org>
From: Greg A. Woods <woods@weird.com>
List: tech-kern
Date: 10/03/2000 23:39:15
[ On Wednesday, October 4, 2000 at 08:27:29 (+1100), Robert Elz wrote: ]
> Subject: Re: Nfs clients get frozen when NFS server crashes... 
>
> This is my standard mount ...
> 
> munkora-mmlab:/local/tech /local/tech nfs rw,nosuid,-b,-P,-c,-i,-s
> 
> which (aside from nodev which certainly isn't relevant, but I should add)
> translates into much the same as you are using.

On the other hand I have no trouble keeping my diskless workstation
working after the server reboots.  I use only 'rw' in the mount flags
(and right now my server and client are still running 1.3.2).  At one
point the server was being hit with some bogus IP packet that was
causing an alignment fault in the routing code and was rebooting as much
as every few hours.  Meanwhile the diskless client had an uptime of well
over a month (I wish I could say as much for its Xserver process!).

While the server is down the diskless client does become totally stuck
(sometimes even the locally running xclock stops ticking if the Xserver
or xclock has to page), as does any process on a non-diskless client
that either tries to do a plain "df" or tries to access the mounted
partition in any way.
However it usually only takes a very few seconds for the clients to wake
up after the server is fully running.

I've been using "-b,-i,rw" on one of the diskful clients that mounts
home directories from this server, but that doesn't seem to make any
difference, at least not for short-term (< 4 hr) downtimes.  Oddly
enough this does *not* allow one to interrupt a "df", which I would
consider a bug.

However I don't really find any of this behaviour any different from
what it was in my previous SunOS-4.1 environment either.  In fact I don't
remember ever having to reboot an NFS client after the server rebooted,
on either SunOS-4 or any NetBSD....

I do suspect the weak UDP checksum of causing corruption in my diskless
client and eventually causing it to crash though.  Normally I have no
"bad checksum" counts, but at times there seem to be inexplicable
crashes of processes that would tend to indicate something has been
buggered up unexpectedly.  The only other explanations would be memory
errors (though this is a Sparc-1+ with true parity memory, and I'd
expect to get at least an indication of memory problems before things
fail completely) or a very wild pointer in the kernel somewhere.  One of
these days I'll get around to upgrading these machines....
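
To illustrate why the UDP checksum counts as weak, here is a minimal
sketch of the 16-bit one's-complement sum it is built on (in the style
of RFC 1071).  The function name and the sample payloads are made up for
this example, and only the payload is summed; a real UDP checksum also
covers a pseudo-header, which does not change the property shown.
Because the sum is commutative, reordered 16-bit words and errors that
cancel each other out go undetected.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum over data (illustrative only)."""
    if len(data) % 2:
        data += b"\x00"          # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]   # big-endian 16-bit word
    while total >> 16:           # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Hypothetical payloads: same 16-bit words, different order.
original = b"AAAABBBB"
reordered = b"BBAAAABB"

# Hypothetical payloads: one word goes up by 1, another goes down by 1.
clean = bytes([0x01, 0x00, 0x02, 0x00])
damaged = bytes([0x01, 0x01, 0x01, 0xFF])

same_after_reorder = internet_checksum(original) == internet_checksum(reordered)
same_after_damage = internet_checksum(clean) == internet_checksum(damaged)
```

Both comparisons come out equal even though the data differs, so this
kind of damage would not show up in the "bad checksum" counters.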

The only thing I haven't figured out is why my window manager (ctwm) on
my NCD-X11 stations freezes (even though it's running fine on an
otherwise unaffected server).  It does have its stderr redirected to a
file on the NFS partition, but normally nothing's being written there
and I wouldn't expect fflush() to do anything if there was nothing to
write.
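
As a rough illustration of that expectation, the sketch below uses
Python's io.BufferedWriter as a stand-in for a C stdio stream.  This is
only an analogy and not what ctwm or libc actually does; CountingRaw and
writes_caused_by_flush are hypothetical names invented for the example.
It counts how many writes reach the underlying "file" when a flush is
requested with and without pending data.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical stand-in for a file descriptor; counts write calls."""

    def __init__(self):
        super().__init__()
        self.write_calls = 0

    def writable(self):
        return True

    def write(self, data):
        self.write_calls += 1
        return len(data)         # pretend everything was written


def writes_caused_by_flush(payload: bytes) -> int:
    """Return how many raw writes a single flush() triggers."""
    raw = CountingRaw()
    buffered = io.BufferedWriter(raw)
    if payload:
        buffered.write(payload)  # small payload stays in the buffer
    before = raw.write_calls
    buffered.flush()
    return raw.write_calls - before


empty_flush_writes = writes_caused_by_flush(b"")
pending_flush_writes = writes_caused_by_flush(b"some warning\n")
```

With an empty buffer the flush never touches the underlying file, so in
this model a flush alone would not block on a dead NFS server; only a
flush with pending data issues a write.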

-- 
							Greg A. Woods

+1 416 218-0098      VE3TCP      <gwoods@acm.org>      <robohack!woods>
Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>