netbsd-users: OT: NFS timeout question

Subject: OT: NFS timeout question
To: None <netbsd-users@netbsd.org>
From: Brian Grayson <bgrayson@austin.rr.com>
List: netbsd-users
Date: 12/06/2003 21:44:44

  (This is off-topic, but I have great respect for the knowledge
of NetBSD users and developers, and I and the sysadmins at work are sort
of stumped.)

  At work, we're currently having an NFS problem that, unfortunately,
only I seem to be able to demonstrate (I'm probably one of the most
annoying^H^H^H^H^H^H^H^Hdemanding users at work!).  It appears that when
the fileserver is really overloaded (load averages over 50 on a
16-processor modern Sun), some scripts have their NFS
read/getattr/etc. operations time out instead of just hanging until the
server can handle the request.  This causes errors such as being
unable to change to my home directory, permission denied on my bin/
directory, etc.  I've been able to correlate my failures with entries in
/var/adm/messages saying "server XXX not responding."

  It can also cause file corruption if you use the >> operator --
Solaris sh does an open() request first, and if that fails, does a
creat().  If the open() on a _valid_ file fails due to NFS weirdness, it
ends up trunc'ing the file when it does the creat().  I've seen this
truncating behavior twice so far on my files.
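
  To make the failure mode concrete, here is a rough Python sketch of
the open()-then-creat() sequence described above.  It is not the actual
Solaris sh source; failing_open() is a hypothetical stand-in that
simulates a transient NFS error, and append_single_open() is just one
possible alternative (a single open with O_APPEND|O_CREAT) shown for
contrast.

```python
import errno
import os
import tempfile


def failing_open(path, flags, mode=0o666):
    """Hypothetical stand-in for an open() that fails transiently."""
    raise OSError(errno.ETIMEDOUT, "simulated NFS timeout", path)


def append_fallback(path, data, open_fn=os.open):
    """Append the way the post describes: open(), and on ANY failure creat()."""
    try:
        fd = open_fn(path, os.O_WRONLY)
    except OSError:
        # creat(path, mode) behaves like open(path, O_WRONLY|O_CREAT|O_TRUNC, mode),
        # so an existing file is emptied here.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o666)
    else:
        os.lseek(fd, 0, os.SEEK_END)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)


def append_single_open(path, data, open_fn=os.open):
    """Alternative: one open() call; a transient error propagates instead of truncating."""
    fd = open_fn(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o666)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)


def demo():
    """Show that a failed open() on a valid file leads to truncation."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "log.txt")
        with open(path, "wb") as f:
            f.write(b"old\n")
        append_fallback(path, b"new\n", open_fn=failing_open)
        with open(path, "rb") as f:
            return f.read()
```

  With the simulated failure, demo() leaves only the newly written line
in the file: the earlier contents are gone, which matches the
truncation I saw.  With append_single_open() the same simulated failure
would surface as an error and leave the file untouched.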

  My limited NFS knowledge says that with a hard mount, your
system calls should _NEVER_ fail; they will just take a Really Long Time
to complete.  Am I wrong?
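
  One thing that can be checked from the client side is which options
the mounts actually use.  The sketch below classifies a comma-separated
option string such as the flags that "nfsstat -m" prints.  It assumes
the usual mount_nfs semantics, where "soft" lets a request return an
error once retries are exhausted and "intr" lets a signal interrupt a
call on a hard mount, and it assumes hard and intr are the defaults
when neither is listed.  The function name and the sample strings are
made up for illustration.

```python
def classify_nfs_options(options):
    """Classify a comma-separated NFS mount option string.

    Assumption: a mount is hard and interruptible unless "soft" or
    "nointr" appears explicitly.
    """
    opts = {o.strip() for o in options.split(",") if o.strip()}
    if "soft" in opts:
        # Requests may fail with an error after the retry limit.
        return "soft"
    if "nointr" in opts:
        # Requests retry until the server answers.
        return "hard-nointr"
    # Requests retry, but a signal can interrupt the call.
    return "hard-intr"


# Hypothetical option strings, not taken from a real machine.
for sample in ("rw,soft,vers=3", "rw,hard,nointr", "rw,hard,intr,vers=3"):
    print(sample, "->", classify_nfs_options(sample))
```

  If any of the affected mounts turned out to be "soft", that alone
could account for operations failing instead of hanging.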

  Does anyone have any ideas on how to debug this further?  I tried
using nfsstat on the clients, but didn't see much difference in behavior
between a machine that seems to have lots of problems and an older
machine that doesn't.  Since I'm a lowly user at work, I can't even
log on to the fileserver to dig around there, so I have to rely on
asking the sysadmins to dig around.

  Many thanks in advance.  The problem happens sporadically (about once
a day), and since we're in production mode the sysadmins are leery
of trying to force it, since that could affect all the users,
so it hasn't been easy to debug.

  Brian Grayson