netbsd-users: Re: OT: NFS timeout question

Subject: Re: OT: NFS timeout question
To: None <netbsd-users@netbsd.org>
From: Christos Zoulas <christos@zoulas.com>
List: netbsd-users
Date: 12/07/2003 06:05:45

In article <20031206214444.A686@cs24279-4.austin.rr.com>,
Brian Grayson <bgrayson@austin.rr.com> wrote:
>  (This is off-topic, but I have great respect for the knowledge
>of NetBSD users and developers, and I and the sysadmins at work are sort
>of stumped.)
>
>  At work, we're currently having an NFS problem that, unfortunately,
>I'm the only one who can demonstrate (I'm probably
>one of the most annoying^H^H^H^H^H^H^H^Hdemanding users at work!).  It
>appears that when the fileserver is really overloaded (loads over 50 on
>a 16-processor modern Sun), some scripts have their NFS
>read/getattr/etc. operations time out instead of just hanging until the
>server can handle the request.  This causes errors such as being
>unable to change to my home directory, permission denied on my bin/
>directory, etc.  I've been able to correlate my failures to entries in
>/var/adm/messages saying "server XXX not responding."

What errno values do the failing system calls return?
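
One way to collect that information from the client side is a small probe
that performs a getattr-style call and reports the errno when it fails. This
is only a rough sketch: the function name probe() is made up for
illustration, and which errno an overloaded server actually produces is
exactly the open question here.

```python
import errno
import os


def probe(path):
    """Run a getattr-style call on path.

    Returns None on success, or a tuple of
    (errno number, symbolic name, message) on failure.
    """
    try:
        os.stat(path)
    except OSError as e:
        # errno.errorcode maps numbers to names such as "ENOENT".
        name = errno.errorcode.get(e.errno, "UNKNOWN")
        return (e.errno, name, os.strerror(e.errno))
    return None
```

Running probe() in a loop against a path on the NFS mount while the server
is loaded, and logging any non-None result with a timestamp, would make it
possible to line the failures up against the "not responding" entries.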
>
>  It can also cause file corruption if you use the >> operator --
>Solaris sh does an open() request first, and if that fails, does a
>creat().  If the open() on a _valid_ file fails due to NFS weirdness, it
>ends up trunc'ing the file when it does the creat().  I've seen this
>truncating behavior twice so far on my files.

eww, solaris sh.
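
The truncation described in the quoted text can be reproduced without NFS at
all. The following is a minimal sketch, assuming the shell behaves as
described there (open() first, creat() if that fails); it is not taken from
the Solaris sh source. The names flaky_open, append_with_fallback and
append_single_open are hypothetical, and ETIMEDOUT is only a stand-in for
whatever errno the real failure returns.

```python
import errno
import os


def flaky_open(path, flags, mode=0o666):
    """Stand-in for an open() that fails spuriously (assumed errno)."""
    raise OSError(errno.ETIMEDOUT, os.strerror(errno.ETIMEDOUT), path)


def append_with_fallback(path, data, open_fn=os.open):
    """Append the way the quoted text describes: open(), else creat()."""
    try:
        fd = open_fn(path, os.O_WRONLY)
    except OSError:
        # creat() is equivalent to open() with O_WRONLY|O_CREAT|O_TRUNC,
        # so an existing file is emptied here.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o666)
    try:
        os.lseek(fd, 0, os.SEEK_END)
        os.write(fd, data)
    finally:
        os.close(fd)


def append_single_open(path, data, open_fn=os.open):
    """Append with one open() call; a failure leaves the file untouched."""
    fd = open_fn(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o666)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)
```

Calling append_with_fallback(path, data, open_fn=flaky_open) on an existing
file leaves only the new data in it, while append_single_open() raises the
error and leaves the old contents alone.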

>  My limited NFS knowledge says, if you have a hard mount, your
>system calls should _NEVER_ fail, they will just take a Really Long Time
>to complete.  Am I wrong?

No, you are right.

>  Does anyone have any ideas on how to debug this further?  I tried
>using nfsstat on the clients, but didn't see much difference in behavior
>between a machine that seems to have lots of problems and an older
>machine that doesn't.  Since I'm a lowly user at work, I can't even
>log on to the fileserver to dig around there, so I have to rely on
>asking the sysadmins to dig around.

tcpdump?

christos