netbsd-users: Re: OT: NFS timeout question

Subject: Re: OT: NFS timeout question
To: Christos Zoulas <christos@zoulas.com>
From: Brian Grayson <bgrayson@austin.rr.com>
List: netbsd-users
Date: 12/07/2003 23:18:06

On Sun, Dec 07, 2003 at 06:05:45AM +0000, Christos Zoulas wrote:
> In article <20031206214444.A686@cs24279-4.austin.rr.com>,
> Brian Grayson <bgrayson@austin.rr.com> wrote:
> >  (This is off-topic, but I have great respect for the knowledge
> >of NetBSD users and developers, and I and the sysadmins at work are sort
> >of stumped.)
> >
> >  At work, we're currently having an NFS problem that
> >unfortunately I'm the only one who can demonstrate it (I'm probably
> >one of the most annoying^H^H^H^H^H^H^H^Hdemanding users at work!).  What
> >it appears is, when the fileserver is really overloaded (loads over 50 on
> >a 16-processor modern Sun), some scripts appear to have their NFS
> >read/getattr/etc.  operations time out instead of just hanging until the
> >server can handle the request.  This causes error messages like being
> >unable to change to my home directory, permission denied on my bin/
> >directory, etc.  I've been able to correlate my failures to entries in
> >/var/adm/messages saying "server XXX not responding."
> 
> What are the errno's when system calls fail?

  Unfortunately, we don't have truss output.  And since the failing
process is the sh that cron forks off to run my cron job (in this
case), I can't easily slap a truss on it.  I could try to get the
sysadmins to run cron itself under truss, though -- thanks for the idea.
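
  Another way to get at the errno without truss would be to have the
cron job itself record it.  A minimal sketch, assuming Python is
available on the clients; the names probe and log_failures, the choice
of operations, and the log format are all mine, not anything we
currently run:

```python
import errno
import os
import time


def probe(path):
    """Try stat, listdir and chdir on path; return (operation, errno name) pairs."""
    failures = []
    start_dir = os.getcwd()  # remembered so the chdir probe can be undone
    try:
        for name, op in (("stat", os.stat),
                         ("listdir", os.listdir),
                         ("chdir", os.chdir)):
            try:
                op(path)
            except OSError as exc:
                # Translate the numeric errno into its symbolic name.
                failures.append((name, errno.errorcode.get(exc.errno, str(exc.errno))))
    finally:
        os.chdir(start_dir)
    return failures


def log_failures(path, logfile):
    """Append one timestamped line per failed operation; return the failure count."""
    failures = probe(path)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(logfile, "a") as fh:
        for name, code in failures:
            fh.write("%s %s %s %s\n" % (stamp, name, path, code))
    return len(failures)
```

  The log file would need to live on local disk rather than on the NFS
mount, so that writing the log does not depend on the server being
debugged.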

> >  My limited NFS knowledge says, if you have a hard mount, your
> >system calls should _NEVER_ fail, they will just take a Really Long Time
> >to complete.  Am I wrong?
> 
> No, you are right.

  Whew!  I don't want to blow my reputation at work....
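
  It is probably worth double-checking that the mounts on the failing
clients really are hard.  A rough sketch, assuming a mount table whose
whitespace-separated fields are device, mount point, filesystem type
and options (the layout of Solaris /etc/mnttab and Linux /proc/mounts);
the function name is hypothetical, and treating a missing "soft" as
"hard" is an assumption about the client's defaults:

```python
def nfs_mount_flags(mount_table_text):
    """Map each NFS mount point to a (is_soft, is_intr) pair of booleans."""
    flags = {}
    for line in mount_table_text.splitlines():
        fields = line.split()
        # Expect at least: device, mount point, fstype, options.
        if len(fields) < 4 or not fields[2].startswith("nfs"):
            continue
        options = set(fields[3].split(","))
        flags[fields[1]] = ("soft" in options, "intr" in options)
    return flags
```

  Any mount point reported with is_soft set would explain timeouts
surfacing as errors instead of hangs.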

> >  Does anyone have any ideas on how to debug this further?
> 
> tcpdump?

  The problem is, it only happens about once a day, and each time on
just one of around 7 machines (I have the cron entry installed to run
every 5 minutes on numerous machines, and only get around 1 failure a
day total), and we have four (overtaxed) servers for an 800-person chip
design and verification group.  The traffic from just an hour, let
alone a day, on just those 7 machines would be enormous.

  I can see if the admins are willing to take several machines off the
batch farm so that the tcpdump traffic to them would be much lighter
and easier to trace.
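
  One way to keep the capture size bounded might be a rotating capture
restricted to NFS traffic to one server.  A sketch that only builds the
command line, assuming a tcpdump build that supports the -C and -W
rotation options (on the Suns themselves snoop would be the native
tool, and this does not apply to it); the function name, interface,
snap length and sizes are placeholders:

```python
def ring_capture_cmd(iface, server, prefix, file_mb=100, file_count=10):
    """Build a tcpdump command that rotates through file_count capture files."""
    return [
        "tcpdump",
        "-i", iface,             # interface to listen on
        "-s", "256",             # snap length: placeholder, headers only
        "-C", str(file_mb),      # start a new file after this many million bytes
        "-W", str(file_count),   # cap the number of files, overwriting the oldest
        "-w", prefix,            # base name for the capture files
        "host", server, "and", "port", "2049",  # NFS traffic to one server
    ]
```

  The resulting list could be handed to subprocess.Popen and the
capture stopped shortly after the cron job reports a failure, so that
the files still on disk cover the failing window.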

  Thanks for the ideas and nudges!

  Brian