Subject: Re: (long) NFS misbehaving under -current?
To: Frank van der Linden <frank@wins.uva.nl>
From: Brian C. Grayson <bgrayson@marvin.ece.utexas.edu>
List: current-users
Date: 10/09/1998 09:12:08

On Fri, Oct 09, 1998 at 10:50:53AM +0200, Frank van der Linden wrote:
> On Wed, Oct 07, 1998 at 11:56:51PM -0500, Brian C. Grayson wrote:
> >   Is anyone else experiencing bad behavior from NFS?  I built a
> > new system (sup -o, rm -rf kernel dir, config and build
> > kernel, reboot, cleandir userland, build userland, reboot),
> > in hopes that it was due to stale .o files or something, but to
> > no avail.
> 
> Is your kernel compiled with egcs by any chance? If so, try recompiling
> it with gcc. egcs has caused problems through codegen bugs, it sure
> did in several ways for me.

  gcc -v reports 2.7.2.2+myc1, which means no egcs, right?
The bad readdir behavior appears to be tied to the recent
version of amd -- when I run an old version of amd, Things Are
Good.

> Btw, the tcpdump output looks a bit weird. Are you running tcpdump
> on the Linux host? Try running it on a NetBSD host. These lines especially
> looks bad:
> > 23:47:49.637243 sim6.nfs > marvin.3234575966: reply ok 1460 readdir offset 1 s

  The tcpdump was from a NetBSD machine.  From the source
code, the number after the hostname is the rp->rm_xid field
(print-nfs.c, around line 270), not a port; 3234575966 is too
large to be a 16-bit port number anyway.  Is tcpdump
intentionally printing rm_xid (the RPC message transaction ID)
in place of the port?  If you disable the return statements
after the nfs*_print calls, it falls through to the general
packet printout, and the port numbers come out correct.  (If
the current output is intentional, I'd suggest modifying
nfs*_print to print something like <host>.rpcxid<rm_xid>, to
avoid confusion with the usual <host>.<port> format.)

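To illustrate the xid-versus-port point, here is a minimal sketch. It assumes RPC over UDP, where the transaction ID is the first 32-bit big-endian word of the payload (RPC over TCP adds a record-marking word in front). The helper names `rpc_xid` and `label` are made up for this example and are not tcpdump functions; `label` only mimics the output format suggested above.

```python
import struct

MAX_PORT = 0xFFFF  # UDP/TCP port numbers are 16-bit


def rpc_xid(payload: bytes) -> int:
    """Return the first 32-bit big-endian word of an RPC message (the xid)."""
    if len(payload) < 4:
        raise ValueError("payload too short for an RPC header")
    (xid,) = struct.unpack_from("!I", payload, 0)
    return xid


def label(host: str, xid: int) -> str:
    """Hypothetical label in the suggested <host>.rpcxid<rm_xid> format."""
    return f"{host}.rpcxid{xid}"


# Rebuild a minimal header carrying the value seen in the trace:
# xid, then msg_type (1 = REPLY).
header = struct.pack("!II", 3234575966, 1)
xid = rpc_xid(header)

print(label("marvin", xid))  # marvin.rpcxid3234575966
print(xid > MAX_PORT)        # True: too large to be a port number
```
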
<the ERR packets>

> ..what's that doing there? A rejected RPC? That must mean packet corruption.
> But it's more likely that tcpdump is messing up here.

  Those go away when I switch back to the old amd.  So the new
amd is messing things up, apparently.  This is _very_ repeatable
using the new amd -- the system is basically unusable with Linux
servers.

> Clients not recovering well from dead servers is an unrelated problem.
> Sometimes it may actually just take a long time to recover, there's
> an exponential backoff going on. But it should recover eventually. I know
> that sometimes it doesn't, and that soft mounts also may fail to do
> what they're expected to. It's been on my list for a while, but..

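As a rough illustration of why exponential backoff can make recovery look slow, the sketch below doubles a retransmit timeout on every retry up to a cap. The starting timeout, retry count, and cap are illustrative assumptions, not the values the NetBSD NFS client actually uses, and `backoff_schedule` is a made-up helper.

```python
from typing import List


def backoff_schedule(initial: float, retries: int, cap: float) -> List[float]:
    """Return successive timeouts, doubling each time and clamped to `cap`."""
    timeouts = []
    t = initial
    for _ in range(retries):
        timeouts.append(min(t, cap))
        t *= 2
    return timeouts


# Illustrative numbers only: 1 s initial timeout, 8 retries, 60 s cap.
schedule = backoff_schedule(initial=1.0, retries=8, cap=60.0)
print(schedule)       # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
print(sum(schedule))  # 183.0 seconds spent waiting in total
```

With these assumed numbers, a client that has already backed off to the cap may wait up to a minute after the server returns before it sends its next request.
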
  Actually, I don't have problems with dead servers -- that's
what others added to the thread.  My other NFS problem was that
things like a chmod didn't take effect immediately (sometimes
with more than 20 seconds of lag time).  On systems built a few
months ago, the changes are `instantaneous'.
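
One way to put a number on the chmod lag is sketched below: change the mode, then poll stat() until the new mode shows up. This is only a rough measuring aid, and `chmod_lag` is a made-up helper. The demo at the bottom uses a local temporary file so that it runs anywhere; to measure the NFS behavior you would pass a path on the NFS mount instead.

```python
import os
import stat
import tempfile
import time


def chmod_lag(path, mode, timeout=30.0, interval=0.1):
    """Apply `mode` to `path`, then poll stat() until the new mode is visible.

    Returns the observed lag in seconds, or None if `timeout` expires first.
    """
    start = time.monotonic()
    os.chmod(path, mode)
    while time.monotonic() - start < timeout:
        if stat.S_IMODE(os.stat(path).st_mode) == mode:
            return time.monotonic() - start
        time.sleep(interval)
    return None


# Local demo only; substitute a file on the NFS mount to see real lag.
with tempfile.NamedTemporaryFile() as f:
    lag = chmod_lag(f.name, 0o640)
print(lag is not None)
```
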

  Brian
-- 
"The nuclei of the rare earth elements look either like watermelons or like
  squashed pumpkins." - Dr. Dunning, PHYS 202