Subject: Re: (long) NFS misbehaving under -current?
To: Brian C. Grayson <bgrayson@marvin.ece.utexas.edu>
From: Frank van der Linden <frank@wins.uva.nl>
List: current-users
Date: 10/09/1998 10:50:53
On Wed, Oct 07, 1998 at 11:56:51PM -0500, Brian C. Grayson wrote:
>   Is anyone else experiencing bad behavior from NFS?  I built a
> new system (sup -o, rm -rf kernel dir, config and build
> kernel, reboot, cleandir userland, build userland, reboot),
> in hopes that it was due to stale .o files or something, but to
> no avail.

Is your kernel compiled with egcs by any chance? If so, try recompiling
it with gcc. egcs has caused problems through code generation bugs; it
certainly did for me, in several ways.

Nothing changed in the NFS code between 1.3G and 1.3H, so I can't find
any other reason for this problem. The only fairly recent change
was that the pool allocator is now used for some things. It could
be that this has uncovered a bug (like it did with the SCSI code),
but I don't think that is the cause in this case.

In the past, I have had problems with Linux NFS servers that would
actually force you to use 1024 byte request sizes, because the
broken Linux driver (or network stack) would mess up otherwise.
But since it worked for you before..

Btw, the tcpdump output looks a bit weird. Are you running tcpdump
on the Linux host? Try running it on a NetBSD host. These lines in
particular look bad:

> 23:47:49.637243 sim6.nfs > marvin.3234575966: reply ok 1460 readdir offset 1 size 688326671 eof (DF) (ttl 64, id 38174)

Port number 3234575966? Size 688326671? Not likely.. Also:

> 23:47:49.638503 sim6.nfs > marvin.1869640494: reply ERR 1460 nop (DF) (ttl 64, id 38175)

..what's that doing there? A rejected RPC? That would suggest packet
corruption, but it's more likely that tcpdump is messing up here.
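
The sanity check applied to the two values quoted above can be sketched
roughly as follows. The helper names are hypothetical, and the size check
assumes a size field should never exceed the reply length tcpdump
reported, which may not hold for every field. One caveat: the tcpdump man
page describes NFS replies as "src.nfs > dst.xid", so the number after the
host name may be the 32-bit RPC transaction ID rather than a port.

```python
# Sanity-check two fields from the tcpdump lines quoted above.
# Helper names are hypothetical; only the 16-bit port limit is a protocol fact.

MAX_PORT = 0xFFFF  # TCP and UDP port numbers are 16 bits wide


def plausible_port(value):
    """Return True if value fits in a 16-bit port number."""
    return 0 <= value <= MAX_PORT


def plausible_size(size, reply_len):
    """Assumption: a size field should not exceed the reply length."""
    return 0 <= size <= reply_len


print(plausible_port(3234575966))       # False
print(plausible_port(2049))             # True (2049 is the usual NFS port)
print(plausible_size(688326671, 1460))  # False
```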

Clients not recovering well from dead servers is an unrelated problem.
Sometimes it may actually just take a long time to recover, because
there's an exponential backoff going on. But it should recover eventually. I know
that sometimes it doesn't, and that soft mounts also may fail to do
what they're expected to. It's been on my list for a while, but..
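
To give a feel for why recovery can take so long, here is a minimal
sketch of an exponential backoff schedule. The starting timeout, the cap
and the doubling factor are assumed values for illustration only, not the
constants the NFS client actually uses, and backoff_schedule is a
hypothetical name.

```python
# Illustrative exponential backoff; all constants are assumptions,
# not values taken from the NFS client code.

def backoff_schedule(initial, retries, cap):
    """Return the wait before each retry, doubling up to a cap."""
    delays = []
    delay = initial
    for _ in range(retries):
        delays.append(min(delay, cap))
        delay *= 2
    return delays


schedule = backoff_schedule(initial=1.0, retries=8, cap=60.0)
print(schedule)       # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
print(sum(schedule))  # 183.0 seconds of waiting across eight retries
```

With these assumed numbers, a client that starts retrying just after the
server comes back can still sit idle for up to a minute before its next
attempt.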

- Frank