Subject: Re: NFS V3 problems?
To: Jesus M. Gonzalez <jgb@gsyc.inf.uc3m.es>
From: Frank van der Linden <frank@wins.uva.nl>
List: current-users
Date: 03/03/1997 13:25:18
Quoting Jesus M. Gonzalez,

> 	Last Friday we saw something similar. While doing some
> compilation on machine hola, on an NFS-mounted directory belonging to machine
> raistlin, the disk of raistlin became full. Some space was released in 
> raistlin (by deleting things in other directories). Since then, things 
> in hola began to work in strange ways.
> An "ls" in hola, of the directory where the compilation had taken 
> place showed only part of the directories and files which were 
> actually present there (according to an "ls" in raistlin). Killing the
> nfsd in raistlin was enough to get a consistent view of the files.

If files are removed in a directory on the server while you're halfway
through reading that NFS-mounted directory on the client, inconsistencies
will occur. What makes it worse is that the buffer cache can't handle
the 64-bit offset cookies that you need to keep for NFS directories, so
they're stored separately, and invalidated separately. This can lead to
some other problems, so as an "it's not as bad as the other options"
solution I disabled that last invalidation. It doesn't quite solve the
problem. Other attempts at solving this, like the one I've seen in the
FreeBSD code, don't quite cut it either. There's no perfect solution to it.
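
To make the failure mode concrete, here is a toy model in Python. It is
only a sketch: the "directory" is a plain list, the cookie is simply a
list index, and read_chunk() is a made-up helper, not anything from the
NetBSD or FreeBSD code. Real servers use filesystem-specific offsets as
cookies, but the effect of a removal between two reads is similar.

```python
def read_chunk(directory, cookie, count):
    """Return up to `count` names starting at position `cookie`,
    plus the cookie to use for the next read (hypothetical helper)."""
    names = directory[cookie:cookie + count]
    return names, cookie + len(names)


server_dir = ["a.c", "b.c", "c.c", "d.c", "e.c", "f.c"]

# The client reads the first chunk and remembers the cookie.
seen, cookie = read_chunk(server_dir, 0, 3)

# Meanwhile, a file the client has already listed is removed on the server.
server_dir.remove("b.c")

# The client resumes from its saved cookie, but every later entry has
# shifted down by one position, so one entry is skipped.
more, cookie = read_chunk(server_dir, cookie, 3)
seen += more

# Entries that exist on the server but never reached the client.
missing = sorted(set(server_dir) - set(seen))
print(seen)
print(missing)
```

In this model the client ends up with a listing that still contains the
removed "b.c" and never sees "d.c", even though "d.c" was present the
whole time.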

What makes it even worse is that the NFSv3 protocol has the means to
detect these inconsistencies, but nobody quite knows how to handle
the error returned by the RPC in these cases. I had to disable
the check in the NetBSD server code to keep Solaris 2.5 clients happy
(Solaris 2.5 servers don't do these checks at all, and clients
just bailed out in the getdents() system call when they got the RPC error).
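
For reference, the protocol mechanism is the cookie verifier that READDIR
carries alongside the cookie, with NFS3ERR_BAD_COOKIE as the error when it
no longer matches. The sketch below is a toy model, not the NetBSD code:
ToyDirServer, list_directory() and the change counter used as a verifier
are all invented for illustration, and "restart the listing from the
beginning" is just one possible client reaction, not something any of the
implementations mentioned here is known to do.

```python
NFS3_OK = 0
NFS3ERR_BAD_COOKIE = 10003  # numeric value as defined in RFC 1813


class ToyDirServer:
    """Hypothetical in-memory directory with a cookie verifier."""

    def __init__(self, names):
        self.names = list(names)
        self.verf = 1  # assumption: a counter bumped on every change

    def remove(self, name):
        self.names.remove(name)
        self.verf += 1

    def readdir(self, cookie, cookieverf, count):
        # A cookie of 0 starts a fresh read, so the verifier is not checked.
        if cookie != 0 and cookieverf != self.verf:
            return NFS3ERR_BAD_COOKIE, [], cookie, self.verf
        chunk = self.names[cookie:cookie + count]
        return NFS3_OK, chunk, cookie + len(chunk), self.verf


def list_directory(server, count=3, after_chunk=None, max_restarts=5):
    """Read the whole directory, starting over when the cookie goes stale."""
    restarts = 0
    names, cookie, verf = [], 0, 0
    while True:
        status, chunk, next_cookie, new_verf = server.readdir(cookie, verf, count)
        if status == NFS3ERR_BAD_COOKIE:
            if restarts == max_restarts:
                raise OSError("directory kept changing during the listing")
            restarts += 1
            names, cookie, verf = [], 0, 0  # throw away the partial result
            continue
        names += chunk
        cookie, verf = next_cookie, new_verf
        if len(chunk) < count:  # a short chunk marks the end in this model
            return names, restarts
        if after_chunk:
            after_chunk()       # simulate a concurrent change, once
            after_chunk = None


server = ToyDirServer(["a.c", "b.c", "c.c", "d.c", "e.c", "f.c"])
names, restarts = list_directory(server, after_chunk=lambda: server.remove("b.c"))
print(names, restarts)
```

Here the stale cookie is detected and the listing comes out consistent
after one restart. The cost is visible too: a directory that changes
constantly could keep the client restarting, which is why the sketch
gives up after max_restarts attempts.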

I'm working on a solution that is still not perfect, but better than what
I've seen so far. However, this solution has some implications outside
of the NFS code, so it needs some more thought. I'll bring it up on
tech-kern when I've thought it out.

Having said all this: if you didn't remove any files in the compilation
directory itself on the server, it is strange that you were seeing this.
It's even stranger that killing the nfsd (and restarting it, I would
assume ;-)) would solve it. To make sure: you killed all the nfsds
on the server and then you restarted them, and after that things were ok?

- Frank