Subject: NFSv3 cookie jar
To: None <freebsd-hackers@freebsd.org, tech-kern@NetBSD.ORG>
From: Frank van der Linden <frank@wins.uva.nl>
List: tech-kern
Date: 04/09/1997 14:19:13
The following problem has been lying around for too long, so I'd like
to finally solve it. I first stumbled on it over a year ago, when
testing the NFSv3 code integrated into NetBSD against Solaris 2.5, and
had several other run-ins with it since. I saw that FreeBSD also encountered
the problem sometime last month, so I'll send this to freebsd-hackers as well
as tech-kern, to avoid Yet Another Duplicate Effort.


In NFSv2, directory offsets were specified as 32 bits, and they were
real offsets, i.e. they could be interpreted as numbers.

NFSv3 changed a couple of things. First of all, all offsets became 64 bits.
For some reason, directory offsets also became opaque: they can no longer be
interpreted as numbers. In practice this will probably still be possible
most of the time, but you can't really take chances; it's the spec that
says you can't do it after all. Also, NFSv3 introduced cookie verifiers,
opaque entities returned by the server after each directory operation,
to be passed along by the client in subsequent operations. These verifiers
can be used by the server to see whether the directory has been modified
in the meantime. Invalid cookies can be detected that way.
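
To make the verifier idea concrete, here is a minimal sketch of a server-side
check. The constants NFS3_COOKIEVERFSIZE and NFS3ERR_BAD_COOKIE are from the
NFSv3 spec; everything else (struct dir_state, its modrev counter, and both
function names) is made up for illustration and is not taken from any real
server. It shows the strictest possible policy: any change to the directory
invalidates every outstanding cookie.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NFS3_COOKIEVERFSIZE 8       /* size of cookieverf3 in the NFSv3 spec */
#define NFS3_OK             0
#define NFS3ERR_BAD_COOKIE  10003   /* error number from the NFSv3 spec */

/* Hypothetical server-side state: a counter bumped on every dir change. */
struct dir_state {
	uint64_t modrev;
};

/* Derive the verifier handed to the client from the current revision. */
static void
make_verifier(const struct dir_state *d, unsigned char verf[NFS3_COOKIEVERFSIZE])
{
	assert(d != NULL);
	memcpy(verf, &d->modrev, NFS3_COOKIEVERFSIZE);
}

/*
 * Validate a READDIR request. A cookie of 0 means "start of directory",
 * so there is nothing to verify yet. Otherwise the verifier the client
 * sends back must match the one the server would generate now.
 */
static int
check_readdir_cookie(const struct dir_state *d, uint64_t cookie,
    const unsigned char verf[NFS3_COOKIEVERFSIZE])
{
	unsigned char cur[NFS3_COOKIEVERFSIZE];

	if (cookie == 0)
		return NFS3_OK;
	make_verifier(d, cur);
	if (memcmp(cur, verf, NFS3_COOKIEVERFSIZE) != 0)
		return NFS3ERR_BAD_COOKIE;
	return NFS3_OK;
}
```

A server that never checks would simply return NFS3_OK from the second
function regardless of the verifier.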

Now, this all sounds like an improvement. However, this leaves some problems
for people implementing NFSv3. For instance: what are the criteria for the
server to return a 'bad cookie' error, i.e. what constitutes a change in a
directory
such that old offsets are now invalid? More seriously: what to do at the
client side when the server returns a 'bad cookie' error?

Rick Macklem's code uses the filerev field from the vnode attributes to
check whether a directory has been modified. On the other hand, Solaris 2.5
doesn't do any checks at all. Their server always returns a 0 cookie.
This is where the problems start to appear.

The BSD filerev check turned out to be too strict for Solaris 2.5 clients,
or at least: for the user on the Solaris 2.5 client. Solaris has adopted
the policy that a 'bad cookie' error is passed up to getdents() as
an error (EINVAL). I guess it's one way to go. However, programs bail out
because of this at weird moments. Who expects getdents() to return
EINVAL for this reason? I 'fixed' this about a year ago by removing the
filerev check from the BSD server code.

I'm not saying that Solaris' approach of passing on the error to the
userlevel is that bad. After all, what is a good method of recovering?
The BSD code simply re-reads all of the directory blocks until it
hits the right offset again whenever it gets NFSERR_BAD_COOKIE. However,
suppose you have a directory of 3 blocks. You read the first block.
Your offset is now at the end of the first block. You delete all the
files in the first block. You want to read the 2nd block. You get
BAD_COOKIE. So then you start again from the beginning, until you
are at the wanted offset. However, the first block has disappeared now,
so your offset lands you at what was originally the 3rd block. You've
missed the 2nd block entirely.
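
The scenario can be replayed with a toy model. This is only a sketch under
loud assumptions: the "directory" is a flat array, offsets are entry indices,
a block holds two entries, and removing entries compacts the array. None of
these names exist in the BSD code, and a real filesystem may lay out
directories quite differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ENTRIES_PER_BLOCK 2     /* toy block size, in entries */
#define MAX_ENTRIES       16

/* Toy directory: a compact array of names. */
struct toy_dir {
	const char *names[MAX_ENTRIES];
	size_t count;
};

/* Remove the first n entries; the remaining ones slide to the front. */
static void
toy_remove_first(struct toy_dir *d, size_t n)
{
	assert(d != NULL);
	if (n > d->count)
		n = d->count;
	memmove(d->names, d->names + n, (d->count - n) * sizeof d->names[0]);
	d->count -= n;
}

/*
 * Recovery as described above: start over at the beginning and skip
 * whole blocks until the remembered offset is reached. Returns the
 * first name found there, or NULL if that is past the end.
 */
static const char *
toy_reread_to(const struct toy_dir *d, size_t offset)
{
	size_t pos = 0;

	while (pos < offset && pos < d->count)
		pos += ENTRIES_PER_BLOCK;
	return pos < d->count ? d->names[pos] : NULL;
}
```

With entries a b | c d | e f, reading the first block, removing a and b, and
then recovering to the remembered offset of 2 lands on e. Entries c and d are
never returned to the caller.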

The best way to solve this is probably to only use userland code that
doesn't mix create/remove/rename operations with getdirentries/readdir
operations. Things are actually mostly OK for the standard BSD utilities,
because they use fts(3), and this reads in the entire directory before
doing anything. Another way to go would be to have opendir() read in
the entire directory, so that other applications using that interface
would also be safe. Other systems that you may have as client might
still fail, but for them, all that you can do is take out the BAD_COOKIE
check entirely. A problem will be emulated binaries, such as SVR4 binaries,
that will do reads in 1048-byte chunks, mixed with dir operations.
Yet another possibility would be to do read-aheads in the NFS bio
code whenever a directory is read at offset 0, pulling in the whole
dir (within reasonable limits), but that's just a thought.

Another issue is how to deal with the 64 bit cookies in the BSD code.
There's no 64 bit field in the buffer struct to store them. What Rick
did (I assume to minimize the changes to the rest of the kernel) is to
maintain a mapping between offsets and cookies per NFS node, at least for
directories (VDIR). This information is invalidated whenever an NFS dir buffer
is invalidated. The problem with the code implementing this is that
it can't distinguish between EOF and a bad cookie. This can have some
unexpected results: the layer above thinks EOF has been reached when
it was in fact the result of invalidated cookie info, so you end up
missing some files in the directory (a whole block).
I disabled nfs_invaldir to prevent this, letting the server take care
of signalling bad cookies. This has a (much less frequent) effect
of sometimes seeing duplicate files, so it's not a great solution either.
I know that FreeBSD's current code does distinguish between EOF and
a bad cookie, but while this is a fix, it is still prone to the error
of losing a block described above.
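
For illustration, this is roughly what "distinguishing EOF from a bad cookie"
means for a per-node offset-to-cookie map. It is a sketch only, and neither
the NetBSD nor the FreeBSD code: struct cookie_map, the result enum and all
function names are hypothetical, and the map is a small fixed array indexed
by directory block number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAP_SLOTS 8

enum cookie_result { COOKIE_OK, COOKIE_AT_EOF, COOKIE_STALE };

/* cookie[i] is the cookie needed to start reading block i. */
struct cookie_map {
	uint64_t cookie[MAP_SLOTS];
	size_t nvalid;          /* number of valid slots */
	size_t eof_block;       /* first block past EOF, if have_eof */
	int have_eof;
};

/* (Re)initialise; also used to invalidate. Block 0 always uses cookie 0. */
static void
cookie_map_init(struct cookie_map *m)
{
	m->cookie[0] = 0;
	m->nvalid = 1;
	m->have_eof = 0;
	m->eof_block = 0;
}

/* Record what the server returned after reading the last known block. */
static void
cookie_map_after_read(struct cookie_map *m, uint64_t next_cookie, int eof)
{
	if (eof) {
		m->have_eof = 1;
		m->eof_block = m->nvalid;
	} else {
		assert(m->nvalid < MAP_SLOTS);
		m->cookie[m->nvalid++] = next_cookie;
	}
}

/* Three-way answer instead of a single "no cookie" result. */
static enum cookie_result
cookie_map_lookup(const struct cookie_map *m, size_t block, uint64_t *out)
{
	if (m->have_eof && block >= m->eof_block)
		return COOKIE_AT_EOF;   /* server really said EOF */
	if (block < m->nvalid) {
		*out = m->cookie[block];
		return COOKIE_OK;
	}
	return COOKIE_STALE;            /* info was thrown away, not EOF */
}
```

The point is only that the caller can tell COOKIE_AT_EOF from COOKIE_STALE.
What to do on COOKIE_STALE is still the hard part.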

All in all, I've come to the conclusion that patching directory(3) to
always read the whole directory might be the best thing to do. For
emulated binaries, well.. the emulation could try a read-ahead
(for in-kernel emulations that is), but it may be impossible to
get completely right.

Basically, this means:

	1) Get rid of nfs_invaldir and the separate cookie lists.
	2) Be able to store a 64bit quantity in a struct buf.
	   This would either mean an extra field, or make daddr_t
	   64 bits wide.
	3) Pass the offset cookies up unmodified (interface change
	   to VOP_READDIR: u_long * -> off_t *). This change could
	   probably be avoided, but it would be very inconsistent not
	   to do so.
	4) The last argument to getdirentries(2) becomes an off_t *,
	   not a long *, so that the 64bit offset cookies can be
	   used by directory(3) functions.
	5) The kernel will make no attempt to recover from a BAD_COOKIE
	   error, and will just make getdirentries(2) return EINVAL.
	6) opendir(3) will read in all of the directory if it sees
	   that the directory is on NFS. This is currently already
	   done for union directories, so it's a small change. opendir(3)
	   should restart the operation of reading the whole
	   directory if it gets EINVAL (i.e. the directory was modified
	   while it was reading), to make sure a consistent view
	   of the directory is obtained.
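
The read-everything-and-restart idea from point 6 can be sketched at user
level on top of the standard opendir(3)/readdir(3) calls. This is not the
proposed libc change itself, just an illustration of the retry logic.
snapshot_dir(), snapshot_free(), struct dir_snapshot and the max_retries
parameter are names invented here, and it assumes that a BAD_COOKIE shows up
as readdir(3) failing with errno set to EINVAL.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct dir_snapshot {
	char **names;
	size_t count;
};

static void
snapshot_free(struct dir_snapshot *s)
{
	for (size_t i = 0; i < s->count; i++)
		free(s->names[i]);
	free(s->names);
	s->names = NULL;
	s->count = 0;
}

/*
 * Read the whole directory into memory. If a read fails with EINVAL
 * (assumed to mean "directory changed underneath us"), throw the
 * partial result away and start over, up to max_retries extra times.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int
snapshot_dir(const char *path, struct dir_snapshot *out, int max_retries)
{
	assert(out != NULL);
	for (int attempt = 0; attempt <= max_retries; attempt++) {
		struct dir_snapshot s = { NULL, 0 };
		size_t cap = 0;
		int err = 0;
		DIR *dp = opendir(path);

		if (dp == NULL)
			return -1;
		for (;;) {
			struct dirent *de;

			errno = 0;
			de = readdir(dp);
			if (de == NULL) {
				err = errno;    /* 0 means end of directory */
				break;
			}
			if (s.count == cap) {
				size_t ncap = cap ? cap * 2 : 16;
				char **n = realloc(s.names, ncap * sizeof *n);

				if (n == NULL) { err = ENOMEM; break; }
				s.names = n;
				cap = ncap;
			}
			s.names[s.count] = strdup(de->d_name);
			if (s.names[s.count] == NULL) { err = ENOMEM; break; }
			s.count++;
		}
		closedir(dp);
		if (err == 0) {
			*out = s;
			return 0;
		}
		snapshot_free(&s);
		if (err != EINVAL) {
			errno = err;
			return -1;
		}
		/* EINVAL: fall through and retry from the beginning. */
	}
	errno = EINVAL;
	return -1;
}
```

A caller would use snapshot_dir(".", &snap, 3), walk snap.names, and finish
with snapshot_free(&snap). Bounding the retries is a choice made for this
sketch, so that a directory under constant modification cannot loop forever.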

I might have missed some details, so please tell me if I did. If not,
I'd like to do an experimental implementation of this soon and test it;
the changes aren't that big.

- Frank