Subject: Directory cookies, the continuing story
To: None <tech-kern@NetBSD.ORG>
From: Frank van der Linden <frank@wins.uva.nl>
List: tech-kern
Date: 07/30/1997 15:52:24
I've changed the NFS readdir code to fix the 'ghost entry' problem. See
my message to this list on April 9th, titled "NFSv3 cookie jar" for
more detail.(*)

In short, it comes down to doing what the v2 code (pre-lite2) did, with
the following additions to make it work:

	1) A per-NFS-node hash table that translates 64bit cookies to
	   32bit block numbers, for the sake of the buffer cache.
	2) Widen the path through the kernel so that it can handle
	   64bit cookies, which comes down to making the cookies
	   returned by VOP_READDIR off_t, not u_long.
	3) Make the directory(3) routines read in all (most) of
	   an NFS directory in one go, to avoid trouble
	   with changed directories and BAD_COOKIE errors from
	   the server, although most servers will probably
	   never return those. See the message mentioned above
	   for details.
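
The translation in 1) could look roughly like the sketch below. This
is only an illustration, not the committed code: the names
(cookie_tab, cookie_to_blkno, COOKIE_HASH_SIZE) and the policy of
handing out block numbers sequentially are assumptions made for the
example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define COOKIE_HASH_SIZE 64	/* hypothetical; must be a power of two */

/* One translation: a 64bit server cookie and its 32bit block number. */
struct cookie_ent {
	uint64_t ce_cookie;
	uint32_t ce_blkno;
	struct cookie_ent *ce_next;
};

/* Hypothetical per-node table; zero-initialize before first use. */
struct cookie_tab {
	struct cookie_ent *ct_bucket[COOKIE_HASH_SIZE];
	uint32_t ct_nextblk;	/* next unused block number */
};

static unsigned
cookie_hash(uint64_t cookie)
{
	/* Fold the upper 32 bits into the lower so both halves count. */
	return (unsigned)((cookie ^ (cookie >> 32)) & (COOKIE_HASH_SIZE - 1));
}

/*
 * Look up the block number for a cookie, assigning the next free one
 * if the cookie has not been seen before.  Returns 0 on success and
 * stores the result in *blknop, or -1 if memory runs out.
 */
int
cookie_to_blkno(struct cookie_tab *ct, uint64_t cookie, uint32_t *blknop)
{
	struct cookie_ent *ce, **head;

	assert(ct != NULL && blknop != NULL);
	head = &ct->ct_bucket[cookie_hash(cookie)];
	for (ce = *head; ce != NULL; ce = ce->ce_next) {
		if (ce->ce_cookie == cookie) {
			*blknop = ce->ce_blkno;
			return 0;
		}
	}
	ce = malloc(sizeof(*ce));
	if (ce == NULL)
		return -1;
	ce->ce_cookie = cookie;
	ce->ce_blkno = ct->ct_nextblk++;
	ce->ce_next = *head;
	*head = ce;
	*blknop = ce->ce_blkno;
	return 0;
}

/* Release every entry, e.g. when the node's cached blocks are dropped. */
void
cookie_tab_free(struct cookie_tab *ct)
{
	struct cookie_ent *ce, *next;
	int i;

	for (i = 0; i < COOKIE_HASH_SIZE; i++) {
		for (ce = ct->ct_bucket[i]; ce != NULL; ce = next) {
			next = ce->ce_next;
			free(ce);
		}
		ct->ct_bucket[i] = NULL;
	}
	ct->ct_nextblk = 0;
}
```

Looking up the same cookie twice yields the same block number, which
is what the buffer cache needs in order to find a block again.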

There are issues wrt emulations that can only handle 32bit cookies.
They are unlikely to become a problem in practice, as I don't know
of any NFS server that returns cookies that have data in the upper 32 bits.
This is hard to work around, especially if you consider that binaries
running under emulation can read directories in unexpectedly sized
chunks, like one entry at a time (old Linux binaries do this). It can
probably be done right if a completely separate NFS directory block cache
is kept, not using the normal buffer cache, with an additional lookup
mechanism in which you can find _any_ entry using an opaque key (a cookie),
because you can't count on the directory being read in the chunk size that
you would like. This can be quite a bit of overhead both in time and
space for large directories (1000s of files), so I have decided not to
do this, at least for now.
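
To make the 32bit limitation concrete: an emulation layer can only
pass a cookie on if the upper 32 bits are zero. A minimal sketch of
that check follows; the function name cookie_fits_emul32 is made up
for the example, and what the emulation should do when the check
fails is left open here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: narrow a 64bit cookie for an emulated binary
 * that only has room for 32 bits.  Returns 0 and stores the narrowed
 * value in *out, or -1 if the cookie has data in the upper 32 bits
 * and therefore cannot be represented.
 */
int
cookie_fits_emul32(uint64_t cookie, uint32_t *out)
{
	assert(out != NULL);
	if ((cookie >> 32) != 0)
		return -1;
	*out = (uint32_t)cookie;
	return 0;
}
```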

Coming to the userland part: the directory(3) code should be able to
handle 64bit cookies. There are 2 ways to retrieve them. One is the
getdirentries() system call, using its last argument (a u_long *).
The other is lseek(). lseek() already returns a 64bit value, so this method
will work. However, currently, getdirentries() is used, with its (on
most ports) 32bit cookie value. To solve this, I can either:

	1) Always use lseek(), making the last argument to getdirentries()
	   obsolete. A new getdirentries() syscall interface could
	   be created to reflect this (i.e. drop the last argument).
	2) Change the getdirentries() syscall interface to have the
	   last argument be an off_t *, not u_long *.

Changing the getdirentries() interface is not a big problem; the only
place I know of where it is used is libc, in the directory(3) functions.

I discussed it with Charles, and the conclusion was that 1) is as good
as anything. It would be in sync with the getdents() system call on
other systems. The only possible problem I can think of is a multithreaded
environment, where another thread does a getdirentries() call, right
after you've done one, but before you could do the lseek() to retrieve
the cookie information.

I should mention one other way to fix this: add a d_off field to
the dirent structure, which removes the need for cookies to
be returned by the VOP_READDIR operation. struct dirent would no
longer match the on-disk version of FFS in that case, though.
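
A sketch of what the d_off variant could look like. The layout below
(struct xdirent) is hypothetical and only loosely modelled on struct
dirent; the point is that the consumer picks the resume cookie out of
the entries themselves instead of getting a separate cookie array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical dirent with a per-entry cookie; not the FFS layout. */
struct xdirent {
	uint64_t d_off;		/* cookie to resume after this entry */
	uint32_t d_fileno;
	uint16_t d_reclen;	/* total length of this record */
	uint8_t  d_type;
	uint8_t  d_namlen;
	char     d_name[256];
};

/*
 * Walk the records in buf and return the d_off of the last complete
 * entry, i.e. where the next read should resume.  Returns start if
 * buf holds no complete entry.
 */
uint64_t
last_cookie(const char *buf, size_t len, uint64_t start)
{
	const size_t hdr = offsetof(struct xdirent, d_name);
	uint64_t cookie = start;
	size_t pos = 0;

	assert(buf != NULL);
	while (len - pos >= hdr) {
		struct xdirent d;

		/* Copy the fixed header out to avoid unaligned access. */
		memcpy(&d, buf + pos, hdr);
		if (d.d_reclen < hdr || d.d_reclen > len - pos)
			break;		/* truncated or corrupt record */
		cookie = d.d_off;
		pos += d.d_reclen;
	}
	return cookie;
}
```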

I'd like comments on what to do wrt. the getdirentries() call.
What would be the best thing to do: always use lseek(),
or 'upgrade' getdirentries()? About the other issues: please
comment if you have thoughts, but for now I'm pretty much settled
on making these changes, especially since they fix the "ghost
entry" problem before 1.3, and because it has been discussed
before and I only got (positive) comments from Rick.

The separate cache for NFS directory entries can be implemented
later if it is deemed necessary.

- Frank

(*) [There is an error in that message: it states that in v2 the offset
     cookies were not opaque. This is not true; they were supposed to be
     opaque according to the spec. However, you could basically count on
     them actually being offsets, as all servers seemed to implement them
     that way.]