Subject: missing files on NFS clients
To: None <netbsd-users@NetBSD.ORG>
From: Laine Stump <laine@MorningStar.Com>
List: netbsd-users
Date: 09/01/1997 13:45:09
We have 7 PPros, six of them running 1.2.1 and one running 1.2G. We're
using amd (tried both the old amd that comes with 1.2.1 as well as the
new one that comes with 1.2G) to NFS-mount a directory tree containing
about 2600 source files which build into about 4500 files. Our compiles
are done with gmake with the customs patch to allow sharing jobs across
multiple machines.

Our problem is that occasionally one of the client machines "loses" one
or more directory entries, leading a sub-gmake to give messages such as
"entering unknown directory", "getcwd failed()", etc. When I do an ls of
the directory on the server and compare it with an ls from the client,
sure enough, that directory is missing on the client. I can clear this
problem up by catting a bunch of stuff (something much bigger than the
disk cache) to /dev/null on the client; after that, all the entries
magically reappear (only to magically disappear at some later, equally
inconvenient time). It *seems* to happen more often when there is more
load on the machines, but that hasn't been quantified.
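
For anyone who wants to check for the same symptom, the ls comparison
described above can be sketched roughly as follows. This is only an
illustration, not something we actually run: the function name, the
saved-listing file, and the command-line arguments are all made up for
the example. It assumes the server's listing was captured on the server
with something like "ls -1 DIR > listing.txt" and copied to the client
by some means other than the suspect NFS mount.

```python
import os
import sys


def missing_entries(server_names, client_names):
    """Return the names the server has but the client does not, sorted."""
    return sorted(set(server_names) - set(client_names))


def main(argv):
    # Hypothetical usage: script.py SERVER_LISTING_FILE CLIENT_DIRECTORY
    if len(argv) != 3:
        print("usage: %s server_listing client_dir" % argv[0])
        return 2
    # One name per line, as saved from "ls -1" on the server.
    with open(argv[1]) as listing:
        server_names = [line.rstrip("\n") for line in listing if line.strip()]
    # What the NFS client currently believes is in the directory.
    client_names = os.listdir(argv[2])
    lost = missing_entries(server_names, client_names)
    for name in lost:
        print("missing on client: %s" % name)
    return 1 if lost else 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Run from cron or from the top-level makefile, something like this might
help tie the disappearances to load, which is the part we haven't
quantified.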

All our machines are configured with:

     options		"NMBCLUSTERS=1536"
     options		"NBUF=4096"
     options		"BUFPAGES=4096"
     options		DFLDSIZ=67108864	# initial data size limit
     options		EXTMEM_SIZE=130048	# size of extended memory

This gives us about a 16MB disk cache.
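
For reference, the arithmetic behind that figure can be sketched as
below. It assumes BUFPAGES is counted in 4096-byte pages (the i386 page
size) and that EXTMEM_SIZE is given in kilobytes; both are assumptions
stated for the example rather than something checked against the kernel
source, and the helper names are made up.

```python
PAGE_SIZE = 4096        # assumption: i386 page size in bytes
MB = 1024 * 1024

# Values copied from the kernel config options above.
BUFPAGES = 4096
DFLDSIZ = 67108864      # bytes
EXTMEM_SIZE = 130048    # assumption: kilobytes


def buffer_cache_bytes(bufpages, page_size=PAGE_SIZE):
    """Size of the buffer cache in bytes, given a page count."""
    return bufpages * page_size


def to_mb(nbytes):
    """Convert a byte count to whole megabytes (rounding down)."""
    return nbytes // MB


if __name__ == "__main__":
    print("buffer cache: %d MB" % to_mb(buffer_cache_bytes(BUFPAGES)))
    print("default data size limit: %d MB" % to_mb(DFLDSIZ))
    print("extended memory: %d MB" % to_mb(EXTMEM_SIZE * 1024))
```

Under those assumptions the cache works out to 16 MB out of roughly
127 MB of extended memory.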

Has anyone else seen this problem? Could it be caused by our
artificially large cache? Any other suggestions? Is there maybe some
other parameter I need to bump up?

I haven't yet verified whether this is a client-side or a server-side
problem, nor whether upgrading to 1.2G fixes it. These are production
machines, and I'm a bit skittish about upgrading (especially the
servers) until there is an official release.

BTW, I thought the "double directory entry" problem had been fixed in
-current, but I still see it in the 1.2G client (which is using the
August 15 snapshot). What's up with this?