current-users: Re: Bizarre bug in NFS/mmap code

Subject: Re: Bizarre bug in NFS/mmap code
To: None <current-users@NetBSD.ORG>
From: David Jones <dej@eecg.toronto.edu>
List: current-users
Date: 07/25/1995 20:01:55
> 	I have a file NFS mounted from a SunOS 4.1.3u1 fileserver onto various
> 	machines. One of which is a sun4 running NetBSD -current (muon)
> 	It used to be       "abs culf dp myth norm nostra socrates tjfas tugs"
> 	It was changed to   "abs culf dp myth norm nostra tjfas tugs"
> 	'cat' on muon gives "abs culf dp myth norm nostra tjfas tugs"
> 	'vi' on muon gives  "abs culf dp myth norm nostra socrates tj"
> 
> >Fix:
> 	Truely unknown - I guess this is mmap() interacting with nfsiod or
> 	similar. The situation seems stable (which does cause the odd problem
> 	:) I'm totally open to suggestions as to things to try to better
> 	identify this!

It has been my experience that mmap() and NFS generally don't mix.  This is
as true on SunOS as it is for NetBSD.

Last year for a course I wrote an X11-based plotting program for a circuit
simulator.  It had to update the plot while simulation was still in progress.
Since the plot data was written by the simulator as a nice array structure,
I decided to use mmap().  The antics I had to go through to keep the program
from coring out were unbelievable.

The fundamental problem is that virtual memory is in effect a cache: your
core memory caches data from memory objects, usually files and swap space.
However, NFS does not address cache coherency issues at all.

NFS has two caches: an attribute cache which caches information contained
in inodes, and a data cache, which caches the file data itself.  By default,
on SunOS, an attribute cache entry is valid for 5 seconds after which it
must be refreshed.  A data cache entry is valid for 30 seconds, after which
it is re-fetched if examination of the inode reveals that the file has been
updated.  All this is quite well documented in Hal Stern's "Managing
NIS and NFS".  Recommended for serious students of NFS farts.

Beyond that, there is no cache coherency protocol.  This is a simple approach
and it works well for most applications.  Think about your local workgroup.
The stuff that you access via NFS is either binaries, which don't change,
or your own workspace, which you typically manipulate from only one host
at a time.  If you do jump between hosts, it is assumed that accesses to
an object from different hosts are spaced at least 30 seconds apart to allow
all caches to invalidate.

Consider VM and mmap(): the VM system must maintain consistency with the
backing object.  The Mach VM system uses special pagers to do this, and I
think the vnode_pager, which allows mapping to files, does a good job of
keeping VM in sync with local filesystem data.  But it doesn't do so well
with NFS.

The main problem occurs when you access the file while the system still
has stuff cached.  Let's assume you access a file just before it's updated
on another node.  In this situation, your caches are out of date due to
the update.

If a change is made on a remote host before the attribute cache expires
then you won't see the change.  If a change is made after the attribute
cache expires but before the data cache expires, then you won't see any
changes that were made to the fragment of the file present in the data
cache.  If you read past that fragment, you may get new data while keeping
the old data.
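
The three cases above can be written down as a toy model.  This is only a
sketch of the paragraph, not of any real NFS client: the function name, the
enum and the single "seconds since the caches were filled" clock are all
assumptions made for illustration, and the 5 and 30 second values are the
SunOS defaults quoted earlier.

```c
#include <assert.h>
#include <stdbool.h>

#define ATTR_CACHE_TTL 5    /* seconds; SunOS default quoted above */
#define DATA_CACHE_TTL 30   /* seconds; SunOS default quoted above */

enum view { VIEW_OLD, VIEW_NEW };

/*
 * Hypothetical model: both caches were filled at time 0, a remote host
 * changed the file some time after that, and we read at secs_since_fill.
 * in_cached_fragment says whether the bytes we read sit in the data cache.
 */
enum view what_reader_sees(int secs_since_fill, bool in_cached_fragment)
{
    assert(secs_since_fill >= 0);

    /* Attribute cache still valid: the change is invisible. */
    if (secs_since_fill < ATTR_CACHE_TTL)
        return VIEW_OLD;

    /* Attributes refreshed, but the cached fragment is still served. */
    if (in_cached_fragment && secs_since_fill < DATA_CACHE_TTL)
        return VIEW_OLD;

    /* Past the cached fragment, or everything has expired. */
    return VIEW_NEW;
}
```

At 10 seconds the model returns VIEW_OLD for the cached fragment and
VIEW_NEW for anything past it, which is exactly the mixed old/new view
of a single file that causes trouble.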

In my case, the first few words in the plot file contained the number of
data vectors present in the file.  I would read this value, then use it
to limit my read of the data itself.  However, if an update had just been
made, then my access to the beginning of the file would get new data (since
that data had timed out in the data cache).  When I then seeked to the end
and read before the attribute cache was refreshed (the data itself was
fetched right away due to a miss), I'd core out because the VM system
thought I was accessing beyond end-of-file.  I ended up doing a stat(),
figuring out the maximum number of vectors the file could hold, and not
accessing past the minimum of that value and the value written to the
beginning of the file.  To summarize:

- I read up to end-of-file.  The attribute cache is refreshed.
- Remote host writes a new vector to the file.
- I read the low page.  It gets read for real since its data cache entry
  has expired.  The attribute cache entry is still active, so it still
  holds the old file size.
- I read the new data at the end of the file.  Since the attribute cache
  has not expired, the VM system thinks I am past end-of-file and I core out.
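
The stat() workaround described above might look roughly like the sketch
below.  The file layout (a fixed-size header followed by fixed-size vectors)
and every name here are assumptions for illustration; the original program's
format is not shown in this message.  fstat() is used on the open descriptor
in place of stat(), which goes through the same cached attributes.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Largest number of whole vectors that fit in a file of file_size bytes,
 * assuming a header_size-byte header followed by vector_size-byte vectors.
 */
size_t max_vectors_for_size(off_t file_size, size_t header_size,
                            size_t vector_size)
{
    assert(vector_size > 0);
    if (file_size < 0 || (size_t)file_size <= header_size)
        return 0;
    return ((size_t)file_size - header_size) / vector_size;
}

/*
 * Number of vectors that are safe to touch through the mapping: the
 * minimum of what the header claims and what the size reported by
 * fstat() (i.e. the possibly stale attribute cache) says can exist.
 * Returns 0 if the file cannot be examined.
 */
size_t safe_vector_count(int fd, size_t header_size, size_t vector_size,
                         size_t claimed)
{
    struct stat st;
    size_t fit;

    if (fstat(fd, &st) != 0)
        return 0;
    fit = max_vectors_for_size(st.st_size, header_size, vector_size);
    return claimed < fit ? claimed : fit;
}
```

The point of taking the minimum is that the header and the file size can
come from different moments in time; whichever is smaller is the only count
both views of the file agree on.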

So what's a poor user/programmer to do?  First of all, be aware of these
NFS problems.  When designing an application, consider which data will be
NFS mounted (i.e. shareable), and which will not.  Limit operations that
will fail due to weak NFS semantics to those files which you are sure are
not being shared.  In the specific case above, perhaps vi should not map
files, if that's indeed what it does.  In general, though, these problems
are probably not worth giving up the efficiency of VM for.  We just have
to be aware of them and be careful.
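
For files that may be shared over NFS, one alternative to mapping is to copy
the data in with ordinary reads.  The sketch below is an assumption about how
that could be done, not a description of what vi or any other program does;
read_fully_at is a hypothetical helper name.  It does not make the data any
fresher, but a read past end-of-file comes back as a short count that the
program can check, where touching a mapped page past end-of-file kills the
process.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read up to len bytes starting at offset off into buf.
 * Returns the number of bytes actually read, which is less than len
 * if end-of-file is reached first, or -1 on error.
 */
ssize_t read_fully_at(int fd, void *buf, size_t len, off_t off)
{
    char *p = buf;
    size_t done = 0;

    assert(buf != NULL || len == 0);
    while (done < len) {
        ssize_t n = pread(fd, p + done, len - done, off + (off_t)done);
        if (n == 0)
            break;              /* end-of-file: report a short read */
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal, retry */
            return -1;
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

A caller would compare the return value against the length it expected and
treat a short read as "the file is not that long yet" and try again later.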

-- 
David Jones, M.A.Sc student, Electronics Group (VLSI), University of Toronto
           email: dej@eecg.toronto.edu, finger for PGP public key
         For a good time, telnet torfree.net and log in as `guest'.