netbsd-users: Re: Recoverable Network File System?

Subject: Re: Recoverable Network File System?
To: Sean J. Schluntz <schluntz@workofstone.com>
From: Greg Troxel <gdt@ir.bbn.com>
List: netbsd-users
Date: 12/12/2003 07:41:03

I have been running coda for around 5 years, and mostly winning,
although I wouldn't call it stable for production.  But I am a
particularly abusive user of coda, since I do all of the following at
once:

1) Operation over a 28.8 kbit/s line, so that venus (the client cache)
   is usually in write-disconnected mode (essentially write-behind
   caching: modifications are logged locally and trickle-reintegrated
   after a hold time of 30s or so)

2) All coda traffic uses transport-mode IPsec ESP.  This hasn't caused
   that much trouble recently; coda's port usage plan has gotten
   simpler over the years, and my usage has shaken out a few bugs
   where stuff was sent on the wrong port occasionally.

3) Use of the 'hoard' feature, which walks the cache to ensure that
   all files in a defined set of directories are in-cache and up to
   date, so that when you lose connectivity you can still use the
   files.

4) Use of cfs (the Cryptographic File System, not coda's 'cfs'
   command), with ciphertext in coda.  This makes repair hard, since
   conflicts have to be repaired in ciphertext, which is hard to follow.

5) Writing lots of data while in write-disconnected mode.
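
To make item 1 concrete, here is a toy Python sketch of write-behind
logging with a hold time.  It is only an illustration of the idea, not
venus's actual data structures: the ModifyLog class, its method names,
and the rule that a newer store replaces a pending one for the same
path are all assumptions of mine.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

HOLD_TIME_S = 30.0  # assumed hold time, taken from the "30s or so" above


@dataclass
class ModifyLog:
    """Toy client-side log of pending writes (hypothetical, not Coda's own)."""
    hold_time: float = HOLD_TIME_S
    _entries: List[Tuple[float, str, bytes]] = field(default_factory=list)

    def record(self, now: float, path: str, data: bytes) -> None:
        # A newer store to the same path replaces the pending one, so only
        # the latest version is ever sent over the slow link.
        self._entries = [e for e in self._entries if e[1] != path]
        self._entries.append((now, path, data))

    def due(self, now: float) -> List[Tuple[str, bytes]]:
        """Remove and return the entries that have aged at least hold_time."""
        ready = [e for e in self._entries if now - e[0] >= self.hold_time]
        self._entries = [e for e in self._entries if now - e[0] < self.hold_time]
        return [(path, data) for _, path, data in ready]

    def pending(self) -> int:
        """Number of modifications still waiting out their hold time."""
        return len(self._entries)


if __name__ == "__main__":
    log = ModifyLog()
    log.record(0.0, "notes.txt", b"v1")
    log.record(10.0, "notes.txt", b"v2")  # replaces v1, restarts its clock
    log.record(12.0, "todo.txt", b"x")
    print(log.due(35.0))  # both entries are still younger than 30s
    print(log.due(41.0))  # notes.txt is now 31s old, todo.txt only 29s
```

The hold time is what lets a burst of edits to one file collapse into a
single transfer, which matters a great deal on a slow line.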

Despite all this, I have almost never lost any data.


My problems have fallen into three classes:

1) Kernel bugs where vnode refcounting is wrong and leads to panics.
   I think these are all fixed now, or at least the ones I run into.

2) Repair/reintegration bugs where venus thinks there is a conflict
   and there isn't.  But if you are running with a high-speed,
   reliable network, and do 'cfs strong', you should avoid
   write-disconnected mode almost all the time.

3) Limits in coda, e.g. directory size.  A mirror of the entire
   internet-drafts directory ends up with a directory size greater
   than 256k or something, and coda chokes ungracefully on this.
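
For item 3, one way to spot trouble before copying a tree into coda is
to estimate each directory's size up front.  The Python sketch below is
a rough guess only: the 256k limit is the approximate figure above, and
the per-entry overhead and the "name length plus overhead" formula are
hypothetical, not coda's real on-disk directory format.

```python
import os
from typing import Iterator, Tuple

LIMIT_BYTES = 256 * 1024   # rough figure from the text, not an exact constant
PER_ENTRY_OVERHEAD = 16    # hypothetical bookkeeping bytes per directory entry


def estimated_dir_bytes(path: str) -> int:
    """Crude estimate: encoded name length plus a fixed overhead per entry."""
    with os.scandir(path) as entries:
        return sum(len(os.fsencode(e.name)) + PER_ENTRY_OVERHEAD for e in entries)


def oversized_dirs(root: str, limit: int = LIMIT_BYTES) -> Iterator[Tuple[str, int]]:
    """Yield (directory, estimate) for each directory under root over the limit."""
    for dirpath, _dirnames, _filenames in os.walk(root):
        size = estimated_dir_bytes(dirpath)
        if size > limit:
            yield dirpath, size
```

Running oversized_dirs() over a local mirror and splitting anything it
reports into subdirectories should keep each directory well under the
limit, assuming the estimate is in the right ballpark.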

I have essentially zero trouble these days with the machines that are
on the same Ethernet as the server, and also no trouble with a machine
at MIT that's 9 hops away but only 3ms.

The coda security approach has two parts: acls, which are afs-like and
quite sane, and transport security/authentication, which as
implemented is completely bogus (due to previous export control
rules).  So I run over IPsec.

So yes, this is another 'mostly works' review, but if you don't push
it with disconnected operation or huge directories, I think it will
work.  And these days disconnected operation usually works; I just had
a venus repair wedge this week, but I don't remember when the one
before that happened.  If all needed bits are on the server already,
one just does
'venus -init' to start the client over with an empty cache.

-- 
        Greg Troxel <gdt@ir.bbn.com>