Subject: Re: Redoing file system suspension API (update)
To: Bill Studenmund <>
From: Juergen Hannken-Illjes <>
List: tech-kern
Date: 06/27/2006 12:52:57
On Mon, Jun 26, 2006 at 02:31:44PM -0700, Bill Studenmund wrote:
> On Mon, Jun 26, 2006 at 08:30:20PM +0200, Juergen Hannken-Illjes wrote:
> > On Mon, Jun 26, 2006 at 09:43:59AM -0700, Bill Studenmund wrote:
> > > On Sat, Jun 24, 2006 at 11:05:04AM +0200, Juergen Hannken-Illjes wrote:
> > > > On Fri, Jun 23, 2006 at 05:14:00PM -0700, Bill Studenmund wrote:
> > > > > 
> > > > > Especially if we go with the idea of having the file system grab and 
> > > > > release the mountpoint-transaction-lock (the current incarnation of the 
> > > > > "gate" you spoke of originally, the thing that makes us atomic w.r.t. a 
> > > > > snapshot), then it's not a problem. We just have the read and write code 
> > > > > in fifofs not take the transaction lock, and we're fine.
> > > 
> > > I'm sorry, but this is an important point. I have the feeling it was 
> > > missed.
> > 
> > Not sure I get it right: you mean taking the transaction lock for
> > read/write/ioctl in every file system while taking it for other operations
> > outside?
> > 
> > Looks difficult to maintain.
> How is it difficult to maintain?

We have to do it for all operations of all file systems.  And we need
thread-recursive locks as file systems call operations on other file systems.
Once an operation has the lock we cannot deny the lock to other operations
called from here.  Take unionfs's `copy-up' as an example.
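
To illustrate the recursion requirement, here is a minimal userland sketch
using pthreads. It is not existing kernel code: `struct fs_gate`,
`txn_enter()` and `txn_exit()` are hypothetical names, and a single
per-thread depth counter assumes only one gate (one mount) is involved. A
nested call from a thread that is already inside a transaction returns at
once, even when a suspension is pending. The suspender side is left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical per-mount transaction gate (illustration only). */
struct fs_gate {
	pthread_mutex_t mtx;
	pthread_cond_t  cv;
	unsigned        active;     /* threads currently inside a transaction */
	bool            suspending; /* set by a suspender (not shown here) */
};

#define FS_GATE_INITIALIZER \
	{ .mtx = PTHREAD_MUTEX_INITIALIZER, .cv = PTHREAD_COND_INITIALIZER }

/* Nesting depth of the calling thread; assumes a single gate. */
static _Thread_local unsigned txn_depth;

void
txn_enter(struct fs_gate *g)
{
	/* Nested entry: this thread already holds the gate, never block. */
	if (txn_depth++ > 0)
		return;

	pthread_mutex_lock(&g->mtx);
	while (g->suspending)
		pthread_cond_wait(&g->cv, &g->mtx);
	g->active++;
	pthread_mutex_unlock(&g->mtx);
}

void
txn_exit(struct fs_gate *g)
{
	assert(txn_depth > 0);
	/* Only the outermost exit releases the gate. */
	if (--txn_depth > 0)
		return;

	pthread_mutex_lock(&g->mtx);
	if (--g->active == 0)
		pthread_cond_broadcast(&g->cv); /* wake a waiting suspender */
	pthread_mutex_unlock(&g->mtx);
}
```

With something like this, an operation such as a copy-up could call
txn_enter() again from inside its own transaction without blocking behind a
suspender that is waiting for `active` to drop to zero.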

And I'm not sure it can be made free of deadlocks if the lock is taken inside
the file system, where the vnodes are already locked.

> The idea is that we only use transaction locks above the file system if we
> have a real transaction.
> > > > > Such a change would limit our exposure to cases where someone is trying to 
> > > > > make a transaction that involves writing to or reading from a fifo. If you 
> > > > > try to do that, you get what you get.
> > > > 
> > > > Both specfs and fifofs need special care because they are not real file systems.
> > > > Their vnodes live in real file systems that may update meta data before or
> > > > after they call operations on specfs/fifofs.  These updates need the transaction
> > > > lock.  The real operations (as long as they don't go to disk devices) cannot
> > > > keep this lock because they may sleep forever waiting for data.
> > > 
> > > How many transactions, other than an actual write or read, will write or 
> > > read a fifo?
> > 
> > It's not only fifos, it is also non-disk VCHR and VBLK devices.  This information
> > is currently hidden outside VFS.
> > Unlocking/relocking in specfs/fifofs would be the same as it is now.  Currently
> > there is already a VOP_UNLOCK/VOP_LOCK in specfs/fifofs.
> vn_lock() isn't a transaction lock, we use it as an atomicity lock. So as 
> long as you don't unlock/relock in the middle of your atomic operation. 
> i.e. you unlock/lock either before or (if you're weird) after the "read" 
> or the "write", you're fine. I'm assuming POSIX atomicity here.

Do we really need more than this kind of atomicity?

> > > Thus if we move the transaction/snapshot logging into the read or write
> > > call, we have fifofs and specfs skip that step, and we're fine.
> > 
> > See above.  I think it is easier to maintain if we try to keep the transaction
> > lock completely outside of file systems.
> Then we need buckets of them. If I understood your earlier discussions, we 
> then need transaction locking around every caller into the VFS/VOP layer. 
> That seems messier to maintain.

Buckets is not the right measure.  I suppose the number of lock pairs is
roughly the same for both ways.  Getting too much above VFS means we need
more vn_xxx helper functions...

> > > What do other OSs do?
> > 
> > No OS I know of has something like this.  Do you have a special OS in mind?
> Yes, they don't have this. But other OSs handle snapshotting. How do they 
> handle the suspension? Do they bother? If so, how do they do it? If they 
> don't, why do we have a problem and they don't?
> We're adding a new locking hierarchy. I think we should look at prior art 
> before we go too far. If we need the new hierarchy, we will do it. But 
> let's make sure we didn't overlook a cool idea somewhere else first.

FreeBSD has what we have now -- no surprise, we took ours from FreeBSD.  The
difference is FreeBSD has snapshots only for ffs file systems.

Solaris has snapshots for ufs file systems.  If I remember right they got
file system locks before snapshots.  It has different lock levels:

	unlock / name lock / write lock / delete lock / hard lock / error lock

From a quick tour through OpenSolaris it uses
ufs_lockfs_begin()/ufs_lockfs_end() calls in most operations where
every ufs_lockfs_begin() has a mask describing the lock levels it should
wait on.  It takes care of recursive operations (state is stored in a linked
list of the thread struct), waits or errors if needed.  Silence is acquired by
an operations counter becoming zero.
Looks very close to your approach.  But Solaris has no vnode locks.
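
As a rough illustration of the scheme just described, here is a small
userland sketch with pthreads. It is not the OpenSolaris code: the
`lockfs_*` names, the `LK_*` bits and the `nowait` flag are made up for
this example, and a single operations counter is a simplification (setting
a level waits for all running operations, not only the conflicting ones).

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical lock level bits, loosely following the list above. */
enum {
	LK_NAME   = 1u << 0,
	LK_WRITE  = 1u << 1,
	LK_DELETE = 1u << 2,
	LK_HARD   = 1u << 3
};

struct lockfs {
	pthread_mutex_t mtx;
	pthread_cond_t  cv;
	unsigned        level; /* lock level bits currently set */
	unsigned long   ops;   /* operations in flight */
};

#define LOCKFS_INITIALIZER \
	{ .mtx = PTHREAD_MUTEX_INITIALIZER, .cv = PTHREAD_COND_INITIALIZER }

/* Start an operation; 'mask' names the levels it has to wait on. */
int
lockfs_begin(struct lockfs *lf, unsigned mask, bool nowait)
{
	int error = 0;

	pthread_mutex_lock(&lf->mtx);
	while ((lf->level & mask) != 0) {
		if (nowait) {
			error = EAGAIN; /* caller asked for an error, not a wait */
			break;
		}
		pthread_cond_wait(&lf->cv, &lf->mtx);
	}
	if (error == 0)
		lf->ops++;
	pthread_mutex_unlock(&lf->mtx);
	return error;
}

/* Finish an operation; the last one out wakes a waiting locker. */
void
lockfs_end(struct lockfs *lf)
{
	pthread_mutex_lock(&lf->mtx);
	assert(lf->ops > 0);
	if (--lf->ops == 0)
		pthread_cond_broadcast(&lf->cv);
	pthread_mutex_unlock(&lf->mtx);
}

/* Set a lock level, then wait for silence (counter reaches zero). */
void
lockfs_set(struct lockfs *lf, unsigned level)
{
	pthread_mutex_lock(&lf->mtx);
	lf->level = level;
	while (lf->ops != 0)
		pthread_cond_wait(&lf->cv, &lf->mtx);
	pthread_mutex_unlock(&lf->mtx);
}

/* Drop all lock levels and release blocked operations. */
void
lockfs_clear(struct lockfs *lf)
{
	pthread_mutex_lock(&lf->mtx);
	lf->level = 0;
	pthread_cond_broadcast(&lf->cv);
	pthread_mutex_unlock(&lf->mtx);
}
```

A write-type operation would then be bracketed by
lockfs_begin(lf, LK_WRITE | LK_HARD, false) and lockfs_end(lf). Recursion
handling, which Solaris keeps per thread, is left out of this sketch.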

Juergen Hannken-Illjes - - TU Braunschweig (Germany)