Subject: Re: Proposal: File system suspension - prerequisite for snapshots
To: Stephan Uphoff <ups@stups.com>
From: Bill Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 08/15/2003 13:27:43

On Fri, 15 Aug 2003, Stephan Uphoff wrote:

>
> Bill Studenmund wrote:
> > On Wed, 13 Aug 2003, Stephan Uphoff wrote:
> >
> > > Wouldn't this be the ideal time to convert the VFS layer to file system
> > > internal locking ;-) ?
> >
> > Uhm, why?
>
> To improve locking granularity for more concurrency - especially for
> directory operations. Since fine grained locking requires file system
> specific knowledge I believe that locking should be file system
> internal.

Well, I disagree. I've worked both with leaf file systems and a lot with
layered file systems. Pushing the locking to the outside (as we do with
vn_lock()/VOP_LOCK() around calls) makes layering much easier. Actually,
it makes life MUCH EASIER; yes, that's a shout :-). It also keeps life
easier in general as we will not end up in too many complicated messes in
the fs.
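
A minimal user-space sketch of why caller-held locking keeps layers simple.
This is an illustration only, not the kernel code: pthread mutexes stand in
for vnode locks, and every name here (vnode_sketch, lockp, leaf_init,
layer_init, op_getsize, call_getsize) is hypothetical. The assumption is
that a layer exports the lock of the vnode below it, so one lock taken by
the caller covers the whole stack and the operation itself never locks.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a vnode in a stack of layers. */
struct vnode_sketch {
    pthread_mutex_t      own_lock; /* used only by a leaf */
    pthread_mutex_t     *lockp;    /* the lock callers take */
    struct vnode_sketch *lower;    /* NULL for a leaf */
    int                  size;     /* some per-file state */
};

void leaf_init(struct vnode_sketch *vp, int size)
{
    pthread_mutex_init(&vp->own_lock, NULL);
    vp->lockp = &vp->own_lock;
    vp->lower = NULL;
    vp->size = size;
}

/* A layer shares the lock of whatever it sits on. */
void layer_init(struct vnode_sketch *vp, struct vnode_sketch *lower)
{
    assert(lower != NULL);
    pthread_mutex_init(&vp->own_lock, NULL); /* unused, kept initialized */
    vp->lockp = lower->lockp;
    vp->lower = lower;
    vp->size = 0;
}

/* The operation: no locking inside, in the leaf or in any layer.
 * The caller must already hold *vp->lockp. */
int op_getsize(struct vnode_sketch *vp)
{
    while (vp->lower != NULL)
        vp = vp->lower;            /* pure pass-through */
    return vp->size;
}

/* The caller brackets the operation with the lock. */
int call_getsize(struct vnode_sketch *vp)
{
    int r;

    pthread_mutex_lock(vp->lockp);
    r = op_getsize(vp);
    pthread_mutex_unlock(vp->lockp);
    return r;
}
```

With the locking pushed inside the operation instead, each layer would
have to know which locks the layer below takes and in what order, which is
the kind of mess referred to above.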

> Why block a thread from looking up file "foo" in a directory just
> because another thread is currently creating file "bar" in the same
> directory ?

How do you know that the first thread's looking up "foo" and not "bar"?
The whole point is to keep things atomic, and what we do does it in a
rather direct manner. Yes, we could do other tricks, but it's a lot of
work for something I think is an uncommon case.

As for increasing concurrency, there are a number of things we can do. One
obvious one is to add complexity to the locking protocols and make use of
shared locks. In general, all steps of path lookup except for the last can
just use LK_SHARED locks. And we can have VOP_READ routines downgrade to a
range lock so that they don't have to hold a lock on the whole file for
the entirety of the read.

> > > However I think that it is possible even with the current VFS locking
> > > style to put the suspend functionality inside a file system (gating
> > > below the VFS_ level).
> >
> > I think I prefer the explicit calls. If we stall all locks, then ALL file
> > system access (not just writes) will stop. Yes, I realize that atime
> > updating can be a "write", but there are lots of other places where we
> > will want to still be able to read from a filesystem while we're
> > snapshotting. Consider taking a snapshot of the file system that contains
> > the snapshot program, for instance. :-)
>
> I am not sure that we are thinking about the same thing. For me
> snapshots are file system internal operations that snap a point in
> time picture of the state of the filesystem. These snapshots are later
> exported (as a read-only file system, directory trees, ...) to the OS.
>
> Freezing out external access just makes it easier for the filesystem to
> take a sharp picture.

The idea is that we have to stop all operations. Inside or outside the fs
doesn't matter.

We have, say, X paths that call into Y VOP calls (50 calls at present). We
have N filesystems (about 12 that snapshotting makes sense for). Which is
bigger, X or Y*N? I expect Y*N is. That's why we do it outside of the file
system, not in it.

> Not sure where the snapshot program comes in except as a user interface
> to trigger the actual snapshot.

Depends on how the snapshotting works. If it's not a single syscall that
does it all, then the snapshot program has to stall things, do the
snapshotting, then resume things. Since we're also talking about using
snapshotting as a way to facilitate suspending a system, chances are we
will be seeing the stall/do stuff/resume case.

We don't want to deadlock the system by not being able to page programs in
during the suspension. So we need to have a way of saying, "Even though
you are suspended, this operation needs to still happen." With the in-fs
approach, we now need to pass in a flag. With the outside-fs approach it's
implicit: any operation that has been started on a suspended fs is one
that needs to complete.
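
A minimal user-space sketch of the outside-fs gate described above. It is
an illustration under assumptions, not the proposed kernel API: all names
(fs_gate, gate_try_enter, gate_exit, gate_suspend, gate_resume, fs_op,
gated_write, pagein) are hypothetical, and a refused entry returns -1
where a real implementation would presumably sleep until resume. The point
it shows is that the fs operation carries no flag and no check; whatever
reaches it simply runs.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical gate kept outside the file system. */
struct fs_gate {
    pthread_mutex_t mtx;
    pthread_cond_t  cv;
    int suspended;  /* nonzero: no new gated operation may start */
    int active;     /* gated operations started and not yet finished */
};

void gate_init(struct fs_gate *g)
{
    pthread_mutex_init(&g->mtx, NULL);
    pthread_cond_init(&g->cv, NULL);
    g->suspended = 0;
    g->active = 0;
}

/* Returns 0 if the operation may proceed, -1 if the fs is suspended. */
int gate_try_enter(struct fs_gate *g)
{
    int ok;

    pthread_mutex_lock(&g->mtx);
    ok = !g->suspended;
    if (ok)
        g->active++;
    pthread_mutex_unlock(&g->mtx);
    return ok ? 0 : -1;
}

void gate_exit(struct fs_gate *g)
{
    pthread_mutex_lock(&g->mtx);
    assert(g->active > 0);
    if (--g->active == 0)
        pthread_cond_broadcast(&g->cv);  /* wake a waiting suspender */
    pthread_mutex_unlock(&g->mtx);
}

/* Stop new gated operations, then wait for the running ones to drain. */
void gate_suspend(struct fs_gate *g)
{
    pthread_mutex_lock(&g->mtx);
    g->suspended = 1;
    while (g->active > 0)
        pthread_cond_wait(&g->cv, &g->mtx);
    pthread_mutex_unlock(&g->mtx);
}

void gate_resume(struct fs_gate *g)
{
    pthread_mutex_lock(&g->mtx);
    g->suspended = 0;
    pthread_mutex_unlock(&g->mtx);
}

/* The file system operation itself: no suspend flag, no suspend check. */
int fs_op(int *counter)
{
    (*counter)++;
    return 0;
}

/* An ordinary path (e.g. a write syscall) goes through the gate. */
int gated_write(struct fs_gate *g, int *counter)
{
    int r;

    if (gate_try_enter(g) != 0)
        return -1;
    r = fs_op(counter);
    gate_exit(g);
    return r;
}

/* A path that must keep working while suspended skips the gate. */
int pagein(int *counter)
{
    return fs_op(counter);
}
```

With the gate inside the fs, fs_op() would need an extra argument saying
"proceed even though suspended", and every file system would have to honor
it.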

> > From having worked with layered file systems, life is much easier if we do
> > the locking outside of the VOP calls.
>
> Who wants easy ? - Think of it as job security ;-)
> I agree stacking is a little bit harder but still doable.

Locking semantics interactions are enough of a mess. They don't need to
get messier.

> ( As far as I know fist (File System Translator) comes with a Solaris
> template)

The one option I would entertain about radically changing our locking
semantics would be to rip off the Solaris locking behaviors (which I
understand are standard SysV). Mainly since they've worked through most
every corner case. :-) But that would be a far-reaching change.

> It is also easier to write a snapshot layer file system.
>
> -----
>
> On second thought I don't even know why a file system would need to make sure
> that no thread is holding a vnode lock.
>
> I really don't see the need to stop read but not write operations.

Depends on whether the read is supposed to update atime, which is a
metadata write. In that case, it needs to stop the read.

> And I fail to see why a vfs_write_suspend API is a prerequisite for file
> system snapshots.

It isn't. It is a prerequisite for one particular approach to file system
snapshots. The approach in question is the one FreeBSD used, so by using
it we maintain some BSD coherence, and we get to learn from their
mistakes.

> A vfs_suspend API would however be nice to have for Laptop suspension.
> ( Especially for network file systems )

Yep, see above.

Take care,

Bill