Subject: RE: Redoing file system suspension API (update)
To: Bill Studenmund <>
From: Gordon Waidhofer <>
List: tech-kern
Date: 06/22/2006 01:05:09
> > 2) Its implementation adds a layer above file systems AND a layer inside
> >    file systems.
> Given our current vnode locking protocol, I'm not sure that this is
> really an issue. While it'd be nice to not need upper-level calls (and
> for some things like VOP_WRITE() calls, we probably should just have
> the fs do it), there are places where upper-level code wants to perform
> an atomic sequence of actions. Having that upper level need to
> explicitly lock seems fine to me.

It isn't.

A sequence like

	VOP_X(vp, ...);
	VOP_Y(vp, ...);
	VOP_Z(vp, ...);

Should be

	VOP_XYZ(vp, ...);

Because VOP_LOCK() and VOP_UNLOCK() are entirely meaningless in
some contexts. NFS is an easy example. You can't get atomic sequences
out of NFS, but there's at least a chance of an NFS client implementation
getting VOP_XYZ() right (or as much as possible).
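
To make the shape concrete, here is a toy sketch of what I mean. Every
name in it (toy_vnode, toy_VOP_XYZ, localfs_xyz, remotefs_xyz) is made
up for illustration and is not an existing kernel interface. The point
is only that the caller issues one call and each file system decides
for itself how to make the sequence as atomic as it can.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model; none of these names are real NetBSD interfaces. */
struct toy_vnode;

struct toy_vnodeops {
	/* One entry point for the whole sequence X, Y, Z. */
	int (*vop_xyz)(struct toy_vnode *vp, int arg);
};

struct toy_vnode {
	const struct toy_vnodeops *ops;
	int locked;	/* private to the local fs, never seen above VOP */
	int value;	/* stands in for file state */
	int requests;	/* private to the remote fs: requests sent */
};

/* Local fs: takes its own internal lock around the three steps. */
static int
localfs_xyz(struct toy_vnode *vp, int arg)
{
	vp->locked = 1;
	vp->value += arg;	/* step X */
	vp->value *= 2;		/* step Y */
	vp->value -= 1;		/* step Z */
	vp->locked = 0;
	return 0;
}

/*
 * Remote fs: a client-side lock means nothing to the server, so the
 * best it can do is hand the whole sequence over in one request
 * (an assumption of this sketch, simulated here in place).
 */
static int
remotefs_xyz(struct toy_vnode *vp, int arg)
{
	vp->requests++;
	vp->value = (vp->value + arg) * 2 - 1;	/* X, Y, Z applied remotely */
	return 0;
}

static const struct toy_vnodeops localfs_ops = { localfs_xyz };
static const struct toy_vnodeops remotefs_ops = { remotefs_xyz };

/* Upper-level code: no lock and no gate, just the call. */
static int
toy_VOP_XYZ(struct toy_vnode *vp, int arg)
{
	return vp->ops->vop_xyz(vp, arg);
}
```

The caller never touches the "locked" field. Whether a lock exists at
all is the file system's business.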

There should be no "gate" or "lock" above the VOP() call. At most
a vn_hold()/vn_rele(). I know I sound like a one-note piano on this
point. But I'm a happy one-note piano :)

> Snapshotting really needs the whole OS's help to get right. Applications
> and everything on down have to help, so that you get a good snapshot. So
> having things above and below the VFS/VOP layer help out seems quite
> appropriate.

There's truth in this. If I'm copying a huge file and a snapshot is
taken, then any backups made via that snapshot reflect the incomplete
copy. Snapshots make backups of a live file system achievable, but making
truly robust backups requires a lot of coordination above the kernel
(application layers).
Database managers are a more relevant example.

There really isn't much the kernel -- specifically between the syscall()
and VOP() layers -- can do to mitigate this requirement.

Snapshots really should be handled below the VOP() layer. The VOP()
model should be idealized rather than exposing every nuance.

> As an aside, if you have a journaling file system and want to make
> sure there is space in the journal before you start an operation, you
> need checkpoint calls in exactly the same places where you have the
> vn_start_write() and vn_stop_write() calls. So the idea of a call or
> calls that say, "I'm about to start a sequence of operations, please
> make sure everything's ok" is a general one, and something we need.

Empirically unsupportable. It is a fun issue, though. The problem
is there has to be a model above the VOP() layer for restarting operations.

This is hard to explain and to grok, actually. Here's a little try.
"I'm about to start a sequence requiring X resources." Define X.
Define X in a file system generic way. Waaaaaaay hard, and no matter
what it will be wrong at some point. Instead, something like setjmp()
is required, as in:

	while (begin_xa(xa)) {

where any step (x(), y(), z(), end()) can abandon a change due to
resource shortages (like journals) and retry later when there are
adequate resources by longjmp()ing to begin_xa().
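
For the curious, a user-space toy of that restart model might look
roughly like the following. All of the names (struct xa, xa_init,
xa_step, xa_run_xyz) and the "journal space" bookkeeping are invented
for this sketch. A begin_xa() usable in a while() condition would have
to be a macro wrapping setjmp(), since setjmp() must be called in the
frame that is still live at longjmp() time, so the sketch spells the
setjmp() out directly instead.

```c
#include <assert.h>
#include <setjmp.h>

/* Hypothetical transaction context; not an existing kernel structure. */
struct xa {
	jmp_buf restart;	/* where an abandoned attempt resumes */
	int attempts;		/* how many times the sequence was started */
	int journal_space;	/* toy stand-in for a scarce resource */
	int journal_capacity;	/* space available once the journal drains */
	int saved;		/* state to roll back to */
	int value;		/* state the steps modify */
};

static void
xa_init(struct xa *xa, int space, int capacity, int value)
{
	xa->attempts = 0;
	xa->journal_space = space;
	xa->journal_capacity = capacity;
	xa->saved = value;
	xa->value = value;
}

/* Any step may abandon the whole attempt when resources run short. */
static void
xa_step(struct xa *xa, int cost)
{
	if (xa->journal_space < cost)
		longjmp(xa->restart, 1);
	xa->journal_space -= cost;
	xa->value++;
}

/* Run steps x, y, z, restarting from the top after a shortage. */
static int
xa_run_xyz(struct xa *xa)
{
	if (setjmp(xa->restart) != 0) {
		/* Abandoned: undo the partial change, wait for resources. */
		xa->value = xa->saved;
		xa->journal_space = xa->journal_capacity; /* toy: journal drained */
	}
	xa->attempts++;
	xa_step(xa, 1);		/* x() */
	xa_step(xa, 1);		/* y() */
	xa_step(xa, 1);		/* z() */
	xa->saved = xa->value;	/* end(): the change is now permanent */
	return xa->value;
}
```

With two units of journal space and a capacity of three, the first
attempt gets through x() and y(), is abandoned in z(), gets rolled
back, and the second attempt completes. Note that the caller has to be
written to tolerate being restarted from the top, and that is exactly
the model the layers above VOP() do not have today.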

This is a **HUGE** change to the VOP() model of file manipulation.
And even if it were done, it still wouldn't work over NFS.

Transactional support in the file model is a fun thinkertoy.
Best to stay away from it.

> Take care,
> Bill

It's been a good discussion and fun to watch. It did
remind me that there was talk at one time of trying to
eliminate vn_lock() and striving for a more Solaris-like
VFS/VOP layer. Did that idea die?