tech-kern: Re: split LFS vnode locks

Subject: Re: split LFS vnode locks
To: Bill Studenmund <wrstuden@netbsd.org>
From: Chuck Silvers <chuq@chuq.com>
List: tech-kern
Date: 12/18/2002 20:36:51

On Mon, Dec 16, 2002 at 12:22:38PM -0800, Bill Studenmund wrote:
> > > You CAN NOT sleep with a vnode lock held.
> 
> Ok, you CAN NOT sleep "for long" with a vnode lock held.
> 
> Since the code in question here is waiting for the cleaner to run, I think
> it still counts as, "for long," and so sleeping w/ the lock held will be
> problematic. We might get away with it, we might not.

an interface specification which includes such notions as "don't hold this
lock for long" is just asking for trouble.  :-)

strictly speaking, this is a lock-ordering problem.  as long as resources are
acquired in a consistent order, we will avoid deadlock.  though that won't
necessarily help in avoiding the undesirable train-wreck behaviour.
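
to make the lock-ordering point concrete, here is a minimal user-space sketch
using POSIX mutexes.  "struct obj", lock_pair(), unlock_pair() and transfer()
are hypothetical names invented for illustration (they are not kernel
interfaces), and ordering by address is just one possible consistent order.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* hypothetical stand-in for a lockable resource; not the kernel's vnode */
struct obj {
	pthread_mutex_t lock;
	int value;
};

/* take both locks in one global order: lowest address first */
static void
lock_pair(struct obj *a, struct obj *b)
{
	assert(a != b);
	if ((uintptr_t)a > (uintptr_t)b) {
		struct obj *t = a;
		a = b;
		b = t;
	}
	pthread_mutex_lock(&a->lock);
	pthread_mutex_lock(&b->lock);
}

/* release order does not matter for deadlock avoidance */
static void
unlock_pair(struct obj *a, struct obj *b)
{
	pthread_mutex_unlock(&a->lock);
	pthread_mutex_unlock(&b->lock);
}

/*
 * move n units between two objects.  transfer(x, y) and transfer(y, x)
 * may run concurrently without deadlock, because both end up locking
 * the same object first.
 */
static void
transfer(struct obj *from, struct obj *to, int n)
{
	lock_pair(from, to);
	from->value -= n;
	to->value += n;
	unlock_pair(from, to);
}
```

note that this only rules out deadlock.  a thread that sleeps for a long time
between lock_pair() and unlock_pair() still stalls everyone queued behind it.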

at any rate, the LFS folks decided that in this particular case they
didn't need to hold the vnode lock here at all, so we're arguing
design philosophy from here on out.


> > the SVR4 vnode locking design would be a good place to start, since it
> > doesn't intrinsically have this problem.
> 
> I'm up for that discussion.

(first some definitions: "the vnode layer" is the file-system-independent
code such as vn_open().  "the file system" is file-system-dependent code.)

the basic difference is this:

in the 4.4BSD design, the vnode lock is taken by the callers of most VOPs.
further, vnode reuse is done by the vnode layer and uses the vnode lock
to synchronize vnode reuse with the acquisition of new vnode holds.
the advantage to this design is that many file systems don't have to
worry about locking very much, because the vnode layer does it all for them.

in the SVR4 design, the locking of vnode operations is largely unspecified.
there are VOP_RWLOCK() and VOP_RWUNLOCK() which are used by the vnode layer
to provide POSIX read/write semantics, but that's it.  all other locking
is under the control of the file system, including that used in vnode reuse.
since the vnode layer doesn't know what locking is needed in vnode reuse,
it isn't involved in vnode reuse; that is completely under file system control.
vnodes are only made visible outside the file system via controlled interfaces
like VOP_LOOKUP() and VFS_VGET().  knowledge of what vnodes are free is
available only to the file system.
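
the difference in who takes the lock can be sketched in user-space C as
follows.  this is a toy model under stated assumptions: the structure
layouts, the single "getsize" operation and all function names here are
hypothetical, and real VOPs take argument structures rather than a bare
vnode pointer.

```c
#include <assert.h>
#include <pthread.h>

struct vnode;

struct vnodeops {
	int (*vop_getsize)(struct vnode *);	/* hypothetical operation */
};

struct vnode {
	pthread_mutex_t v_lock;		/* used only in the 4.4BSD-style path */
	const struct vnodeops *v_op;
	void *v_data;			/* file-system-private data */
};

/* 4.4BSD style: the vnode layer (the caller) locks around the VOP */
static int
bsd_style_getsize(struct vnode *vp)
{
	int sz;

	pthread_mutex_lock(&vp->v_lock);
	sz = vp->v_op->vop_getsize(vp);
	pthread_mutex_unlock(&vp->v_lock);
	return sz;
}

/* SVR4 style: the vnode layer just calls; any locking is the fs's business */
static int
svr4_style_getsize(struct vnode *vp)
{
	return vp->v_op->vop_getsize(vp);
}

/* file system "a": does no locking itself, relies on the caller's lock */
struct a_node { int size; };

static int
a_getsize(struct vnode *vp)
{
	return ((struct a_node *)vp->v_data)->size;
}

/* file system "b": protects its own data with its own lock */
struct b_node { pthread_mutex_t lock; int size; };

static int
b_getsize(struct vnode *vp)
{
	struct b_node *bp = vp->v_data;
	int sz;

	pthread_mutex_lock(&bp->lock);
	sz = bp->size;
	pthread_mutex_unlock(&bp->lock);
	return sz;
}

static const struct vnodeops a_ops = { a_getsize };
static const struct vnodeops b_ops = { b_getsize };
```

file system "a" is only correct when called through bsd_style_getsize(),
while "b" is correct either way but has to carry its own locking code.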

the train-wreck behaviour of the 4.4BSD design is due to the use of
the same lock for all VOPs and for gaining a hold on a vnode.
most (all?) of the file system VOP implementations are written such that
they don't drop a lock that the caller acquired (which is a good thing
in such designs), which unfortunately blocks new holds being acquired on
vnodes which are involved in a lengthy VOP.  if there were different locks
for these activities, then it would be possible to avoid the train-wrecks.
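
a toy illustration of that point, using C11 atomic flags as try-locks so the
example can show "would block" without actually blocking.  the field names
v_oplock and v_holdlock and all the functions are hypothetical; this is a
sketch of the idea, not how any kernel implements vnode holds.

```c
#include <assert.h>
#include <stdatomic.h>

struct vnode {
	atomic_flag v_oplock;	/* held for the duration of a VOP */
	atomic_flag v_holdlock;	/* protects v_usecount only */
	int v_usecount;
};

static void
vnode_init(struct vnode *vp)
{
	atomic_flag_clear(&vp->v_oplock);
	atomic_flag_clear(&vp->v_holdlock);
	vp->v_usecount = 0;
}

/* returns 0 if the lock was taken, -1 if someone already holds it */
static int
toy_trylock(atomic_flag *l)
{
	return atomic_flag_test_and_set(l) ? -1 : 0;
}

static void
toy_unlock(atomic_flag *l)
{
	atomic_flag_clear(l);
}

/* one lock for everything: a new hold has to wait for any VOP in progress */
static int
hold_single_lock(struct vnode *vp)
{
	if (toy_trylock(&vp->v_oplock) != 0)
		return -1;		/* would block behind the VOP */
	vp->v_usecount++;
	toy_unlock(&vp->v_oplock);
	return 0;
}

/* separate locks: a new hold only contends with other hold/release calls */
static int
hold_split_lock(struct vnode *vp)
{
	if (toy_trylock(&vp->v_holdlock) != 0)
		return -1;
	vp->v_usecount++;
	toy_unlock(&vp->v_holdlock);
	return 0;
}
```

with v_oplock held by a lengthy operation, hold_single_lock() reports that
it would block, while hold_split_lock() succeeds right away.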

the SVR4 interface is good in that it delegates a lot of control to the
file system.  the main way it could really be improved would be if the
VOP_RWLOCK() interface also included the range of the reads or writes
that would be issued before the subsequent VOP_RWUNLOCK().  alternatively,
it would suffice if the file system were permitted to ignore VOP_RWLOCK()
and VOP_RWUNLOCK() entirely and implement POSIX semantics on its own
in VOP_READ() and VOP_WRITE(). (since there isn't any formal specification
of this stuff, it's not really clear if this is legal or not.)
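
as a rough sketch of what passing the range could buy, here is a toy
range-lock table.  everything in it is hypothetical (the names, the fixed
table size, the try-only semantics), and it is not thread-safe as written:
a real implementation would protect the table with its own lock and sleep
on conflict instead of returning -1.

```c
#include <assert.h>
#include <stdbool.h>

#define MAXRANGES 8

struct range {
	long off, len;
	bool write;
	bool used;
};

struct rangelock {
	struct range r[MAXRANGES];
};

/* two requests conflict if they overlap and at least one is a write */
static bool
conflicts(const struct range *r, long off, long len, bool write)
{
	bool overlap = off < r->off + r->len && r->off < off + len;

	return overlap && (write || r->write);
}

/* returns a slot index on success, -1 on conflict or if the table is full */
static int
rangelock_try(struct rangelock *rl, long off, long len, bool write)
{
	int freeslot = -1;

	for (int i = 0; i < MAXRANGES; i++) {
		if (!rl->r[i].used) {
			if (freeslot < 0)
				freeslot = i;
			continue;
		}
		if (conflicts(&rl->r[i], off, len, write))
			return -1;
	}
	if (freeslot < 0)
		return -1;
	rl->r[freeslot] = (struct range){ off, len, write, true };
	return freeslot;
}

static void
rangelock_release(struct rangelock *rl, int slot)
{
	assert(slot >= 0 && slot < MAXRANGES && rl->r[slot].used);
	rl->r[slot].used = false;
}
```

two writers to disjoint ranges of the same file both get in, where a
whole-file lock with no range information would have to serialize them.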

the SVR4 implementation is lacking in that it doesn't provide any generic
locking or vnode reuse implementation, so every file system is forced to
roll its own.

so I'd think the best of both worlds would be an interface like the SVR4 one,
plus the enhancement above, with some helper functions to provide the
generic functionality (i.e. the locking and vnode reuse model) that the
majority of file systems could use if they want to.  this is the philosophy
that I followed in the UBC design, where the interfaces give as much control
as possible to the file system, but where most file systems can just use
a generic implementation, sometimes with a wrapper around the generic code.
this seems to maximize flexibility while minimizing duplication of code.
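
the "generic implementation, sometimes with a wrapper" pattern could look
roughly like this.  all names here (generic_lock(), simplefs, fancyfs, the
two-entry ops table) are made up for illustration, and the "lock" is just a
counter so the example stays small.

```c
#include <assert.h>

struct vnode;

struct vnodeops {
	int (*vop_lock)(struct vnode *);
	int (*vop_unlock)(struct vnode *);
};

struct vnode {
	const struct vnodeops *v_op;
	int v_lockcount;	/* state used by the generic helpers */
	int v_fsstat;		/* file-system-private counter */
};

/* generic helpers that any file system may plug into its ops table */
static int
generic_lock(struct vnode *vp)
{
	vp->v_lockcount++;
	return 0;
}

static int
generic_unlock(struct vnode *vp)
{
	assert(vp->v_lockcount > 0);
	vp->v_lockcount--;
	return 0;
}

/* simplefs: happy with the generic model, uses the helpers directly */
static const struct vnodeops simplefs_ops = {
	generic_lock,
	generic_unlock,
};

/* fancyfs: does some fs-specific work first, then calls the generic code */
static int
fancyfs_lock(struct vnode *vp)
{
	vp->v_fsstat++;		/* stand-in for fs-specific bookkeeping */
	return generic_lock(vp);
}

static const struct vnodeops fancyfs_ops = {
	fancyfs_lock,
	generic_unlock,
};
```

a file system that needs something entirely different is free to supply its
own functions in the ops table and ignore the generic helpers.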

-Chuck