tech-kern archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

lock coupling (was: Re: Cleaning up namei)

(As this is likely to be controversial, I'm giving it its own subthread)

On Wed, Mar 19, 2008 at 02:12:36PM -0700, Bill Stouder-Studenmund wrote:
 > > This specifically abolishes the traditional lock coupling that we (and
 > > ~everyone else) uses while searching down the tree. There is no need
 > > for it: all it prevents is, if two processes are following the same
 > > path, one catching up to and passing the other on the way down.  In
 > > general this is harmless, because lookup is a read operation and read
 > > operations commute freely with one another.
 > I'm not sure I like this. Well, I don't mind namei_parent() returning an 
 > unlocked node. I however think it would be better to keep the lock-walking 
 > we do now and just unlock at the end.

That is definitely pointless. The lock coupling is of even potential
interest only in the end cases. (See below.) 

It is also not neutral. Not having to mess with locks is one of the
things that makes the new cleaner architeecture possible.

 > [...]
 > You're talking about a wide-sweeping change to the kernel. I suggest you 
 > leave the lock stepping as-is for the most part just to improve the 
 > chances of a successful transition. If you can make this proof, you can 
 > always change the locking later.

Sketch of the proof follows:

First, consider the operations that involve pathnames. These all begin
by looking up the pathname and then performing some set of operations
(either read or write) on the target vnode(s) at the end. The target
vnode(s) are either the vnode for the last name in the path, the
second-last name in the path, or both.

Since for all such operations the ultimate operation must be atomic,
to be locked correctly (note that many of the current fs-dependent
implementations of these, particularly of rename, are not locked
correctly) they must lock the parent (directory) vnode, if it's
affected, and then the child (target object) vnode, and then hold both
these locks until the operation is complete.

We shall assume all ultimate operations are locked correctly, because
where they aren't they need to be fixed.

This means that without loss of generality we can write the general
form of a path operation as follows:

   L L L L L L L ... L Op

where L is a single lookup step and Op is the ultimate base operation
on the target vnodes.

Now consider an arbitrary interleaving of two such operations, one
with lookup steps L and op Op, and another with lookup steps M and op

To be correct, any interleaved execution must be equivalent to either

   L L L L L ... Op M M M M M ... Pq


   M M M M M ... Pq L L L L L ... Op

that is, one operation must come strictly before the other.

The first argument is that the sequence "L M" is equivalent to the
sequence "M L". This follows trivially because lookup is a read-only
operation; the state is the same before and after L, so M must produce
the same result regardless of the ordering, and vice versa.

Without loss of generality, assume that Op comes before Pq. Thus every
execution has the form

   (L|M)... Op M M M ... Pq

which is equivalent to

   L L L ... M M M ... Op M M M ... Pq

and so the question is whether the M operations can be moved past Op.
Under what conditions can a lookup return a different result if some
other operation has happened first? Only if Op has either added or
removed the name for the vnode M was going to look up. Adding and
removing other names, or operating on other vnodes entirely, cannot
have any effect.

In the case where the name is removed, moving M after Op will cause M
to fail, where it previously would have succeeded. However, this is
consistent with an execution where the complete M/Pq operation happens
after the L/Op operation: L/Op happens first, removes a directory
M/Pq needs to transit, and thus M/Pq fails with ENOENT; consistency is

Similarly, in the case where the name is added, moving M after Op will
cause M to succeed, where it previously would have failed. But this is
again consistent with the execution where M/Pq happens entirely after

(In both these cases what really ought to be proven is that it is ok
to allow the M/Pq operation to begin looking up at all before the L/Op
operation completes, that is, that "M Op" is still equivalent to "Op
M" even if Op affects the vnodes used by M. As things currently stand,
we simply assume this to be the case; it is not quite as easy to
prove, but it follows from the fact that only empty directories may be

The above only assumes that each lookup L or M is itself an atomic

Lock coupling provides no benefits whatsoever in intermediate places
in the path, because all it does is prevent L and M from exchanging
places, which has no effect.

The only interesting cases are in the final locking for Op and Pq,
where lock coupling prevents certain orderings that are otherwise
possible. The case where you can stat() and get back a statbuf with
nlinks 0 is one of those. Without lock coupling, you can get

   L M L M Unlink Stat

(where L goes with Unlink and M goes with Stat) because M retrieves
the final vnode for stat before Unlink locks both its parent dir and
it and unlinks it; then Stat does the stat and finds that the link
count is 0. To show why this is prohibited in the lock-coupling world
we need a more detailed and less compact notation:

Without lock coupling:

   unlink("/a/b")       stat("/a/b")
   get root
                        get root
   lock root
   lookup a
   unlock root
                        lock root
                        lookup a
                        unlock root
                        lock a
                        lookup b
                        unlock a
   lock a
   lookup b
   lock b
   unlock b
   unlock a
                        lock b
                        stat (finds nlinks 0)
                        unlock b

With lock coupling, the stat doesn't unlock a until after it's locked
b, so if it locks a first it completes before the unlink can happen.
(Alternatively, if the unlink locks a first, it can't lookup b in a
until b has already been removed, and it fails with ENOENT.)

When I say that the proof is a lot of work, I mean that to do it one
has to set up every case like this in both the non-lock-coupling world
(which is comparatively simple) and the lock-coupling world (which
isn't) and demonstrate that they're comparable under some set of
formally specified criteria. And because no such proof would be
remotely credible if just written down on paper, it has to be done
using some theorem-proving tool, and that means one has to teach the
theorem-proving tool about pathnames and vnodes and equality of vnodes
and all kinds of tedious things that aren't really part of the
problem. Which is why I'm not in a huge hurry to do it, although I
still might sometime.

It would be a good bit simpler to just prove that the non-lock-
coupling world is sound, for some definition of sound, except that it
isn't, any more than the lock-coupling world is sound.

The problem with link that I alluded to before is that link looks up
two paths, and so if the two paths overlap, it locks and unlocks the
overlapping portions twice, which means that if something else touches
those vnodes in between it will get an inconsistent view.

Nothing I'm doing by changing namei and lookup is going to either
solve that problem or make it any worse.

 > My understanding was that when we replaced lockmgr locks for vnodes, we 
 > used reader-writer locks. If I understand correctly, we could use reader 
 > locks while walking down.

Yes, but this doesn't matter much.

 > > [...] I think the long-term solution to this and other
 > > problems, like the one I alluded to above in link(), and the issue
 > > with stat(), is to add an additional layer of timestamp-style
 > > synchronization to vnodes; this would allow killing and restarting
 > > *any* inconsistent operation, including those involving lots of "."
 > > and "..". (If you don't know how timestamp synchronization works, I
 > > can explain; or check a database textbook.)
 > I was about to suggest something that I think is about the same. My 
 > suggestion is that we add a counter to a struct vnode and if it's a 
 > directory, we incirment that counter each time we do something that 
 > changes its name space. Like create, rename, or delete of something in it. 

That is the same, yes. Or at least it will be once you sort out the
obvious difficulties. :-)

 > My suggestion is to add it as part of this work. Not everything has to USE 
 > it yet, but make it part of the new name buffering.

I have no objection to doing it (or even doing it first), but I don't
think it should be part of the same "work". It's not directly related,
touches different things, etc., etc.

It can be hidden entirely within vn_lock and vn_unlock, provided of
course that I go through and replace all the direct VOP_UNLOCK calls
with vn_unlock. But I've been intending to do that anyway, because
symmetry is good.

 > Among other things, this will make the "relookup" stuff done in rename 
 > much easier. If the counters (current and saved) are the same, just keep 
 > going. If they differ, restart the whole thing.

What relookup stuff? Did you see any relookup in my new pseudocode? :-)

(When all the rename implementations are corrected -- which to do
properly requires using namei_parent and namei_once instead of the
current namei interface -- there will be no further need for

David A. Holland

Home | Main Index | Thread Index | Old Index