tech-kern archive


Re: lock coupling (was: Re: Cleaning up namei)



On Wed, Mar 19, 2008 at 10:17:34PM -0400, der Mouse wrote:
 > I'm not sure I understand what you're trying to prove.  

That no-lock-coupling is no worse (in terms of possible consistency
violations) than lock-coupling, except perhaps for a couple corner
cases of limited concern, and particularly so for lookups other than
the final step or two before performing a "real" operation.

 > Initial state:
 > 
 > / directory, inum 2
 > /foo directory, inum 3
 > /foo/a symlink to da, inum 4
 > /foo/b symlink to db, inum 5
 > /foo/c directory, inum 6
 > /foo/da directory, inum 7
 > /foo/db empty directory, inum 8
 > /foo/da/sub directory, inum 9
 > /foo/da/sub/x symlink to /foo/b/sub/y, inum 10
 > /foo/da/sub/y symlink to /foo/c/sub/z, inum 11
 > /foo/c/sub directory, inum 12
 > /foo/c/sub/z plain file, inum 13
 > 
 > Consider two operations racing: (1) look up /foo/a/sub/x for plain-file
 > open; (2) rename("/foo/a","/foo/b").  If (1) happens either completely
 > before (2) or completely after (2), it fails, showing ENOENT at one
 > stage or another.

Right, but that's because (even though it doesn't contain "..") it
crosses through at least one inode (specifically, foo (3)) more than
once without holding it locked for the entire operation.

Thus, the second transit through foo may not be consistent with the
first one, and if you wedge in the rename as described it isn't, which
is why you get an inconsistent result out.

The thing is, though, the same thing happens if you do the lock
coupling, because while the rename is going on the lookup thread is
off working in places the rename doesn't touch.

In other words, yes, that's a consistency bug, but no, it's not one
I'm introducing; it already exists, and removing the lock coupling in
lookup is not going to make it any worse. Or any better.

(Digression on the general case of serializability follows.)

To get strong serializability with locks, you need to do two-phase
locking. (That is, not release any locks until you're done acquiring
locks.) This is what DBMSes do. Unfortunately, it's not really all
that feasible for vnodes. With reader-writer locks it wouldn't cause
as much contention as it would with exclusive-only locks, but it would
almost certainly still be enough to be a problem on a busy system. And
just maintaining the list of locks you're holding would be a pain;
you need a much more sophisticated lock manager, probably a deadlock
detector, etc., etc., etc.

One can also get strong serializability by using independent locks
(DBMS guys call them "latches") on every object and using any of
several so-called "timestamp" algorithms to make sure the uses of the
data protected by those locks are consistent. (So-called because the
stamps are rarely actual times of any kind.) This is perfectly
feasible to do in a kernel; in fact, I've done it. It's not entirely
trivial, and there are some catches, but I would not be opposed to
deploying it for vnodes, and I think it can be done without a drastic
change footprint too... provided we allow vn_lock to fail. (hi yamt)

Anyway. In general any violation of two-phase locking can lead to a
consistency problem, including releasing locks early even if you
*don't* reacquire them later. (Even read locks!) There are
preconditions that can make such things not possible, but they aren't
entirely trivial, and also are easily gotten wrong, so I won't go into
them here. I'll just note that one of the ways to *encourage* such
behavior is to have a data structure that can be traversed in opposite
directions, such as a doubly-linked list or a tree with parent
pointers. (Such as a filesystem namespace...)

The following contrived example demonstrates how releasing write locks
early can affect a linked list:

   Suppose you have a doubly-linked list with three nodes, A-C. Each
   node has an integer stored in it, and you want to be able to
   compute the sum of these integers, postincrementing each, starting
   from either end. The naive implementation goes like this:

        forwards:
        tot = 0;
        for (x = A; x; x = x->next) {
                lock(x);
                tot += x->val++;
                unlock(x);
        }

        backwards:
        tot = 0;
        for (x = C; x; x = x->prev) {
                lock(x);
                tot += x->val++;
                unlock(x);
        }

   If you run these together, in processes P and Q, with all the
   values starting at 0, you (can) get this:

        P               Q
        ------------------------
        P.tot = 0;      Q.tot = 0;
        lock(A);        lock(C);
        P.tot = 0;      Q.tot = 0;
        A.val = 1;      C.val = 1;
        unlock(A);      unlock(C);
        lock(B);
        P.tot = 0;
        B.val = 1;
        unlock(B);
                        lock(B);
                        /* B.val is now 1 */
                        Q.tot = 1;
                        B.val = 2;
                        unlock(B);
        lock(C);        lock(A);
        /* C.val and A.val are both 1 */
        P.tot = 1;      Q.tot = 2;
        C.val = 2;      A.val = 2;
        unlock(C);      unlock(A);
        result: 1       result: 2

   Afterwards, the value stored in each node is correct (2 in all
   cases) but processes P and Q got an inconsistent result out. In all
   possible noninterleaved executions, only multiples of 3 can be
   returned.

   Note that in this case lock coupling deadlocks instantly - but even
   if it somehow magically didn't, such as if it was a graph rather
   than a list and there were two distinct middle nodes B1 and B2 for
   each direction... it still wouldn't help. P reads and writes A
   before Q does, and so P comes before Q; but P reads and writes C
   after Q does, so P also comes after Q. There is no way to reconcile
   those propositions. :-)

To come up with an example where releasing read locks early causes a
consistency problem we have to contrive a little harder.

   Suppose you have the same doubly-linked list, but this time each
   node just has a "mark", and you want to be able to search the list
   for the marked node and return it. If no node is marked, mark one
   and return that. And you want for some reason to be able to do this
   from either end. The naive implementation goes like this:

        forwards:
        for (x = A; x; x = x->next) {
                lock(x);
                if (x->next == NULL) {
                        x->marked = 1;
                }
                if (x->marked) {
                        unlock(x);
                        break;
                }
                unlock(x);
        }
        return x;

        backwards:
        for (x = C; x; x = x->prev) {
                lock(x);
                if (x->prev == NULL) {
                        x->marked = 1;
                }
                if (x->marked) {
                        unlock(x);
                        break;
                }
                unlock(x);
        }
        return x;

   If you run *these* together, in processes P and Q, with no marked
   node in the list, you (can) get this:

        P               Q
        ------------------------
        lock(A);        lock(C);
        not marked      not marked
        unlock(A);      unlock(C);
        lock(B);
        not marked
        unlock(B);
                        lock(B);
                        not marked
                        unlock(B);
        lock(C);        lock(A);
        not marked      not marked
        C.marked = 1;   A.marked = 1;
        unlock(C);      unlock(A);
        return C;       return A;

   Now two nodes in the list are marked, which is not equivalent to
   any serial execution.

   Again, lock coupling will deadlock, but even if it didn't, it
   wouldn't help.

   (Note that in both cases I'm assuming the list structure is static,
   so as to avoid worrying about locking the next/prev pointers.)

I think it likely that a similar example can be cooked up for vnodes,
but at the moment I don't feel inclined to prepare it. Consider it an
exercise for the reader. :-)

 > So, I've probably misunderstood something....

Nah, just the context... that's a pretty good counterexample,
actually. It's just one for something I wasn't trying to prove. :-)

-- 
David A. Holland
dholland%netbsd.org@localhost

