Source-Changes-HG archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

[src/trunk]: src genfs_rename: Fix deadlocks in cross-directory cyclic rename.



details:   https://anonhg.NetBSD.org/src/rev/52d4c0b100aa
branches:  trunk
changeset: 938237:52d4c0b100aa
user:      riastradh <riastradh%NetBSD.org@localhost>
date:      Sat Sep 05 02:47:03 2020 +0000

description:
genfs_rename: Fix deadlocks in cross-directory cyclic rename.

Reproducer:

A: for (;;) { mkdir("c", 0600); mkdir("c/d", 0600); mkdir("c/d/e", 0600);
    rmdir("c/d/e"); rmdir("c/d"); }
B: for (;;) { mkdir("c", 0600); mkdir("c/d", 0600); mkdir("c/d/e", 0600);
    rename("c", "c/d/e"); }
C: for (;;) { mkdir("c", 0600); mkdir("c/d", 0600); mkdir("c/d/e", 0600);
    rename("c/d/e", "c"); }

Deadlock:

- A holds c and wants to lock d; and either
- B holds . and d and wants to lock c, or
- C holds . and d and wants to lock c.

The problem with these is that genfs_rename_enter_separate in B or C
tried lock order .->d->c->e (in A/B, fdvp->tdvp->fvp->tvp; in A/C,
tdvp->fdvp->tvp->fvp) which violates the ancestor->descendant order
.->c->d->e.

The resolution is to change B to do fdvp->fvp->tdvp->tvp and C to do
tdvp->tvp->fdvp->fvp.  But there's an edge case: tvp and fvp might be
the same (hard links), and we can't detect that until after we've
looked them both up -- and in some file systems (I'm looking at you,
ufs), there is no mere lookup operation, only lookup-and-lock, so we
can't even hold the lock on one of tvp or fvp when we look up the
other one if there's a chance they might be the same.

Fortunately the cases
(a) tvp = fvp
(b) tvp or fvp is a directory
are mutually exclusive as long as directories cannot be hard-linked.
In case (a) we can just defer locking {tvp, fvp} until the end, because
it can't possibly have {fdvp or fvp, tdvp or tvp} as descendants.  In
case (b) we can just lock them in the order fdvp->fvp->tdvp->tvp or
tdvp->tvp->fdvp->fvp if the first one of {fvp, tvp} is a directory,
because it can't possibly coincide with the second one of {fvp, tvp}.

With this change, we can now prove that the locking order is consistent
with the ancestor->descendant partial ordering.  Where two nodes are
incommensurate under that partial ordering, they are only ever locked
by rename and there is only ever one rename at a time.

Proof:

- For same-directory renames, genfs_rename_enter_common locks the
  directory first and then the children.  The order
  directory->child[i] is consistent with ancestor->descendant and
  child[0]/child[1] are incommensurate.

- For cross-directory renames:

  . While a rename is in progress and the fs-wide rename lock is held,
    directories can be created or removed but not changed, so the
    outcome of gro_genealogy -- which, given fdvp and tdvp, returns
    the node N relating fdvp/N/.../tdvp or null if there is none --
    can only transition from finding N to not finding N, if one of
    the directories is removed while any of the vnodes are unlocked.
    Merely creating directories cannot change the ancestry of tdvp,
    and concurrent renames are not possible.

    Thus, if a gro_genealogy determined the operation to have the
    form fdvp/N/.../tdvp, then it might cease to have that form, but
    only because tdvp was removed which will harmlessly cause the
    rename to fail later on.  Similarly, if gro_genealogy determined
    the operation _not_ to have the form fdvp/N/.../tdvp then it
    can't begin to have that form until after the rename has
    completed.

    The lock order is,

    => for fdvp/.../tdvp:
       1. lock fdvp
       2. lookup(/lock/unlock) fvp (consistent with fdvp->fvp)
       3. lock fvp if a directory (consistent with fdvp->fvp)
       4. lock tdvp (consistent with fdvp->tdvp and possibly fvp->tdvp)
       5. lookup(/lock/unlock) tvp (consistent with tdvp->tvp)
       6. lock fvp if a nondirectory (fvp->t* or fvp->fdvp is impossible)
       7. lock tvp if not fvp (tvp->f* is impossible unless tvp=fvp)

    => for incommensurate fdvp & tdvp, or for tdvp/.../fdvp:
       1. lock tdvp
       2. lookup(/lock/unlock) tvp (consistent with tdvp->tvp)
       3. lock tvp if a directory (consistent with tdvp->tvp)
       4. lock fdvp (either incommensurate with tdvp and/or tvp, or
          consistent with tdvp(->tvp)->fdvp)
       5. lookup(/lock/unlock) fvp (consistent with fdvp->fvp)
       6. lock tvp if a nondirectory (tvp->f* or tvp->tdvp is impossible)
       7. lock fvp if not tvp (fvp->t* is impossible unless fvp=tvp)

Deadlocks found by hannken@; resolution worked out with dholland@.

XXX I think we could improve concurrency somewhat -- with a likely
big win for applications like tar and rsync that create many files
with temporary names and then rename them to the permanent one in the
same directory -- by making vfs_renamelock a reader/writer lock: any
number of same-directory renames, or exactly one cross-directory
rename, at any one time.

diffstat:

 sys/miscfs/genfs/genfs_rename.c |  113 +++++++++++++++++++++++++++++++--------
 tests/fs/vfs/t_renamerace.c     |   10 +---
 2 files changed, 89 insertions(+), 34 deletions(-)

diffs (234 lines):

diff -r c2c39a9e8d0f -r 52d4c0b100aa sys/miscfs/genfs/genfs_rename.c
--- a/sys/miscfs/genfs/genfs_rename.c   Sat Sep 05 02:45:22 2020 +0000
+++ b/sys/miscfs/genfs/genfs_rename.c   Sat Sep 05 02:47:03 2020 +0000
@@ -1,4 +1,4 @@
-/*     $NetBSD: genfs_rename.c,v 1.4 2020/05/16 18:31:51 christos Exp $        */
+/*     $NetBSD: genfs_rename.c,v 1.5 2020/09/05 02:47:03 riastradh Exp $       */
 
 /*-
  * Copyright (c) 2012 The NetBSD Foundation, Inc.
@@ -37,7 +37,7 @@
  */
 
 #include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: genfs_rename.c,v 1.4 2020/05/16 18:31:51 christos Exp $");
+__KERNEL_RCSID(0, "$NetBSD: genfs_rename.c,v 1.5 2020/09/05 02:47:03 riastradh Exp $");
 
 #include <sys/param.h>
 #include <sys/kauth.h>
@@ -713,12 +713,26 @@
 }
 
 /*
- * genfs_rename_lock: Lock directories a and b, which must be distinct,
- * and look up and lock nodes a and b.  Do a first and then b.
- * Directory b may not be an ancestor of directory a, although
- * directory a may be an ancestor of directory b.  Fail with
- * overlap_error if node a is directory b.  Neither componentname may
- * be `.' or `..'.
+ * genfs_rename_lock: Lookup and lock it all.  The lock order is:
+ *
+ *     a_dvp -> a_vp -> b_dvp -> b_vp,
+ *
+ * except if a_vp is a nondirectory in which case the lock order is:
+ *
+ *     a_dvp -> b_dvp -> b_vp -> a_vp,
+ *
+ * which can't violate ancestor->descendant because a_vp has no
+ * descendants in this case.  This edge case is necessary because some
+ * file systems can only lookup/lock/unlock, and we can't hold a_vp
+ * locked when we lookup/lock/unlock b_vp if they turn out to be the
+ * same, and we can't find out that they're the same until after the
+ * lookup.
+ *
+ * b_dvp must not be an ancestor of a_dvp, although a_dvp may be an
+ * ancestor of b_dvp.
+ *
+ * Fail with overlap_error if node a is directory b.  Neither
+ * componentname may be `.' or `..'.
  *
  * a_dvp and b_dvp must be referenced.
  *
@@ -766,6 +780,9 @@
        KASSERT(b_dvp->v_mount == mp);
        KASSERT(a_missing_ok != b_missing_ok);
 
+       /*
+        * 1. Lock a_dvp.
+        */
        error = ops->gro_lock_directory(mp, a_dvp);
        if (error)
                goto fail0;
@@ -776,6 +793,9 @@
                goto fail1;
        }
 
+       /*
+        * 2. Lookup a_vp.  May lock/unlock a_vp.
+        */
        error = ops->gro_lookup(mp, a_dvp, a_cnp, a_de_ret, &a_vp);
        if (error) {
                if (a_missing_ok && (error == ENOENT))
@@ -791,6 +811,7 @@
                        goto fail2;
                }
 
+               /* Reject rename("x", "x/y") or rename("x/y", "x").  */
                if (a_vp == b_dvp) {
                        error = overlap_error;
                        goto fail2;
@@ -800,32 +821,67 @@
        KASSERT(a_vp != a_dvp);
        KASSERT(a_vp != b_dvp);
 
+       /*
+        * 3. Lock a_vp, if it is a directory.
+        *
+        * We already ruled out a_vp == a_dvp (i.e., a_cnp is `.'), so
+        * this is not locking against self, and we already ruled out
+        * a_vp == b_dvp, so this won't cause subsequent locking of
+        * b_dvp to lock against self.
+        *
+        * If a_vp is a nondirectory, we can't hold it when we lookup
+        * b_vp in case (a) the file system can only lookup/lock/unlock
+        * and (b) b_vp turns out to be the same file as a_vp due to
+        * hard links -- and we can't even detect that case until after
+        * we've looked up b_vp.  Fortunately, if a_vp is a
+        * nondirectory, then it is a leaf, so we can safely lock it
+        * last.
+        */
+       if (a_vp != NULL && a_vp->v_type == VDIR) {
+               vn_lock(a_vp, LK_EXCLUSIVE | LK_RETRY);
+               KASSERT(a_vp->v_mount == mp);
+               /* Refuse to rename (over) a mount point.  */
+               if (a_vp->v_mountedhere != NULL) {
+                       error = EBUSY;
+                       goto fail3;
+               }
+       }
+
+       /*
+        * 4. Lock b_dvp.
+        */
        error = ops->gro_lock_directory(mp, b_dvp);
        if (error)
-               goto fail2;
+               goto fail3;
 
        /* Did we lose a race with mount?  */
        if (b_dvp->v_mountedhere != NULL) {
                error = EBUSY;
-               goto fail3;
+               goto fail4;
        }
 
+       /*
+        * 5. Lookup b_vp.  May lock/unlock b_vp.
+        */
        error = ops->gro_lookup(mp, b_dvp, b_cnp, b_de_ret, &b_vp);
        if (error) {
                if (b_missing_ok && (error == ENOENT))
                        b_vp = NULL;
                else
-                       goto fail3;
+                       goto fail4;
        } else {
                KASSERT(b_vp != NULL);
 
                /* Refuse to rename (over) `.'.  */
                if (b_vp == b_dvp) {
                        error = b_dot_error;
-                       goto fail4;
+                       goto fail5;
                }
 
-               /* b is not an ancestor of a.  */
+               /*
+                * b_dvp must not be an ancestor of a_dvp, so if we
+                * find b_dvp/b_vp=a_dvp/a_vp something is wrong.
+                */
                if (b_vp == a_dvp) {
                        /*
                         * We have a directory hard link before us.
@@ -833,26 +889,31 @@
                         * Panic?
                         */
                        error = EIO;
-                       goto fail4;
+                       goto fail5;
                }
        }
        KASSERT(b_vp != b_dvp);
        KASSERT(b_vp != a_dvp);
 
        /*
-        * We've looked up both nodes.  Now lock them and check them.
+        * 6. Lock a_vp, if it is a nondirectory.
+        *
+        * In this case a_vp is a leaf, so it is either equal to or
+        * incommensurate with b_vp, and so we can safely lock it at
+        * any point now.
         */
-
-       if (a_vp != NULL) {
+       if (a_vp != NULL && a_vp->v_type != VDIR) {
                vn_lock(a_vp, LK_EXCLUSIVE | LK_RETRY);
                KASSERT(a_vp->v_mount == mp);
-               /* Refuse to rename (over) a mount point.  */
-               if ((a_vp->v_type == VDIR) && (a_vp->v_mountedhere != NULL)) {
-                       error = EBUSY;
-                       goto fail5;
-               }
+               /* (not a directory so can't have anything mounted here) */
        }
 
+       /*
+        * 7. Lock b_vp, if it is not a_vp.
+        *
+        * b_vp and a_vp may the same inode if they are hard links to
+        * one another.
+        */
        if ((b_vp != NULL) && (b_vp != a_vp)) {
                vn_lock(b_vp, LK_EXCLUSIVE | LK_RETRY);
                KASSERT(b_vp->v_mount == mp);
@@ -876,11 +937,13 @@
 
 fail6: if ((b_vp != NULL) && (b_vp != a_vp))
                VOP_UNLOCK(b_vp);
-fail5: if (a_vp != NULL)
+       if (a_vp != NULL && a_vp->v_type != VDIR)
                VOP_UNLOCK(a_vp);
-fail4: if (b_vp != NULL)
+fail5: if (b_vp != NULL)
                vrele(b_vp);
-fail3: VOP_UNLOCK(b_dvp);
+fail4: VOP_UNLOCK(b_dvp);
+fail3: if (a_vp != NULL && a_vp->v_type == VDIR)
+               VOP_UNLOCK(a_vp);
 fail2: if (a_vp != NULL)
                vrele(a_vp);
 fail1: VOP_UNLOCK(a_dvp);
diff -r c2c39a9e8d0f -r 52d4c0b100aa tests/fs/vfs/t_renamerace.c
--- a/tests/fs/vfs/t_renamerace.c       Sat Sep 05 02:45:22 2020 +0000
+++ b/tests/fs/vfs/t_renamerace.c       Sat Sep 05 02:47:03 2020 +0000
@@ -1,4 +1,4 @@
-/*     $NetBSD: t_renamerace.c,v 1.37 2020/09/05 02:45:22 riastradh Exp $      */
+/*     $NetBSD: t_renamerace.c,v 1.38 2020/09/05 02:47:03 riastradh Exp $      */
 
 /*
  * Modified for rump and atf from a program supplied
@@ -270,14 +270,6 @@
        sleep(10);
        quittingtime = 1;
 
-       if (FSTYPE_EXT2FS(tc) ||
-           FSTYPE_FFS(tc) ||
-           FSTYPE_FFSLOG(tc) ||
-           FSTYPE_LFS(tc) ||
-           FSTYPE_TMPFS(tc) ||
-           FSTYPE_UDF(tc))
-               atf_tc_expect_signal(SIGALRM, "genfs_rename is busted");
-
        alarm(1);
        pthread_join(pt_rmdir, NULL);
        pthread_join(pt_rename1, NULL);



Home | Main Index | Thread Index | Old Index