Subject: kern/33861: NFS rename race condition data loss
To: kern-bug-people@netbsd.org, gnats-admin@netbsd.org
From: jld@panix.com
List: netbsd-bugs
Date: 06/29/2006 05:30:00
>Number:         33861
>Category:       kern
>Synopsis:       NFS rename race condition; impact is the loss of the file being renamed.
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Jun 29 05:30:00 +0000 2006
>Originator:     Jed Davis
>Release:        NetBSD 3.0
>Organization:
PANIX Public Access Internet and UNIX, NYC
>Environment:
System: NetBSD panix5.panix.com 3.0 NetBSD 3.0 (PANIX-FIVE) #0: Fri Apr 14 21:05:29 EDT 2006 root@juggler.panix.com:/devel/netbsd/3.0/src/sys/arch/i386/compile/PANIX-FIVE i386
Architecture: i386
Machine: i386
>Description:

If the destination appears to exist and is "in use" (this check is
apparently satisfied even if the file isn't in use by anything except
the rename itself), nfs_rename() sillyrenames it, then deletes the
sillyrenamed file even if the rename itself fails -- for instance,
because the "from" file no longer exists on the server.

More concretely, this can occur when two rename()s with the same
arguments on the same client overlap in their execution; the first one
does the rename and succeeds, while the second one sees the destination
as existing and deletes it as above.

More detail: the key to the race seems to be that the second renamer
finishes namei'ing the "from" path in rename_files() before the first
one has flushed the lookup cache in nfs_rename().

>How-To-Repeat:

mkdir a b; touch a/x; perl -e 'fork(); rename("a/x","b/x") or die "$!\n"'

Afterwards, neither a/x nor b/x will exist.  This appears to require
an NFS server under a certain amount of load, perhaps so that responses
take long enough to trigger the race.  We have such a server, and have
so far reproduced this every time.  I'm certain it's not a server
issue, as I have a tcpdump trace in which the client does:

1) Lookup of b/x; fails with NOENT.
2) Rename from a/x to b/x; succeeds.
3) Lookup of b/x; fails with NOENT.
4) Rename from b/x to b/.nfsA23a3; succeeds.
5) Rename from a/x to b/x; fails with NOENT.
6) Remove of b/.nfsA23a3; succeeds.
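
As a sanity check on the trace, the namespace-changing requests (steps
2, 4, 5 and 6) can be replayed against a toy in-memory model of the
server; the lookups are left out because they do not modify anything.
This is only a sketch: struct toy_server and the toy_*/replay_trace
names are made up for illustration and are not part of any NFS code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_SLOTS   4
#define TOY_NAMELEN 32

/* Hypothetical stand-in for the server's namespace: a few path slots,
 * where an empty string marks a free slot. */
struct toy_server {
    char names[TOY_SLOTS][TOY_NAMELEN];
};

/* Return the slot holding path, or -1 if it does not exist (NOENT). */
static int toy_find(const struct toy_server *s, const char *path)
{
    for (int i = 0; i < TOY_SLOTS; i++)
        if (s->names[i][0] != '\0' && strcmp(s->names[i], path) == 0)
            return i;
    return -1;
}

/* RENAME: -1 if the source is missing; replaces any existing target. */
static int toy_rename(struct toy_server *s, const char *from, const char *to)
{
    int src = toy_find(s, from);
    if (src < 0)
        return -1;
    int dst = toy_find(s, to);
    if (dst >= 0 && dst != src)
        s->names[dst][0] = '\0';
    snprintf(s->names[src], TOY_NAMELEN, "%s", to);
    return 0;
}

/* REMOVE: -1 if the path is missing. */
static int toy_remove(struct toy_server *s, const char *path)
{
    int i = toy_find(s, path);
    if (i < 0)
        return -1;
    s->names[i][0] = '\0';
    return 0;
}

/* Replay steps 2, 4, 5 and 6 of the trace. rc[] receives 0 (success)
 * or -1 (NOENT) for each; the return value is how many of a/x and b/x
 * still exist afterwards. */
int replay_trace(int rc[4])
{
    struct toy_server s;

    memset(&s, 0, sizeof s);
    snprintf(s.names[0], TOY_NAMELEN, "%s", "a/x");   /* touch a/x */
    assert(toy_find(&s, "b/x") < 0);                  /* b/x absent at start */

    rc[0] = toy_rename(&s, "a/x", "b/x");             /* step 2 */
    rc[1] = toy_rename(&s, "b/x", "b/.nfsA23a3");     /* step 4 */
    rc[2] = toy_rename(&s, "a/x", "b/x");             /* step 5 */
    rc[3] = toy_remove(&s, "b/.nfsA23a3");            /* step 6 */

    return (toy_find(&s, "a/x") >= 0) + (toy_find(&s, "b/x") >= 0);
}
```

In this model replay_trace() reports success, success, NOENT, success
and returns 0, which matches neither a/x nor b/x being left behind.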

Alternatively: if two instances of Courier-IMAP both check for new mail in
the same maildir at the same time, and find some, they'll both try to
rename it into the "cur" directory at the same time, and the mail will
be lost.  (This, however, appears to depend much more on luck.)
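
Where Perl isn't handy, the one-liner can be approximated in C.  This
is a sketch only and has not been tried against the NFS setup described
above; make_tree(), race_rename() and path_exists() are names invented
for this example.  On a local filesystem exactly one of the two
rename() calls is expected to succeed and the file should end up at the
destination; the bug would show up as the destination being missing.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: create base/, base/a, base/b and an empty
 * base/a/x (the "mkdir a b; touch a/x" part).  0 on success, -1 on error. */
int make_tree(const char *base)
{
    char path[512];
    int fd;

    if (mkdir(base, 0700) != 0)
        return -1;
    snprintf(path, sizeof path, "%s/a", base);
    if (mkdir(path, 0700) != 0)
        return -1;
    snprintf(path, sizeof path, "%s/b", base);
    if (mkdir(path, 0700) != 0)
        return -1;
    snprintf(path, sizeof path, "%s/a/x", base);
    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;
    close(fd);
    return 0;
}

/* fork(), then call rename(from, to) in both parent and child.
 * Returns how many of the two calls succeeded (0..2), or -1 if
 * fork() or waitpid() failed. */
int race_rename(const char *from, const char *to)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)   /* child: report its rename result via exit status */
        _exit(rename(from, to) == 0 ? 0 : 1);

    int parent_ok = (rename(from, to) == 0);
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    int child_ok = WIFEXITED(status) && WEXITSTATUS(status) == 0;
    return parent_ok + child_ok;
}

/* 1 if path exists, 0 otherwise. */
int path_exists(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0;
}
```

After make_tree(base) and race_rename("<base>/a/x", "<base>/b/x"),
checking path_exists() on both names shows whether the file survived.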

>Fix:

To fix just this case, it seems to be enough that the last-component
lookup on the "from" path be done after the "to" path has been looked up
(and the "to" directory's vnode lock has been taken); the least painful
way to do this seems to be to have nfs_rename re-do that LOOKUP before
going ahead with the sillyrename.  (For the case of multiple clients,
it'll have to bypass any caches.)
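
The effect of re-doing the lookup can be illustrated with a small
userland model.  This is not kernel code: struct toy_state and
toy_nfs_rename() are hypothetical, and the model assumes the re-done
LOOKUP sees the server's true state (i.e. bypasses any caches).

```c
#include <stdbool.h>

/* Hypothetical model of what the server holds for one rename. */
struct toy_state {
    bool from_exists;    /* a/x */
    bool to_exists;      /* b/x */
    bool silly_exists;   /* the sillyrenamed copy of b/x */
};

/* Model of the rename path described in this report.  With
 * recheck_from set, the "from" name is looked up again before the
 * sillyrename.  Returns 0 on success, -1 for NOENT. */
int toy_nfs_rename(struct toy_state *s, bool recheck_from)
{
    bool sillied = false;
    int rc = -1;

    if (s->to_exists) {
        if (recheck_from && !s->from_exists)
            return -1;               /* fail early; destination untouched */
        s->to_exists = false;        /* sillyrename the destination */
        s->silly_exists = true;
        sillied = true;
    }

    if (s->from_exists) {            /* RENAME a/x -> b/x */
        s->from_exists = false;
        s->to_exists = true;
        rc = 0;
    }

    if (sillied)                     /* removed whether or not RENAME worked */
        s->silly_exists = false;
    return rc;
}
```

Starting from the state the second renamer finds (a/x gone, b/x
present), the call fails either way, but b/x only survives when
recheck_from is set.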

But that won't help for, say, rename("a/x","b/x") racing with
rename("a/x","c/x"), where all of {a,b,c}/x exist beforehand.  Or, on
separate clients, two rename("a/x","b/x")s where both files exist.

Ideally, the sillyrenaming would be rolled back if the rename fails.
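
A rough userland sketch of that rollback follows.  The names
(struct toy_state, toy_rename_with_rollback) are hypothetical, and the
model ignores that in reality the rollback would be another RENAME that
could itself fail or race with other clients.

```c
#include <stdbool.h>

/* Hypothetical model of what the server holds for one rename. */
struct toy_state {
    bool from_exists;    /* a/x */
    bool to_exists;      /* b/x */
    bool silly_exists;   /* the sillyrenamed copy of b/x */
};

/* Sillyrename the destination if present, attempt the rename, and put
 * the destination back if the rename fails.  Returns 0 on success,
 * -1 for NOENT. */
int toy_rename_with_rollback(struct toy_state *s)
{
    bool sillied = false;

    if (s->to_exists) {              /* sillyrename the destination */
        s->to_exists = false;
        s->silly_exists = true;
        sillied = true;
    }

    if (!s->from_exists) {           /* RENAME a/x -> b/x fails */
        if (sillied) {               /* roll back: restore b/x */
            s->silly_exists = false;
            s->to_exists = true;
        }
        return -1;
    }

    s->from_exists = false;          /* RENAME a/x -> b/x succeeds */
    s->to_exists = true;
    if (sillied)
        s->silly_exists = false;     /* old destination can now go */
    return 0;
}
```

In the rename("a/x","c/x") case above, the losing rename("a/x","b/x")
would find a/x gone and b/x present; with the rollback it still fails,
but b/x is left in place.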