netbsd-bugs: link() apparently nonatomic

Subject: link() apparently nonatomic
To: None <netbsd-bugs@NetBSD.ORG>
From: Liz Stokes <ilaine@panix.com>
List: netbsd-bugs
Date: 04/17/1996 19:52:01
	We are using a link() based locking routine for use over NFS on our
web servers running NetBSD . This routine has been failing with multiple
processes gaining the lock simultaniously. The bug is fairly easily
reproducible via a perl cgi script under the web server. It is somewhat more
difficult to reproduce running the perl script while logged into the
same machine. I have not been able to reproduce it running a c program
which does essentially the same thing as the perl script.  These tests were
done on a local volume, not over NFS. Kernel is version 1.1A

	When the failure occurs, it is always 2 (or sometimes 3) processes
claiming to have gotten the lock at the same instant, not one coming along
later and falsly claiming to have gotten the lock. This and the fact that
the bug comes out under the heavy load of the extra perl processes, and
more so with the httpd processes on top of that imply that lock() is
subject to a race condition.

	I have appended the nfslock() and nfsunlock() routines with
comments, and the perl script. I haven't included the driver routine the
perl script calls (all it does is call the library routines and return the
result) nor the misc other routines in the nfslock library for brevity's
sake.

Liz Stokes
Public Access Networks

/*
 *
 * function: nfslock
 * original author: Chuck Cranor <chuck@maria.wustl.edu>
 * rewritten by: Alexis Rosen <alexis@panix.com>
 *
 * (Excerpts from Chuck's notes:
 *   this becomes complex, due to our dear friend, the NFS mounted mail spool.
 *   the netbsd code didn't do this properly, as far as I could tell.
 *
 *   - you can't trust exclusive creating opens over NFS, the protocol
 *   just doesn't support it.   so to do a lock you have to create
 *   a tmp file and then try and hard link it to your lock file.
 *   - to detect a stale lock file you have to see how old it is, but
 *   you can't use time(0) because that is the time on the local system
 *   and the file gets the times of the NFS server.  when is a lock
 *   file stale?  people seem to like 120 or 300 seconds.)
 *
 * NB: It is _critical_ that nfslock()ed files be unlocked by nfsunlock().
 * Simply unlinking the lock file is a good way to trash someone else's lock
 * file. All it takes is for the process doing the unlink to get hung for
 * a few minutes when it doesn't expect it. Meanwhile, its lock expires and
 * a second process forces the lock and creates its own. Then the first
 * process comes along and kills the second process' lock while it's still
 * valid.
 *
 * Security considerations:
 * If we're root, be very careful to see that the temp file we opened is
 * what we think it is. The problem is that we could lose a race with
 * someone who takes our tmp file and replaces it with, say, a hard
 * link to /etc/passwd. Then, if the first lock attempt fails, we'll
 * write a char to the file (see 4. below); this would truncate the
 * passwd file. So we make sure that the link count is 1. We don't really
 * care about any other screwing around since we don't write anything
 * sensitive to the lock file, nor do we change its owner or mode. If
 * someone beats us on a race and replaces our temp file with anything
 * else, it's no big deal- the file may get truncated, but there's no
 * possible security breach. ...Actually the possibility of the race
 * ever happening, given the random name of the file, is virtually nil.
 *
 * args: path = path to directory of lock file (/net/u/1/a/alexis/.mailspool)
 *       namelock = file name of lock file (alexis.lock)
 *	 max_age = age of lockfile, in seconds, after which the lock is stale.
 *		stale locks are always broken. Defaults to DEFAULT_LOCKTIME
 *		if zero. Panix mail locks go stale at 300 seconds, the default.
 *       notify = 1 if we should tell stdout that we're sleeping on a lock
 *
 * Returns the time that the lock was created on the other system. This is
 * important for nfsunlock(). If the lock already exists, returns NFSL_LOCKED.
 * If there is some other failure, return NFSL_SYSF. If NFSL_LOCKED is
 * returned, errno is also set to EEXIST. If we're root and the link count
 * on the tmp file is wrong, return NFSL_SECV.
 *
 * Mods of 7/13/95: Change a bit of code to re-stat the lockfile after
 * closing it. This is to work around a bug in SunOS that appears to to affect
 * some SunOS 4.1.3 machines (but not all). The bug is that close() updates
 * the stat st_ctime field for that file. So use lstat on fullpath instead
 * of fstat on tmpfd. This alteration applies to both nfslock and nfslock1.
 *
 * Mod of 5/4/95: Change printf's to fprintf(stderr... in nfslock and nfslock1.
 *
 * Mods of 4/29/95: Fix freeing memory before use if a stat fails. Remove
 * code that forbids running as root; instead, if root, check link count on
 * tmp file after opening it.
 *
 * Mods of 4/27/95: Return the create time instead of the lockfile's fd, which
 * is useless. Added new routines nfsunlock(), nfslock_test(), nfslock_renew().
 *
 * Mods of 1/8/95: Eliminate some security checks since this code never
 * runs as root. In particular, we completely eliminate the safeopen
 * routine. But add one check: if we _are_ root, fail immediately.
 *
 * Change arguments: take a path and a filename. Don't assume a global or
 * macro pointing to a mailspool.
 *
 * Add notify argument; if 1, tell user when we're waiting for a lock.
 *
 * Add max_age argument and DEFAULT_LOCKTIME.
 *
 * Change comments drastically.
 *
 */

time_t nfslock(path, namelock, max_age, notify)

char *path, *namelock;
int max_age, notify;

{
  int tries = 0, oldlck = 0, tmpfd;
  struct stat old_stat, our_tmp;
  char tmp[32];
  char *tpath, *fullpath;

  srandom(getpid()+time(0));
  max_age = max_age ? max_age : DEFAULT_LOCKTIME;

  /*
   * 1. create a tmp file with a psuedo random file name. we also make
   *    tpath which is a buffer to store the full pathname of the tmp file.
   */

  sprintf(tmp,"slock%d.%d", getpid(), random());

  tpath = malloc(strlen(path) + 1 + strlen(tmp) + 1);
  if (tpath == NULL) return(NFSL_SYSF);
  sprintf(tpath,"%s/%s", path, tmp);

  tmpfd = open(tpath, O_CREAT|O_WRONLY|O_EXCL, S_IRUSR|S_IWUSR);

  if (tmpfd < 0) {	/* open failed */
    close(tmpfd);
    unlink(tpath);
    free(tpath);
    return(NFSL_SYSF);
  }

  if (getuid()==0) {	/* we're root, so be careful! */
    if (fstat(tmpfd, &our_tmp) < 0) {	/* stat failed... shouldn't happen */
      close(tmpfd);
      unlink(tpath);
      free(tpath);
      return(NFSL_SYSF);
    }
    if (our_tmp.st_nlink != 1) {	/* someone's trying to mess with us */
      fprintf(stderr,"nfslock: bad link count on %s\n",tpath);
      close(tmpfd);
      unlink(tpath);
      free(tpath);
      return(NFSL_SECV);
    }
  }

  /*
   * 2. make fullpath, a buffer for the full pathname of the lock file.
   *    then start looping trying to lock it
   */

  fullpath = malloc(strlen(path) + 1 + strlen(namelock) + 1);
  if (fullpath == NULL) {
    close(tmpfd);
    unlink(tpath);
    free(tpath);
    return(NFSL_SYSF);
  }
  sprintf(fullpath,"%s/%s", path, namelock);

  while (tries < 10) {

    /*
     * 3. link tmp file to lock file.  if it goes, we win and we clean
     *    up and return the st_ctime of the lock file.
     */

    if (link(tpath, fullpath) == 0) {
      unlink(tpath); /* got it! */
      free(tpath);
      close(tmpfd);
      if (lstat(fullpath, &our_tmp) < 0) {	/* stat failed... shouldn't happen */
	unlink(fullpath);
        free(fullpath);
	return (NFSL_SYSF);
      }
      free(fullpath);
      return(our_tmp.st_ctime);
    }

    /*
     * 4. the lock failed.  check for a stale lock file, being mindful
     *    of NFS and the fact the time is set from the NFS server.  we
     *    do a write on the tmp file to update its time to the server's
     *    idea of "now."
     */

    oldlck = lstat(fullpath, &old_stat);

    if (write(tmpfd, "\0", 1) != 1 || fstat(tmpfd, &our_tmp) < 0)
      break; /* something bogus is going on */

    if (oldlck != -1 && old_stat.st_ctime + max_age < our_tmp.st_ctime) {
      unlink(fullpath); /* break the stale lock */
      if (notify) fprintf(stderr,"Breaking stale lock on %s.\n",fullpath);
      tries++;
      sleep(1+(random() % 10));	/* It is CRITICAL that we sleep after breaking
				 * the lock. Otherwise, we could race with
				 * another process and unlink it's newly-
				 * created file.
				 */
      continue;
    }

    /*
     * 5. try again
     */

    if (notify) {
      if (tries==0) fprintf(stderr,"Waiting for lock on file %s.\n",fullpath);
      else fprintf(stderr,"Still waiting...\n");
    }
    tries++;
    sleep(1+(random() % 10));
  }

  /*
   * 6. give up, failure.
   */

  errno = EEXIST;
  unlink(tpath);
  free(tpath);
  free(fullpath);
  close(tmpfd);
  return(NFSL_LOCKED);
}

/*
 * function: nfsunlock
 * author: Alexis Rosen <alexis@panix.com>
 *
 * Unlock an nfslock()ed file.
 *
 * This can get tricky because the lock may have expired (perhaps even
 * during a process that should be "atomic"). We have to make sure we don't
 * unlock some other process' lock, and return a panic code if we think our
 * lock file has been broken illegally. What's done in reaction to that panic
 * (of anything) is up to the caller. See the comments on nfslock()!
 *
 * args: path = path to directory of lock file (/net/u/1/a/alexis/.mailspool)
 *       namelock = file name of lock file (alexis.lock)
 *       max_age = age of lockfile, in seconds, after which the lock is stale.
 *		stale locks are always broken. Defaults to DEFAULT_LOCKTIME
 *		if zero. Panix mail locks go stale at 300 seconds, the default.
 *	 birth = time the lock was created (as returned by nfslock()).
 *
 * Returns NFSL_OK if successful, NFSL_LOST if the lock has been lost
 * legitimately (because more than max_age has passed since the lock was
 * created), and NFSL_STOLEN if it's been tampered with illegally (i.e.
 * while this program is within the expiry period). Returns NFSL_SYSF if
 * another system failure prevents it from even trying to unlock the file.
 *
 * Note that for many programs, a return code of NFSL_LOST or NFSL_STOLEN is
 * equally disastrous; a NFSL_STOLEN means that some other program may have
 * trashed your file, but a NFSL_LOST may mean that _you_ have trashed someone
 * else's file (if in fact you wrote the file that you locked after you lost
 * the lock) or that you read inconsistent information.
 *
 * In practice, a return code of NFSL_LOST or NFSL_STOLEN will virtually never
 * happen unless someone is violating the locking protocol.
 *
 */

int nfsunlock(path, namelock, max_age, birth)

char *path, *namelock;
int max_age;
time_t birth;

{
  int tmpfd;
  struct stat old_stat, our_tmp;
  char *tpath, *fullpath;

  srandom(getpid()+time(0));
  max_age = max_age ? max_age : DEFAULT_LOCKTIME;

  /*
   * 1. Build a temp file and stat that to get an idea of what the server
   *    thinks the current time is (our_tmp.st_ctime)..
   */

  tpath = malloc(strlen(path) + 25);	/* small slop factor- 23 s/b enough */
  if (tpath == NULL) return(NFSL_SYSF);
  sprintf(tpath,"%s/slock%d.%d", path, getpid(), random());

  tmpfd = open(tpath, O_CREAT|O_WRONLY|O_EXCL, S_IRUSR|S_IWUSR);

  if ((tmpfd < 0) || (write(tmpfd, "\0", 1) != 1)
		  || (fstat(tmpfd, &our_tmp) < 0)) {
    /* The open failed, or we can't write the file, or we can't stat it */
    close(tmpfd);
    unlink(tpath);
    free(tpath);
    return(NFSL_SYSF);
  }

  close(tmpfd);		/* We don't need this once we have our_tmp.st_ctime. */
  unlink(tpath);
  free(tpath);

  /*
   * 2. make fullpath, a buffer for the full pathname of the lock file
   */

  fullpath = malloc(strlen(path) + 1 + strlen(namelock) + 1);
  if (fullpath == NULL)
    return(NFSL_SYSF);
  sprintf(fullpath,"%s/%s", path, namelock);

  /*
   * 3. If the ctime hasn't been modified, unlink the file and return. If the
   *    lock has expired, sleep the usual random interval before returning.
   *    If we didn't sleep, there could be a race if the caller immediately
   *    tries to relock the file.
   */

  if ( !lstat(fullpath, &old_stat) &&	/* stat succeeds so file is there */
      (old_stat.st_ctime == birth)) {	/* hasn't been modified since birth */
    unlink(fullpath);			/* so the lock is ours to remove */
    if (our_tmp.st_ctime >= birth + max_age)	/* the lock has expired */
      sleep(1+(random() % 10));		/* so sleep a bit */
    free(fullpath);
    return(NFSL_OK);			/* success */
  }

  free(fullpath); 	/* we don't use fullpath anymore */

  /*
   * 4. Either ctime has been modified, or the entire lock file is missing.
   *    If the lock should still be ours, based on the ctime of the temp
   *    file, return with NFSL_STOLEN. If not, then our lock is expired and
   *    someone else has grabbed the file, so return NFSL_LOST.
   */

  if (our_tmp.st_ctime < birth + max_age)	/* lock was stolen */
    return(NFSL_STOLEN);

  return(NFSL_LOST);	/* The lock must have expired first. */
}


/*
 * function: nfslock_test
 * author: Alexis Rosen <alexis@panix.com>
 *
 * Test a lock to see if it's still valid.
 *
 * See the comments in nfsunlock() about lost and stolen locks.
 *
 * Args, return codes, and behavior are identical to nfsunlock except
 * that nfslock_test doesn't remove the lock. NFSL_OK means the lock is
 * good, NFLS_LOST and NFSL_STOLEN means it's bad, and NFSL_SYSF means
 * we couldn't tell due to system failure.
 *
 * The source for this routine is almost identical to nfsunlock(), but it's
 * coded separately to make things as clear as possible.
 */

int nfslock_test(path, namelock, max_age, birth)

char *path, *namelock;
int max_age;
time_t birth;

{
  int tmpfd;
  struct stat old_stat, our_tmp;
  char *tpath, *fullpath;

  srandom(getpid()+time(0));
  max_age = max_age ? max_age : DEFAULT_LOCKTIME;

  /*
   * 1. Build a temp file and stat that to get an idea of what the server
   *    thinks the current time is (our_tmp.st_ctime)..
   */

  tpath = malloc(strlen(path) + 25);	/* small slop factor- 23 s/b enough */
  if (tpath == NULL) return(NFSL_SYSF);
  sprintf(tpath,"%s/slock%d.%d", path, getpid(), random());

  tmpfd = open(tpath, O_CREAT|O_WRONLY|O_EXCL, S_IRUSR|S_IWUSR);

  if ((tmpfd < 0) || (write(tmpfd, "\0", 1) != 1)
		  || (fstat(tmpfd, &our_tmp) < 0)) {
    /* The open failed, or we can't write the file, or we can't stat it */
    close(tmpfd);
    unlink(tpath);
    free(tpath);
    return(NFSL_SYSF);
  }

  close(tmpfd);		/* We don't need this once we have our_tmp.st_ctime. */
  unlink(tpath);
  free(tpath);

  /*
   * 2. make fullpath, a buffer for the full pathname of the lock file
   */

  fullpath = malloc(strlen(path) + 1 + strlen(namelock) + 1);
  if (fullpath == NULL)
    return(NFSL_SYSF);
  sprintf(fullpath,"%s/%s", path, namelock);

  /*
   * 3. If the ctime hasn't been modified, check to see if the lock is
   *    expired. If expired, return NFSL_LOST. Otherwise, the lock is good,
   *    so return NFSL_OK.
   */

  if ( !lstat(fullpath, &old_stat) &&	/* stat succeeds so file is there */
      (old_stat.st_ctime == birth)) {	/* hasn't been modified since birth */
    free(fullpath);
    if (our_tmp.st_ctime >= birth + max_age)	/* the lock has expired */
      return(NFSL_LOST);
    else
      return(NFSL_OK);			/* success */
  }

  free(fullpath); 	/* we don't use fullpath anymore */

  /*
   * 4. Either ctime has been modified, or the entire lock file is missing.
   *    If the lock should still be ours, based on the ctime of the temp
   *    file, return with NFSL_STOLEN. If not, then our lock is expired and
   *    someone else has grabbed the file, so return NFSL_LOST.
   */

  if (our_tmp.st_ctime < birth + max_age)	/* lock was stolen */
    return(NFSL_STOLEN);

  return(NFSL_LOST);	/* The lock must have expired first. */
}


/*
 * function: nfslock_renew
 * author: Alexis Rosen <alexis@panix.com>
 *
 * Renew a lock file, if it's still valid, so that it's valid for longer.
 *
 * This routine uses nfslock_test() to see if the lock file is eligible for
 * renewal. If the lock is expired, lost, or stolen, it can't be renewed.
 * However, there's always the possibility of a race condition, so it cheats
 * just a little bit to be safe: It pretends that the lock will expire 5
 * seconds before it actually does. This prevents the problem where it checks
 * the status of the lock just before it expires, gets an OK, and revises
 * the lock just after it expires, just as another process forces the lock.
 * 5 seconds is probably generous but it's a small enough fraction of the
 * typical lock's lifespan that it shouldn't make any difference.
 *
 * args: See nfslock_test()
 *
 * Return codes: Exactly like nfslock_test() with one major exception: If the
 * lock is valid, this routine will return a new "birth" time for that lock.
 */

time_t nfslock_renew(path, namelock, max_age, birth)

char *path, *namelock;
int max_age;
time_t birth;

{
  int retcode;
  struct stat our_lock;
  char *fullpath;

  /* Test the lock. Give it 5 seconds less than the usual expiry period */
  retcode = nfslock_test(path, namelock, max_age-5, birth);
  if (retcode != NFSL_OK)	/* lock is bad */
    return(retcode);

  /* The lock is valid, so renew it */

  /* Make fullpath, the buffer for the full pathname of the lock file */
  fullpath = malloc(strlen(path) + 1 + strlen(namelock) + 1);
  if (fullpath == NULL)
    return(NFSL_SYSF);
  sprintf(fullpath,"%s/%s", path, namelock);

  if (utimes(fullpath, NULL) == -1) {	/* update the time on the lock */
    free(fullpath);
    return(NFSL_SYSF);
  }

  if (lstat(fullpath, &our_lock) < 0) {	/* stat should never fail */
    free(fullpath);
    return(NFSL_SYSF);	/* but if it does fail ... */
  }

  free(fullpath);
  return(our_lock.st_ctime);	/* all OK, return new birth second */
}


------------------------------------------------------------------------
#! /usr/local/bin/perl

$root="/usr/local/net_public/httpd/";

$| = 1;

$lock_file_dir = '$root/htdocs/userdirs/ilaine/test2';
$lock_duration = 30;
if (1) {
    exit 0 unless fork() == 0;
}
print "pid $$ starting\n";
$lock_time = `$root/usr/local/bin/nfslock $lock_file_dir database.lock $lock_duration 0`;
chop $lock_time;
if ($lock_time eq "-1") { print "pid $$ giving up\n"; exit 0; }
sleep 3;
$done = time;
$unlock = `$root/usr/local/bin/nfsunlock $lock_file_dir database.lock $lock_duration $lock_time`;
chop ($unlock);
print ("pid $$ lock_time $lock_time unlock $unlock at $done\n");
exit 0;