Subject: Re: [Fwd: Re: SOFTDEPS safe for qmail?]
To: Robert Elz <kre@munnari.OZ.AU>
From: Kirk McKusick <mckusick@mckusick.com>
List: current-users
Date: 06/17/2000 09:47:39
Ethan Solomita forwarded this message to me and asked that I comment.

	-------- Original Message --------
	From: Robert Elz <kre@munnari.OZ.AU>
	To: sommerfeld@orchard.arlington.ma.us
	Cc: current-users@netbsd.org
	Subject: Re: SOFTDEPS safe for qmail?
	Date: Sat, 17 Jun 2000 05:45:32 +1000

	    Date:        Fri, 16 Jun 2000 09:30:06 -0400
	    From:        Bill Sommerfeld <sommerfeld@orchard.arlington.ma.us>
	    Message-ID:  <200006161330.NAA21374@orchard.arlington.ma.us>

	  | What recent versions of sendmail do is:
	  | 
	  | 	write message to file
	  | 	fsync file
	  | 	rename file (to indicate that the file is a complete message)
	  | 	fsync file
	  | 
	  | If you add the second fsync to force the rename out to disk, you
	  | should be all set..

	Is that really guaranteed?   rename() is an operation on the directory,
	not the file - the only operation it performs on the file (the inode
	of the file) is to update the inode changed time (and that really only
	for historical reasons). I can't think of any particular reason that a
	filesystem which is attempting to maximise efficiency, while retaining
	internal consistency, would care much when the inode change time update
	was done with respect to the directory changes that are going on.  If
	the inode is flushed before the rename finishes (before the updated
	directory is flushed) then the problem would still be there.

	If the rename was being done by a link/unlink combination, which was
	actually altering the link count in the inode, then I'd tend to trust
	it more (as the inode count can't be decremented after the unlink
	until after the directory has actually been updated).

	That being said, I'd be a little surprised if Kirk hadn't considered
	the needs of sendmail in the design of all of this...

	kre

The fsync system call guarantees that the contents of a file and any
and all names that reference it as well as all directories above those
names to the root of the filesystem have been sync'ed to disk. The
rename system call creates a dependency describing the new name of the
inode. When the file is fsync'ed the second time in the example above,
that creation dependency will be cleared, ensuring that the name exists
on the disk. If you are replacing an existing file, the directory entry
with that name does not itself change on disk; only the inode number
associated with the entry is updated. Thus, the second fsync above
really just makes sure that the inode number update has happened on
the disk. The reason to do two fsync's is to ensure that you do not end
up with the name pointing to a partially written file, which could
happen if you did only the second of the two fsync's above. The
ordering of the writes during the fsync is not guaranteed, thus the
name might be written before all the data blocks have finished. In my
soft updates implementation, the
file data is always finished before the name is committed, so only
the second fsync is really needed. However, I do not want to force all
future async implementors to make that promise, so I tell folks to
do the two fsync's as described above. That should ensure correct
semantics for any properly written async scheme.
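
The write, fsync, rename, fsync sequence discussed above can be
sketched as follows. This is only an illustration in Python, assuming
the fsync semantics described in this message; the function name
deliver_message and the two path arguments are hypothetical and do not
come from sendmail or qmail.

```python
import os


def deliver_message(data: bytes, tmp_path: str, final_path: str) -> None:
    """Hypothetical helper: write, fsync, rename, fsync."""
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        # Step 1: write the message to the temporary file.
        # os.write may write fewer bytes than asked, so loop until done.
        offset = 0
        while offset < len(data):
            offset += os.write(fd, data[offset:])

        # Step 2: first fsync, so the file contents are on disk
        # before the final name can refer to them.
        os.fsync(fd)

        # Step 3: rename to mark the file as a complete message.
        os.rename(tmp_path, final_path)

        # Step 4: second fsync on the same open file, relying on the
        # semantics described above to commit the new name.
        os.fsync(fd)
    finally:
        os.close(fd)
```

The descriptor is kept open across the rename so that the second fsync
applies to the same file under its new name.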

	Kirk McKusick