Subject: Re: Extension of fsync_range() to permit forcing disk cache flushing
To: None <tech-kern@NetBSD.org>
From: der Mouse <mouse@Rodents.Montreal.QC.CA>
List: tech-kern
Date: 12/18/2004 00:56:01
>> [...] fsync()'s contract [...]

> What do you consider to be the "contract" in this case?  The man page
> alone, standards conformance, POLA?

All of those, perhaps with an emphasis on the manpage (because that's
where I'd expect our particular implementation to be documented).

The 2.0 manpage says that "fsync() causes all modified data and
attributes of fd to be moved to a permanent storage device".  The 1.4T
manpage has the same wording.  As far as I can recall, 4.3's fsync
manpage was similar in this regard.

> For instance, SUSv3 states pretty clearly in the rationale section of
> fsync that if _POSIX_SYNCHRONIZED_IO is not defined, it's pretty
> much up to the documentation to spell out what fsync can/cannot do.

Then from a standards conformance point of view, we can bring ours into
conformance by either fixing the code or documenting the actual
behaviour.

> It then goes on to spell out that an implementation such as ours
> (which can't guarantee absolutely due to caching) is conformant as
> long as we have a way to force it.

s/can't/isn't willing to/.  We _can_ defeat caches if we want to
(indeed, this whole discussion is about whether, when, and exactly how
to do so).

> i.e. The only bugs I really see in our implementation are not
> documenting fsync's non-guarantee if you're using a caching device
> and how to work around that (i.e. don't enable caching).

Strictly from a standards-conformance point of view, yes.

From the POV of conformance to traditional semantics (which is, I
suspect, largely what least astonishment amounts to in this case), we
really *ought* to push it clear to the platters, but can probably get
away with pushing it to the drive and adding a BUGS item noting that
the traditional semantics have long been silently broken and that they
can be restored, at a performance cost, by some action or other whose
nature is beyond the scope of this email. :-)

> Anyways, you'll still lose data on a power loss regardless of how
> many fsync calls you make just due to things like partially written
> sectors occurring.

Actually, I'm fairly sure I've heard of at least one disk drive that
uses the spindle motor as a generator to power the electronics long
enough to avoid problems like partially written sectors on power fail.

> Granted...people should remember that fsync only lives up to its
> "contract" on a successful return.  I doubt under most power loss
> scenarios it's returning...

The problem is not losing power during the fsync.  It's losing power
shortly afterwards.  For example,

	Write data A.  fsync().
	Write data B.
	Continue working...powerfail.

Now, if you believe the semantics of the traditional description, it is
not possible that, after power is restored, data B is on the platters
but data A isn't.  If you just write to drive cache, this expectation
can be violated if the drive happens to have gotten around to pushing B
but not A to the platters.
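
For concreteness, here is a minimal sketch in C of the "write A, fsync,
write B" sequence.  The helper names (write_all, write_a_then_b) and the
single-file layout are assumptions made for illustration; nothing here
is taken from any actual application.  The code only shows where the
ordering expectation sits.  Whether that expectation actually holds
depends on how far the fsync() pushes the data.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes and EINTR. */
static int
write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Hypothetical helper: write A, fsync, then write B to the same file.
 * Returns 0 on success, -1 on error.
 */
int
write_a_then_b(const char *path, const char *a, const char *b)
{
	int fd;

	assert(path != NULL && a != NULL && b != NULL);
	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;
	if (write_all(fd, a, strlen(a)) < 0 || fsync(fd) < 0) {
		close(fd);
		return -1;
	}
	/*
	 * Under the traditional wording, A is on permanent storage
	 * here, so B can never survive a power failure without A.
	 * If fsync() stopped at the drive cache, that is not assured.
	 */
	if (write_all(fd, b, strlen(b)) < 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}
```

A caller would use it as write_a_then_b("journal.dat", "data A\n",
"data B\n"); a successful return says nothing about B being durable,
only that A was fsync()ed before B was written.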

That is, fsync functions like a write barrier.  (This is not its only
function.  An application could also do "write data, fsync, ack over
the network" and lose power just after sending the ack - and thus end
up incorrectly acking data that did not in fact get written, if the
fsync doesn't push the data past the cache.)
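
The "write data, fsync, ack" pattern can be sketched the same way.
This is only an illustration: store_then_ack, write_fully, the
append-only spool file, and the one-byte ack are all hypothetical, and
peer_fd stands in for whatever network connection the application
really uses.  The point is that the ack is sent only after fsync()
returns successfully, which is exactly the step that is meaningless if
fsync() does not get the data past the cache.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#define ACK_BYTE 'K'	/* hypothetical one-byte acknowledgement */

/* Write the whole buffer, retrying on short writes and EINTR. */
static int
write_fully(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Hypothetical helper: append buf to the spool file at path, fsync it,
 * and only then send an ack on peer_fd.  Returns 0 if the ack was
 * sent, -1 otherwise (in which case no ack is sent).
 */
int
store_then_ack(const char *path, int peer_fd, const char *buf, size_t len)
{
	const char ack = ACK_BYTE;
	int fd;

	assert(path != NULL && buf != NULL);
	fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
	if (fd < 0)
		return -1;
	if (write_fully(fd, buf, len) < 0 || fsync(fd) < 0) {
		close(fd);
		return -1;	/* not known to be stored: do not ack */
	}
	if (close(fd) < 0)
		return -1;
	/* Data is as durable as fsync() makes it; now tell the peer. */
	return write_fully(peer_fd, &ack, 1);
}
```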

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B