Subject: Re: letting userland issue FUA writes
To: Bill Studenmund <wrstuden@netbsd.org>
From: Daniel Carosone <dan@geek.com.au>
List: tech-kern
Date: 03/16/2006 11:53:36

On Wed, Mar 15, 2006 at 07:28:49AM -0800, Bill Studenmund wrote:
> I'm not sure about O_DIRECT implying FUA.

I think they're distinct, as later discussion also seems to have
confirmed.

It's all about where caching is allowed to happen. If directio
prevents the kernel caching the disk data, it has to write it to disk
straight away (rather than defer the write). This isn't the same as
FUA - the disk may still be allowed to cache the write before it hits
the platters, or reorder the write with other outstanding requests, to
avoid undue performance impact.

> While I can think of apps that want that, I am not sure if there
> are or aren't apps that will want O_DIRECT but not care about FUA.

Database servers that implement their own, smarter caching based on
higher-level structural knowledge would use directio, to avoid wasting
kernel memory double-caching disk pages.  Some realtime databases
might not bother at all about transactionality, but go all out for
speed and efficient resource utilisation.

Other database servers might want FUA set on specific writes for
transactional boundaries, but unset on other writes between boundaries
for performance.  That's not quite the same thing as not caring about
FUA, but the result is the same: one can't imply the other.

Just for general background, can we please clarify/confirm some of the
semantics involved?

 - FUA applies only to the specific write carrying that flag, and
   says nothing about whether other unordered writes that may be
   sitting in the disk's write-cache have to hit the platters?  i.e., it
   is distinct from implying a cache flush around the write, and
   distinct from ordered vs unordered writes. It's up to the host and
   filesystem and/or application to worry about structural
   dependencies and transactional boundaries using these tools.

 - O_*SYNC applies to all IO done with the file descriptor, and
   basically does imply an fsync() or at least fdatasync(), even for
   data written by other processes with the same file open.

If these interpretations are correct, it's not a good mapping. O_*SYNC
and fsync() may need to imply FUA and/or a cache flush to meet their
"stable storage" obligations, but that's not the present discussion.
What you're after is the ability to expose more selective, per-write
semantics that map closer to the low-level interface - no surprise
since you're trying to implement that interface again via iSCSI.

A better mapping, and a better interface for implementing what you
want, and what some of those other databasey applications want, would
be async io rather than directio.  You can clearly go some way towards
emulating ordered vs unordered commands via the kernel cache with
write and fsync (if you want).  Using directio is almost a step
backwards; you might need to reimplement caching at userlevel to
maintain some of this functionality. However, you can't do much to
optimally service multiple unordered reads without io concurrency at
the syscall interface.
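
To make the write-and-fsync emulation concrete, here is a minimal
sketch in C. It assumes fsync() acts as the ordering point between a
batch of unordered data writes and a following commit record; whether
fsync() also flushes the disk's own write cache depends on the system.
The names open_store, write_all and commit_ordered are hypothetical
helpers made up for this example.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* for pwrite()/pread() under strict modes */
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open (or create) the backing file read/write. */
static int open_store(const char *path)
{
    return open(path, O_RDWR | O_CREAT, 0600);
}

/* Write all of buf at offset off; returns 0 on success, -1 on error. */
static int write_all(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n <= 0)
            return -1;
        p += n;
        off += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * The data may be cached and reordered freely; the commit record is
 * only issued once fsync() has returned for the data, and is then
 * itself pushed out with a second fsync().
 */
static int commit_ordered(int fd,
                          const void *data, size_t dlen, off_t doff,
                          const void *rec, size_t rlen, off_t roff)
{
    assert(fd >= 0);
    if (write_all(fd, data, dlen, doff) < 0)
        return -1;
    if (fsync(fd) < 0)          /* barrier: data before commit record */
        return -1;
    if (write_all(fd, rec, rlen, roff) < 0)
        return -1;
    return fsync(fd);           /* commit record reaches stable storage */
}
```

Each fsync() here blocks the caller and covers every dirty page of the
file, which is much coarser than a per-write flag would be.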

An async io syscall interface is what you want anyway, the closest
mapping to TCQ and the other device properties you're trying to
emulate/expose, and (I suspect) the place where a per-IO FUA-style
flag best fits.  Especially if FUA implies nothing about write
reordering, as I assume above.
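
As a rough illustration of where such a flag could live, here is a
small userland model in C of a request queue in which every request
carries its own flags. Nothing here is an existing NetBSD or POSIX
interface: struct io_req, REQ_FUA, queue_submit and queue_count_fua
are all names invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REQ_WRITE 0x1u
#define REQ_FUA   0x2u          /* hypothetical per-request FUA flag */

#define QUEUE_DEPTH 8

/* One queued request; the flags travel with the request itself. */
struct io_req {
    uint64_t    offset;
    const void *buf;
    size_t      len;
    unsigned    flags;
};

/* Fixed-depth queue standing in for outstanding tagged commands. */
struct io_queue {
    struct io_req slots[QUEUE_DEPTH];
    size_t        count;
};

/* Queue a request; returns 0 on success, -1 if the queue is full. */
static int queue_submit(struct io_queue *q, struct io_req r)
{
    assert(q != NULL);
    if (q->count == QUEUE_DEPTH)
        return -1;
    q->slots[q->count++] = r;
    return 0;
}

/* Count the queued writes that asked for FUA. */
static size_t queue_count_fua(const struct io_queue *q)
{
    size_t i, n = 0;

    for (i = 0; i < q->count; i++)
        if ((q->slots[i].flags & (REQ_WRITE | REQ_FUA)) ==
            (REQ_WRITE | REQ_FUA))
            n++;
    return n;
}
```

Because the flag is part of the request, it says nothing about the
requests queued around it, which matches the assumption that FUA does
not imply ordering.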

I'm assuming directio is about caching and locking coherency in/with
UVM, and isn't an async interface to userland. I'm guessing directio
internals might somehow eventually help in implementing async io as well,
but otherwise I'm not sure how the two really relate.

I know you know all this already, and that you're trying to implement
on what we have now at least until something better comes along.
About the best (least disruptive) thing I can think of with the
present interface is an ioctl() that says, one shot at a time, that
the *next* write() needs FUA. The ioctl would set a flag, write would
see the flag, set whatever's necessary to trigger FUA to the lower
layers, and clear the flag again.  The ugliness of this (especially
for threaded apps, which might wind up setting FUA for the wrong write
if they're not careful) says something about a fundamental impedance
mismatch that goes much further than just finding a place to
communicate a desire for FUA semantics.
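
For what it's worth, the one-shot flag can be modelled in a few lines
of C. This is a userland model only, not kernel code: struct fd_model,
model_set_fua_next and model_write are hypothetical stand-ins for the
per-descriptor state, the imagined ioctl() and write() respectively.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-descriptor state. */
struct fd_model {
    bool   fua_next;    /* one-shot flag set by the imagined ioctl */
    size_t writes;      /* total writes issued */
    size_t fua_writes;  /* writes passed down with FUA requested */
};

/* Stands in for the ioctl: mark the *next* write as needing FUA. */
static void model_set_fua_next(struct fd_model *f)
{
    assert(f != NULL);
    f->fua_next = true;
}

/*
 * Stands in for write(): consume the flag, clear it, and report
 * whether this particular write would have gone down with FUA.
 */
static bool model_write(struct fd_model *f, const void *buf, size_t len)
{
    bool fua;

    assert(f != NULL);
    (void)buf;
    (void)len;
    fua = f->fua_next;
    f->fua_next = false;
    f->writes++;
    if (fua)
        f->fua_writes++;
    return fua;
}
```

The hazard is visible directly in the model: if thread A calls
model_set_fua_next() and thread B's model_write() runs before A's,
B's write takes the flag and A's write goes out without it.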

--
Dan.