Subject: Re: letting userland issue FUA writes
To: Joachim Koenig-Baltes <>
From: Bill Studenmund <>
List: tech-kern
Date: 03/17/2006 14:43:28

On Fri, Mar 17, 2006 at 10:34:11PM +0100, Joachim Koenig-Baltes wrote:
> Thor Lancelot Simon wrote:
> >On Fri, Mar 17, 2006 at 09:43:04AM +0100, Joachim Koenig-Baltes wrote:
> >
> >It seems to me that pwritev() or something like it is exactly the
> >right interface for passing a raw write request down through userland
> >and the kernel to a disk.
> I agree that the pwritev is the right interface to handle
> it (and not fsync_range).
> But why do you want to restrict it to pwritev() and not provide it for
> writev(), pwrite() and write()?

My thought was to have one call passed an iovec & flags and another passed
a pointer, length, and flags. Oh, both would also be passed an offset. Thus
we can easily check parameters per call prototype.

Don't include the FUA flag (whatever it will be), and the above calls act
like pwrite{,v}().
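
As a rough illustration of the two-call shape described above, here is a
userland sketch. The names xpwrite()/xpwritev() and the XW_FUA bit are
invented for this example; nothing in the thread settles on names or
values. A real implementation would hand the flag down to the driver, and
fsync() is used here only as a much coarser stand-in for that.

```c
#define _DEFAULT_SOURCE	/* glibc hides pwritev() behind this; harmless elsewhere */
#include <sys/types.h>
#include <sys/uio.h>
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical flag bit; the name and value are assumptions. */
#define XW_FUA		0x1u
#define XW_VALID	(XW_FUA)

/* Shared parameter check: reject flag bits we do not know about. */
static int
xw_check(uint32_t flags)
{
	if (flags & ~XW_VALID) {
		errno = EINVAL;
		return -1;
	}
	return 0;
}

/* Pointer + length + offset + flags. With flags == 0 this is pwrite(). */
ssize_t
xpwrite(int d, const void *buf, size_t nbytes, off_t offset, uint32_t flags)
{
	ssize_t n;

	if (xw_check(flags) == -1)
		return -1;
	n = pwrite(d, buf, nbytes, offset);
	/* Stand-in only: real FUA would be issued with the write itself. */
	if (n >= 0 && (flags & XW_FUA) && fsync(d) == -1)
		return -1;
	return n;
}

/* iovec + offset + flags. With flags == 0 this is pwritev(). */
ssize_t
xpwritev(int d, const struct iovec *iov, int iovcnt, off_t offset,
    uint32_t flags)
{
	ssize_t n;

	if (xw_check(flags) == -1)
		return -1;
	n = pwritev(d, iov, iovcnt, offset);
	if (n >= 0 && (flags & XW_FUA) && fsync(d) == -1)
		return -1;
	return n;
}
```

Because the flags value applies to the whole call, the check is a single
mask test per prototype, and the existing pwrite()/pwritev() signatures
stay untouched.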

> And if I understood Bill's requirement correctly, he wants to express
> it on a range by range basis, so it might be useful to specify the
> flags for each iovec item, so adding a flags field to struct iovec
> instead of an additional argument to pwritev seems more appropriate to
> me. And as the struct iovec members appear as individual arguments in
> write() and pwrite() we might then add a flags argument to those two
> too:

It would not be useful to specify it on a range by range basis within one
system call. Further, it could greatly complicate handling the call.

Under what circumstances will we want different flags for different iovecs
in the same call?

I think having different flags for different ranges in the same system
call is a bad idea (at least for FUA). What does it mean if there is an
iovec with FUA, one without, and then one with? Well, yes, we don't return
until the first and third ranges are complete. But it means that some layer
in the kernel will have to turn the single iovec array into a sequence of
operations. I do not like the concept of that. If you want separate ranges,
make separate calls.
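
To make the splitting concern concrete, here is a small sketch that counts
how many separate operations a per-iovec flag would force. It assumes a
layer that can only merge adjacent iovecs when their FUA bits match; the
IOV_FUA bit and count_write_ops() are invented for this illustration and
are not part of any proposal in the thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-iovec flag bit; name and value are assumptions. */
#define IOV_FUA	0x1u

/*
 * Count the device operations needed for n iovecs when adjacent entries
 * may only be merged if their FUA bits are equal. flags[i] holds the
 * flags of the i-th iovec.
 */
size_t
count_write_ops(const uint32_t *flags, size_t n)
{
	size_t i, ops = 0;

	for (i = 0; i < n; i++) {
		/* A new operation starts at entry 0 and at every FUA change. */
		if (i == 0 || ((flags[i] ^ flags[i - 1]) & IOV_FUA))
			ops++;
	}
	return ops;
}
```

Under that assumption, the FUA / non-FUA / FUA case from the paragraph
above needs three operations, while a call whose entries all carry the
same flag needs one.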

>      struct iovec {
>          void *iov_base;
>          size_t iov_len;
>          uint32_t flags;
>      };
>      ssize_t
>      write(int d, const void *buf, size_t nbytes, uint32_t flags);
>      ssize_t
>      pwrite(int d, const void *buf, size_t nbytes, off_t offset,
>          uint32_t flags);
>      ssize_t
>      writev(int d, const struct iovec *iov, int iovcnt);
>      ssize_t
>      pwritev(int d, const struct iovec *iov, int iovcnt, off_t offset);
> I'm not arguing we should do it as we would change an interface that
> is widely used in contrast to changing fsync_range.

Why are we limited to those two options? I like option 3), add a new call
or two.

I have never considered changing write(2), writev(2), pwrite(2), or
pwritev(2). :-) I think that would add WAY too much grief.

> pwrite() is a bit odd because the flags argument appears after offset
> and not after nbytes, but changing the "offset" argument position would
> be worse. Why was the offset not put into a "struct piovec" for pwritev,
> allowing for different chunks in the vector to be written to
> different positions in the file (and not contiguous) when the
> interface for pwritev was designed?

Because these calls are supposed to be fast.

Also, we have requirements on atomicity, which would become REALLY messy
if we had separate offsets per "piovec".

I know of a lot of places where lseek()/write() or lseek()/writev() happen,
and pwrite()/pwritev() are great for them. I don't know of that many
applications that need to have one operation write to multiple parts of a
file. In one call.
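
For reference, the seek-then-write pattern and its pwrite() replacement
look roughly like this. The two wrapper names are made up for the example;
lseek(), write(), and pwrite() are the standard calls.

```c
#include <sys/types.h>
#include <unistd.h>

/* Two system calls; moves the descriptor's file offset as a side effect. */
ssize_t
write_at_seek(int d, const void *buf, size_t nbytes, off_t offset)
{
	if (lseek(d, offset, SEEK_SET) == (off_t)-1)
		return -1;
	return write(d, buf, nbytes);
}

/* One system call; the descriptor's file offset is left where it was. */
ssize_t
write_at_pwrite(int d, const void *buf, size_t nbytes, off_t offset)
{
	return pwrite(d, buf, nbytes, offset);
}
```

Both forms write a single contiguous range, which is the common case the
paragraph above describes.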

If you want to experiment with such a thing, go for it. If we are cooking
up a new interface, though, I am only interested in what are basically
pwrite() and pwritev() calls (with different names!!) that also take a
flags value, where the flags value can include a flag to trigger FUA.

Take care,

