Subject: Re: letting userland issue FUA writes
To: None <tls@rek.tjls.com>
From: Joachim Koenig-Baltes <joachim.koenig-baltes@emesgarten.de>
List: tech-kern
Date: 03/17/2006 22:34:11
Thor Lancelot Simon wrote:
> On Fri, Mar 17, 2006 at 09:43:04AM +0100, Joachim König-Baltes wrote:
> 
>> Or add a "const void *buf" argument to fsync_range, not as general as
>> with pwritev or adding syscalls, but perhaps less intrusive on the
>> interfaces.
> 
> I can't imagine why this would be "less intrusive" than adding an
> optional flags argument to pwritev.  What is *buf supposed to mean,
> here, "write this right now and treat it as if it had had fsync_range
> applied to it after the fact"?  That seems gross, and it's no smaller
> an interface change than the pwritev() one.
> 
> It seems to me that pwritev() or something like it is exactly the
> right interface for passing a raw write request down through userland
> and the kernel to a disk.

I agree that pwritev() is the right interface to handle
this (and not fsync_range).

But why do you want to restrict it to pwritev() and not provide it for
writev(), pwrite() and write()?

And if I understood Bill's requirement correctly, he wants to express
it on a range-by-range basis, so it might be useful to specify the
flags for each iovec item. Adding a flags field to struct iovec
instead of an additional argument to pwritev() therefore seems more
appropriate to me. And since the struct iovec members appear as
individual arguments in write() and pwrite(), we might then add a
flags argument to those two as well:

      struct iovec {
          void *iov_base;
          size_t iov_len;
          uint32_t flags;
      };

      ssize_t
      write(int d, const void *buf, size_t nbytes, uint32_t flags);

      ssize_t
      pwrite(int d, const void *buf, size_t nbytes, off_t offset,
          uint32_t flags);

      ssize_t
      writev(int d, const struct iovec *iov, int iovcnt);

      ssize_t
      pwritev(int d, const struct iovec *iov, int iovcnt, off_t offset);
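
To make the per-iovec idea concrete, here is a rough userland sketch.
It only emulates the semantics and is not a real FUA implementation:
fsync() stands in for "force this chunk to stable storage", which is
much heavier than a FUA write. The names struct iovec_f, IOVF_FUA and
pwritev_flags_emul() are made up for illustration and exist nowhere.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical flag bit: this chunk should reach stable storage. */
#define IOVF_FUA 0x1u

/* Hypothetical stand-in for a struct iovec that carries a flags field. */
struct iovec_f {
    const void *iov_base;
    size_t      iov_len;
    uint32_t    flags;
};

/*
 * Userland emulation only: write the chunks contiguously starting at
 * "offset" and call fsync() after every chunk marked IOVF_FUA.
 * Returns the number of bytes written, or -1 on error.
 */
ssize_t
pwritev_flags_emul(int d, const struct iovec_f *iov, int iovcnt, off_t offset)
{
    ssize_t total = 0;

    assert(iov != NULL || iovcnt == 0);
    for (int i = 0; i < iovcnt; i++) {
        const char *p = iov[i].iov_base;
        size_t left = iov[i].iov_len;

        /* pwrite() may write less than requested, so loop. */
        while (left > 0) {
            ssize_t n = pwrite(d, p, left, offset);
            if (n < 0)
                return -1;
            if (n == 0)
                return total;   /* no progress, give up */
            p += n;
            left -= (size_t)n;
            offset += n;
            total += n;
        }
        /* Assumption: fsync() approximates the per-chunk flag. */
        if ((iov[i].flags & IOVF_FUA) != 0 && fsync(d) == -1)
            return -1;
    }
    return total;
}
```

A caller would set IOVF_FUA only on the chunks that must be stable
(say, a log commit record) and leave the flags at 0 for the rest.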


I'm not arguing that we should do this, since it would change
interfaces that are widely used, in contrast to changing fsync_range.

pwrite() is a bit odd because the flags argument appears after offset
and not after nbytes, but changing the position of the "offset"
argument would be worse. When the interface for pwritev was designed,
why was the offset not put into a "struct piovec", which would allow
different chunks in the vector to be written to different
(non-contiguous) positions in the file?
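
To illustrate what I mean by a per-chunk offset, here is a rough
userland sketch. The layout of struct piovec and the function
pwritepv_emul() are my own invention for this mail, not an existing
interface; it just loops over pwrite() with each chunk's own offset.

```c
#define _XOPEN_SOURCE 700
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical "positioned" iovec: every chunk has its own file offset. */
struct piovec {
    const void *piov_base;
    size_t      piov_len;
    off_t       piov_offset;
};

/*
 * Userland emulation only: write each chunk at its own offset.
 * Returns the number of bytes written, or -1 on error.
 */
ssize_t
pwritepv_emul(int d, const struct piovec *piov, int piovcnt)
{
    ssize_t total = 0;

    for (int i = 0; i < piovcnt; i++) {
        const char *p = piov[i].piov_base;
        size_t left = piov[i].piov_len;
        off_t off = piov[i].piov_offset;

        /* pwrite() may write less than requested, so loop. */
        while (left > 0) {
            ssize_t n = pwrite(d, p, left, off);
            if (n < 0)
                return -1;
            if (n == 0)
                return total;   /* no progress, give up */
            p += n;
            left -= (size_t)n;
            off += n;
            total += n;
        }
    }
    return total;
}
```

Unlike a real syscall, this loop gives no atomicity across chunks; it
only shows the shape such an interface could have.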

Joachim