tech-kern archive

Re: Proposal: B_ARRIER (addresses wapbl performance?)



On Sun, Nov 02, 2008 at 01:50:25PM -0500, Thor Lancelot Simon wrote:
> On Sun, Nov 02, 2008 at 07:57:37AM -0800, Bill Stouder-Studenmund wrote:
> > 
> > So Thor, why are you so entrenched in this? If you're going to add a bit, 
> > just add FUA. It does exactly what you want. It was designed to do what 
> > you want.
> 
> We can't enqueue new commands while waiting for a cache flush to complete.

For SCSI systems, SBC does not indicate that this is a requirement. We 
should be able to continue enqueueing new commands while the cache is 
flushing.

> We *can* enqueue new commands while waiting for an ordered tag (and all
> its prior simple tags) to complete.  This seems like a huge advantage to
> me.  I also see other uses for disk write barriers beyond just solving
> WAPBL's immediate problem.

It depends. How much i/o are we waiting for and how much is blocked 
waiting? 

> The other thing is that if you look at other journalled filesystems
> (particularly XFS) they manage to use write barriers to not do every
> journal write synchronously (okay, pseudo-synchronously).  Given that we
> don't support an external journal this seems particularly interesting to
> me.

You're still using the wrong tool for things.

Another thing to look at would be to see how drives have supported FUA 
over time. Both tagged queuing and FUA came in with SCSI-2. All the tags 
are, as best I can tell from reading the original spec, required if you 
support tagged queuing. They were optional if you didn't support tagged 
queuing. FUA (and DPO) were also optional. But I think tagged queuing was 
supported sooner since it was a notable performance improvement.

The point I'm circling around is that things like XFS may well have been 
written at a time when tagged queuing was much more likely to be 
supported (and correctly supported) than FUA.

My experience as a target maker was that Windows uses FUA but we never saw 
ORDERED tags out of it.

> Also, it is clear to me how to implement B_BARRIER -- I know exactly what
> to modify and how, except for the disk sorting code, which I'm reasonably
> confident I can figure out.  I do not know how to implement FUA for each
> kind of disk in the system -- if you do, by all means, be my guest!

This I understand!

To be honest, I think the things you're describing are what we should do 
if FUA doesn't work on a drive.

I'm attaching diffs (from 2006, sorry!) that implement FUA and also the 
writex() command I talked about back at that time. Sadly, there also are 
some conflicts in the code. But they should be easy to sort out.
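
For readers skimming the diff, the sd.c part boils down to two things: set 
the FUA bits in byte 1 of the CDB, and never pick the 6-byte CDB for a FUA 
request, since it has no FUA bit. Below is a minimal standalone sketch of 
that logic, not the driver code itself. build_write_cdb() and the REQ_* 
flags are hypothetical names invented for illustration (they stand in for 
sdstart() and B_FUA/B_FUA_NV); the bit and opcode values are the ones I 
understand SBC to define. The 16-byte case is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bits in byte 1 of the READ/WRITE (10) and (16) CDBs, per SBC. */
#define CDB_FUA		0x08
#define CDB_FUA_NV	0x02

#define OP_WRITE_6	0x0a
#define OP_WRITE_10	0x2a

/* Hypothetical request flags standing in for B_FUA / B_FUA_NV. */
#define REQ_FUA		0x1
#define REQ_FUA_NV	0x2

/*
 * Fill in the smallest WRITE CDB that can carry the request and
 * return its length in bytes (6 or 10).  cdb must hold 10 bytes.
 */
size_t
build_write_cdb(uint8_t *cdb, uint32_t lba, uint16_t nblks, int flags)
{
	assert(cdb != NULL);
	memset(cdb, 0, 10);

	/*
	 * WRITE(6) has a 21-bit LBA, an 8-bit count (where 0 means 256)
	 * and no FUA bit, so any FUA request must use a bigger CDB.
	 */
	if ((lba & 0x1fffff) == lba && nblks >= 1 && nblks <= 0xff &&
	    (flags & (REQ_FUA | REQ_FUA_NV)) == 0) {
		cdb[0] = OP_WRITE_6;
		cdb[1] = (uint8_t)((lba >> 16) & 0x1f);
		cdb[2] = (uint8_t)((lba >> 8) & 0xff);
		cdb[3] = (uint8_t)(lba & 0xff);
		cdb[4] = (uint8_t)nblks;
		return 6;
	}

	/* WRITE(10): 32-bit big-endian LBA, 16-bit big-endian count. */
	cdb[0] = OP_WRITE_10;
	if (flags & REQ_FUA)
		cdb[1] |= CDB_FUA;
	if (flags & REQ_FUA_NV)
		cdb[1] |= CDB_FUA_NV;
	cdb[2] = (uint8_t)(lba >> 24);
	cdb[3] = (uint8_t)(lba >> 16);
	cdb[4] = (uint8_t)(lba >> 8);
	cdb[5] = (uint8_t)lba;
	cdb[7] = (uint8_t)(nblks >> 8);
	cdb[8] = (uint8_t)nblks;
	return 10;
}
```

A small write at a low block number normally fits in the 6-byte form; the 
same write with REQ_FUA set is pushed to the 10-byte form with bit 0x08 set 
in byte 1.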

So my suggestion is to do this and then, if a device doesn't support FUA, 
fall back to either the synchronize cache or the B_BARRIER approach 
you're describing.
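
The fallback order suggested above can be sketched as a small selection 
function. This is only an illustration under stated assumptions: struct 
dev_caps, enum write_strategy and pick_write_strategy() are hypothetical 
names that appear nowhere in the attached diff, and preferring an ordered 
tag over a full cache flush when both are available is my assumption, not 
something the thread settles.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* How a journal write gets forced to stable storage. */
enum write_strategy {
	WS_FUA,			/* set FUA on the write itself */
	WS_ORDERED_TAG,		/* B_BARRIER-style ordered tag */
	WS_SYNC_CACHE		/* write, then SYNCHRONIZE CACHE */
};

/* Hypothetical per-device capability summary. */
struct dev_caps {
	bool can_fua;		/* cf. V_CAP_CAN_FUA in the diff */
	bool can_ordered_tags;	/* tagged queuing with ORDERED tags */
};

/*
 * Prefer FUA when the device honors it; otherwise fall back.
 * The relative order of the two fallbacks is an assumption.
 */
enum write_strategy
pick_write_strategy(const struct dev_caps *caps)
{
	assert(caps != NULL);

	if (caps->can_fua)
		return WS_FUA;
	if (caps->can_ordered_tags)
		return WS_ORDERED_TAG;
	return WS_SYNC_CACHE;
}
```

In the attached diff the capability would come from DIOCGCAPS at open 
time, so a check like this costs nothing per write.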

Take care,

Bill
? .BUFPATHS.swp
? BUFPATHS
? TODO
? TheTools
? cvslog
? diffie.20060928
? lib/libc/sys/pwritex.c
? sys/arch/i386/compile/GENERIC
Index: sys/dev/ld.c
===================================================================
RCS file: /cvsroot/src/sys/dev/ld.c,v
retrieving revision 1.40
diff -p -u -r1.40 ld.c
--- sys/dev/ld.c        28 Mar 2006 17:38:29 -0000      1.40
+++ sys/dev/ld.c        1 Nov 2006 23:18:38 -0000
@@ -66,6 +66,7 @@ __KERNEL_RCSID(0, "$NetBSD: ld.c,v 1.40 
 #if NRND > 0
 #include <sys/rnd.h>
 #endif
+#include <miscfs/specfs/specdev.h>
 
 #include <dev/ldvar.h>
 
@@ -361,16 +362,28 @@ ldclose(dev_t dev, int flags, int fmt, s
 static int
 ldread(dev_t dev, struct uio *uio, int ioflag)
 {
+       int     f = B_READ;
 
-       return (physio(ldstrategy, NULL, dev, B_READ, ldminphys, uio));
+       if (ioflag & IO_FUA)
+               f |= B_FUA;
+       if (ioflag & IO_FUA_NV)
+               f |= B_FUA_NV;
+
+       return (physio(ldstrategy, NULL, dev, f, ldminphys, uio));
 }
 
 /* ARGSUSED */
 static int
 ldwrite(dev_t dev, struct uio *uio, int ioflag)
 {
+       int     f = B_WRITE;
+
+       if (ioflag & IO_FUA)
+               f |= B_FUA;
+       if (ioflag & IO_FUA_NV)
+               f |= B_FUA_NV;
 
-       return (physio(ldstrategy, NULL, dev, B_WRITE, ldminphys, uio));
+       return (physio(ldstrategy, NULL, dev, f, ldminphys, uio));
 }
 
 /* ARGSUSED */
@@ -523,6 +536,11 @@ ldioctl(dev_t dev, u_long cmd, caddr_t a
                return (dkwedge_list(&sc->sc_dk, dkwl, l));
            }
 
+       case DIOCGCAPS:
+               if (sc->sc_flags & LDF_CAN_FUA)
+                       *(int *) addr = V_CAP_CAN_FUA;
+               break;
+
        default:
                error = ENOTTY;
                break;
Index: sys/dev/ldvar.h
===================================================================
RCS file: /cvsroot/src/sys/dev/ldvar.h,v
retrieving revision 1.11
diff -p -u -r1.11 ldvar.h
--- sys/dev/ldvar.h     11 Dec 2005 12:20:53 -0000      1.11
+++ sys/dev/ldvar.h     1 Nov 2006 23:18:38 -0000
@@ -72,6 +72,7 @@ struct ld_softc {
 #define        LDF_DETACH      0x040           /* detach pending */
 #define        LDF_KLABEL      0x080           /* keep label on close */
 #define        LDF_VLABEL      0x100           /* label is valid */
+#define        LDF_CAN_FUA     0x200           /* Device handles FUA */
 
 int    ldadjqparam(struct ld_softc *, int);
 void   ldattach(struct ld_softc *);
Index: sys/dev/scsipi/sd.c
===================================================================
RCS file: /cvsroot/src/sys/dev/scsipi/sd.c,v
retrieving revision 1.250
diff -p -u -r1.250 sd.c
--- sys/dev/scsipi/sd.c 14 Sep 2006 17:54:34 -0000      1.250
+++ sys/dev/scsipi/sd.c 1 Nov 2006 23:18:42 -0000
@@ -80,6 +80,7 @@ __KERNEL_RCSID(0, "$NetBSD: sd.c,v 1.250
 #if NRND > 0
 #include <sys/rnd.h>
 #endif
+#include <miscfs/specfs/specdev.h>
 
 #include <dev/scsipi/scsi_spc.h>
 #include <dev/scsipi/scsipi_all.h>
@@ -828,11 +829,13 @@ sdstart(struct scsipi_periph *periph)
 
                /*
                 * Fill out the scsi command.  Use the smallest CDB possible
-                * (6-byte, 10-byte, or 16-byte).
+                * (6-byte, 10-byte, or 16-byte). FUA commands have to
+                * use 10-byte or 16-byte.
                 */
                if (((bp->b_rawblkno & 0x1fffff) == bp->b_rawblkno) &&
                    ((nblks & 0xff) == nblks) &&
-                   !(periph->periph_quirks & PQUIRK_ONLYBIG)) {
+                   !(periph->periph_quirks & PQUIRK_ONLYBIG) &&
+                   !(bp->b_flags & (B_FUA | B_FUA_NV))) {
                        /* 6-byte CDB */
                        memset(&cmd_small, 0, sizeof(cmd_small));
                        cmd_small.opcode = (bp->b_flags & B_READ) ?
@@ -846,6 +849,10 @@ sdstart(struct scsipi_periph *periph)
                        memset(&cmd_big, 0, sizeof(cmd_big));
                        cmd_big.opcode = (bp->b_flags & B_READ) ?
                            READ_10 : WRITE_10;
+                       if (bp->b_flags & B_FUA)
+                               cmd_big.byte2 |= SRWB_FUA;
+                       if (bp->b_flags & B_FUA_NV)
+                               cmd_big.byte2 |= SRWB_FUA_NV;
                        _lto4b(bp->b_rawblkno, cmd_big.addr);
                        _lto2b(nblks, cmd_big.length);
                        cmdlen = sizeof(cmd_big);
@@ -855,6 +862,10 @@ sdstart(struct scsipi_periph *periph)
                        memset(&cmd16, 0, sizeof(cmd16));
                        cmd16.opcode = (bp->b_flags & B_READ) ?
                            READ_16 : WRITE_16;
+                       if (bp->b_flags & B_FUA)
+                               cmd16.byte2 |= SRWB_FUA;
+                       if (bp->b_flags & B_FUA_NV)
+                               cmd16.byte2 |= SRWB_FUA_NV;
                        _lto8b(bp->b_rawblkno, cmd16.addr);
                        _lto4b(nblks, cmd16.length);
                        cmdlen = sizeof(cmd16);
@@ -983,15 +994,27 @@ sdminphys(struct buf *bp)
 static int
 sdread(dev_t dev, struct uio *uio, int ioflag)
 {
+       int     f = B_READ;
 
-       return (physio(sdstrategy, NULL, dev, B_READ, sdminphys, uio));
+       if (ioflag & IO_FUA)
+               f |= B_FUA;
+       if (ioflag & IO_FUA_NV)
+               f |= B_FUA_NV;
+
+       return (physio(sdstrategy, NULL, dev, f, sdminphys, uio));
 }
 
 static int
 sdwrite(dev_t dev, struct uio *uio, int ioflag)
 {
+       int     f = B_WRITE;
+
+       if (ioflag & IO_FUA)
+               f |= B_FUA;
+       if (ioflag & IO_FUA_NV)
+               f |= B_FUA_NV;
 
-       return (physio(sdstrategy, NULL, dev, B_WRITE, sdminphys, uio));
+       return (physio(sdstrategy, NULL, dev, f, sdminphys, uio));
 }
 
 /*
@@ -1239,6 +1262,11 @@ bad:
                return (dkwedge_list(&sd->sc_dk, dkwl, l));
            }
 
+       case DIOCGCAPS:
+               /* Report capabilities */
+               *(int *) addr = V_CAP_CAN_FUA;
+               return 0;
+
        default:
                if (part != RAW_PART)
                        return (ENOTTY);
Index: sys/kern/init_sysent.c
===================================================================
RCS file: /cvsroot/src/sys/kern/init_sysent.c,v
retrieving revision 1.181
diff -p -u -r1.181 init_sysent.c
--- sys/kern/init_sysent.c      1 Sep 2006 21:04:45 -0000       1.181
+++ sys/kern/init_sysent.c      1 Nov 2006 23:18:45 -0000
@@ -1,4 +1,4 @@
-/* $NetBSD: init_sysent.c,v 1.181 2006/09/01 21:04:45 matt Exp $ */
+/* $NetBSD$ */
 
 /*
  * System call switch table.
@@ -8,7 +8,7 @@
  */
 
 #include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: init_sysent.c,v 1.181 2006/09/01 21:04:45 matt Exp $");
+__KERNEL_RCSID(0, "$NetBSD$");
 
 #include "opt_ktrace.h"
 #include "opt_nfsserver.h"
@@ -1048,10 +1048,10 @@ struct sysent sysent[] = {
            sys___fhstatvfs140 },               /* 397 = __fhstatvfs140 */
        { 3, s(struct sys___fhstat40_args), 0,
            sys___fhstat40 },                   /* 398 = __fhstat40 */
-       { 0, 0, 0,
-           sys_nosys },                        /* 399 = filler */
-       { 0, 0, 0,
-           sys_nosys },                        /* 400 = filler */
+       { 5, s(struct sys_pwritex_args), 0,
+           sys_pwritex },                      /* 399 = pwritex */
+       { 5, s(struct sys_preadx_args), 0,
+           sys_preadx },                       /* 400 = preadx */
        { 0, 0, 0,
            sys_nosys },                        /* 401 = filler */
        { 0, 0, 0,
Index: sys/kern/kern_physio.c
===================================================================
RCS file: /cvsroot/src/sys/kern/kern_physio.c,v
retrieving revision 1.74
diff -p -u -r1.74 kern_physio.c
--- sys/kern/kern_physio.c      5 Oct 2006 14:48:32 -0000       1.74
+++ sys/kern/kern_physio.c      1 Nov 2006 23:18:45 -0000
@@ -287,7 +287,7 @@ physio(void (*strategy)(struct buf *), s
        DPRINTF(("%s: called: off=%" PRIu64 ", resid=%zu\n",
            __func__, uio->uio_offset, uio->uio_resid));
 
-       flags &= B_READ | B_WRITE;
+       flags &= B_READ | B_WRITE | B_FUA | B_FUA_NV;
 
        /* Make sure we have a buffer, creating one if necessary. */
        if (obp != NULL) {
Index: sys/kern/syscalls.c
===================================================================
RCS file: /cvsroot/src/sys/kern/syscalls.c,v
retrieving revision 1.177
diff -p -u -r1.177 syscalls.c
--- sys/kern/syscalls.c 1 Sep 2006 22:23:18 -0000       1.177
+++ sys/kern/syscalls.c 1 Nov 2006 23:18:45 -0000
@@ -1,4 +1,4 @@
-/* $NetBSD: syscalls.c,v 1.177 2006/09/01 22:23:18 matt Exp $ */
+/* $NetBSD$ */
 
 /*
  * System call names.
@@ -8,7 +8,7 @@
  */
 
 #include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: syscalls.c,v 1.177 2006/09/01 22:23:18 matt Exp $");
+__KERNEL_RCSID(0, "$NetBSD$");
 
 #if defined(_KERNEL_OPT)
 #include "opt_ktrace.h"
@@ -534,4 +534,6 @@ const char *const syscallnames[] = {
        "__fhopen40",                   /* 396 = __fhopen40 */
        "__fhstatvfs140",                       /* 397 = __fhstatvfs140 */
        "__fhstat40",                   /* 398 = __fhstat40 */
+       "pwritex",                      /* 399 = pwritex */
+       "preadx",                       /* 400 = preadx */
 };
Index: sys/kern/syscalls.master
===================================================================
RCS file: /cvsroot/src/sys/kern/syscalls.master,v
retrieving revision 1.159
diff -p -u -r1.159 syscalls.master
--- sys/kern/syscalls.master    1 Sep 2006 20:58:18 -0000       1.159
+++ sys/kern/syscalls.master    1 Nov 2006 23:18:46 -0000
@@ -797,3 +797,9 @@
                            size_t fh_size, struct statvfs *buf, int flags); }
 398    STD             { int sys___fhstat40(const void *fhp, \
                            size_t fh_size, struct stat *sb); }
+399    STD             { ssize_t sys_pwritex(int fd, int flags, \
+                           const struct iovec *iovp, int iovcnt, \
+                           off_t offset); }
+400    STD             { ssize_t sys_preadx(int fd, int flags, \
+                           const struct iovec *iovp, int iovcnt, \
+                           off_t offset); }
Index: sys/kern/vfs_bio.c
===================================================================
RCS file: /cvsroot/src/sys/kern/vfs_bio.c,v
retrieving revision 1.163
diff -p -u -r1.163 vfs_bio.c
--- sys/kern/vfs_bio.c  10 Sep 2006 06:35:42 -0000      1.163
+++ sys/kern/vfs_bio.c  1 Nov 2006 23:18:47 -0000
@@ -706,11 +706,13 @@ bwrite(struct buf *bp)
         * Remember buffer type, to switch on it later.  If the write was
         * synchronous, but the file system was mounted with MNT_ASYNC,
         * convert it to a delayed write.
+        * FUA writes remain synchronous.
         * XXX note that this relies on delayed tape writes being converted
         * to async, not sync writes (which is safe, but ugly).
         */
        sync = !ISSET(bp->b_flags, B_ASYNC);
-       if (sync && mp != NULL && ISSET(mp->mnt_flag, MNT_ASYNC)) {
+       if (sync && mp != NULL && ISSET(mp->mnt_flag, MNT_ASYNC)
+           && !ISSET(bp->b_flags, B_FUA | B_FUA_NV)) {
                bdwrite(bp);
                return (0);
        }
@@ -1090,6 +1092,7 @@ start:
                allocbuf(bp, size, preserve);
        }
        BIO_SETPRIO(bp, BPRIO_DEFAULT);
+       CLR(bp->b_flags, B_FUA | B_FUA_NV);
        return (bp);
 }
 
@@ -1830,6 +1833,10 @@ nestiobuf_setup(struct buf *mbp, struct 
        KASSERT(mbp->b_bcount >= offset + size);
        bp->b_vp = vp;
        bp->b_flags = B_BUSY | B_CALL | B_ASYNC | b_read;
+       if (mbp->b_flags & B_FUA)
+               bp->b_flags |= B_FUA;
+       if (mbp->b_flags & B_FUA_NV)
+               bp->b_flags |= B_FUA_NV;
        bp->b_iodone = nestiobuf_iodone;
        bp->b_data = mbp->b_data + offset;
        bp->b_resid = bp->b_bcount = size;
Index: sys/kern/vfs_syscalls.c
===================================================================
RCS file: /cvsroot/src/sys/kern/vfs_syscalls.c,v
retrieving revision 1.270
diff -p -u -r1.270 vfs_syscalls.c
--- sys/kern/vfs_syscalls.c     13 Sep 2006 10:07:42 -0000      1.270
+++ sys/kern/vfs_syscalls.c     1 Nov 2006 23:18:50 -0000
@@ -2265,6 +2265,83 @@ sys_preadv(struct lwp *l, void *v, regis
 }
 
 /*
+ * Positional scatter read with flags system call.
+ */
+int
+sys_preadx(struct lwp *l, void *v, register_t *retval)
+{
+       struct sys_preadx_args /* {
+               syscallarg(int) fd;
+               syscallarg(int) flags;
+               syscallarg(const struct iovec *) iovp;
+               syscallarg(int) iovcnt;
+               syscallarg(off_t) offset;
+       } */ *uap = v;
+       struct proc *p = l->l_proc;
+       struct filedesc *fdp = p->p_fd;
+       struct file *fp;
+       struct vnode *vp;
+       off_t offset, *ofp;
+       int error, fd, flags, f1;
+
+       fd = SCARG(uap, fd);
+       flags = SCARG(uap, flags);
+       f1 = 0;
+
+       if ((fp = fd_getfile(fdp, fd)) == NULL)
+               return (EBADF);
+
+       if ((fp->f_flag & FREAD) == 0) {
+               simple_unlock(&fp->f_slock);
+               return (EBADF);
+       }
+
+       FILE_USE(fp);
+
+       vp = (struct vnode *)fp->f_data;
+       if (fp->f_type != DTYPE_VNODE || vp->v_type == VFIFO) {
+               error = ESPIPE;
+               goto out;
+       }
+
+       offset = SCARG(uap, offset);
+
+       if (flags & PXIO_FPOINTER) {
+               /* Ok, someone wants to update the file pointer. Oh well. */
+               ofp = &fp->f_offset;
+               f1 = FOF_UPDATE_OFFSET;
+       } else {
+               offset = SCARG(uap, offset);
+               ofp = &offset;
+               f1 = 0;
+               /*
+                * XXX This works because no file systems actually
+                * XXX take any action on the seek operation.
+                */
+               if ((error = VOP_SEEK(vp, fp->f_offset, offset, fp->f_cred)))
+                       goto out;
+       }
+
+       if (flags & PXIO_FUA)
+               f1 |= FOF_FUA;
+       if (flags & PXIO_FUA_NV)
+               f1 |= FOF_FUA_NV;
+
+#if 0 /* Not yet */
+       if (flags & PXIO_DIRECT)
+               f1 |= XXX;
+#endif
+
+       /* dofilereadv() will unuse the descriptor for us */
+       return (dofilereadv(l, fd, fp, SCARG(uap, iovp), SCARG(uap, iovcnt),
+           ofp, f1, retval));
+
+ out:
+       FILE_UNUSE(fp, l);
+       return (error);
+}
+
+/*
  * Positional write system call.
  */
 int
@@ -2371,6 +2448,81 @@ sys_pwritev(struct lwp *l, void *v, regi
 }
 
 /*
+ * Positional gather write with flags system call.
+ */
+int
+sys_pwritex(struct lwp *l, void *v, register_t *retval)
+{
+       struct sys_pwritex_args /* {
+               syscallarg(int) fd;
+               syscallarg(int) flags;
+               syscallarg(const struct iovec *) iovp;
+               syscallarg(int) iovcnt;
+               syscallarg(off_t) offset;
+       } */ *uap = v;
+       struct proc *p = l->l_proc;
+       struct filedesc *fdp = p->p_fd;
+       struct file *fp;
+       struct vnode *vp;
+       off_t offset, *ofp;
+       int error, fd, flags, f1;
+
+       fd = SCARG(uap, fd);
+       flags = SCARG(uap, flags);
+       f1 = 0;
+
+       if ((fp = fd_getfile(fdp, fd)) == NULL)
+               return (EBADF);
+
+       if ((fp->f_flag & FWRITE) == 0) {
+               simple_unlock(&fp->f_slock);
+               return (EBADF);
+       }
+
+       FILE_USE(fp);
+
+       vp = (struct vnode *)fp->f_data;
+       if (fp->f_type != DTYPE_VNODE || vp->v_type == VFIFO) {
+               error = ESPIPE;
+               goto out;
+       }
+
+       if (flags & PXIO_FPOINTER) {
+               /* Ok, someone wants to update the file pointer. Oh well. */
+               ofp = &fp->f_offset;
+               f1 = FOF_UPDATE_OFFSET;
+       } else {
+               offset = SCARG(uap, offset);
+               ofp = &offset;
+               f1 = 0;
+               /*
+                * XXX This works because no file systems actually
+                * XXX take any action on the seek operation.
+                */
+               if ((error = VOP_SEEK(vp, fp->f_offset, offset, fp->f_cred)))
+                       goto out;
+       }
+
+       if (flags & PXIO_FUA)
+               f1 |= FOF_FUA;
+       if (flags & PXIO_FUA_NV)
+               f1 |= FOF_FUA_NV;
+
+#if 0 /* Not yet */
+       if (flags & PXIO_DIRECT)
+               f1 |= XXX;
+#endif
+
+       /* dofilewritev() will unuse the descriptor for us */
+       return (dofilewritev(l, fd, fp, SCARG(uap, iovp), SCARG(uap, iovcnt),
+           ofp, f1, retval));
+
+ out:
+       FILE_UNUSE(fp, l);
+       return (error);
+}
+
+/*
  * Check access permissions.
  */
 int
Index: sys/kern/vfs_vnops.c
===================================================================
RCS file: /cvsroot/src/sys/kern/vfs_vnops.c,v
retrieving revision 1.125
diff -p -u -r1.125 vfs_vnops.c
--- sys/kern/vfs_vnops.c        5 Oct 2006 14:48:32 -0000       1.125
+++ sys/kern/vfs_vnops.c        1 Nov 2006 23:18:50 -0000
@@ -491,14 +491,29 @@ vn_read(struct file *fp, off_t *offset, 
                ioflag |= IO_SYNC;
        if (fp->f_flag & FALTIO)
                ioflag |= IO_ALTSEMANTICS;
+<<<<<<< vfs_vnops.c
+       if (flags & FOF_FUA_NV)
+               ioflag |= IO_FUA_NV | IO_SYNC;
+       if (flags & FOF_FUA)
+               ioflag |= IO_FUA | IO_SYNC;
+=======
        if (fp->f_flag & FDIRECT)
                ioflag |= IO_DIRECT;
+>>>>>>> 1.125
        vn_lock(vp, LK_SHARED | LK_RETRY);
        uio->uio_offset = *offset;
        count = uio->uio_resid;
+       if (flags & FOF_FUA) {
+               /* Purge any existing pages from the uvm cache */
+               error = VOP_PUTPAGES(vp, *offset, *offset + count,
+                       PGO_FREE | PGO_SYNCIO | PGO_CLEANIT);
+               if (error)
+                       goto out;
+       }
        error = VOP_READ(vp, uio, ioflag, cred);
        if (flags & FOF_UPDATE_OFFSET)
                *offset += count - uio->uio_resid;
+out:
        VOP_UNLOCK(vp, 0);
        return (error);
 }
@@ -526,8 +541,15 @@ vn_write(struct file *fp, off_t *offset,
                ioflag |= IO_DSYNC;
        if (fp->f_flag & FALTIO)
                ioflag |= IO_ALTSEMANTICS;
+<<<<<<< vfs_vnops.c
+       if (flags & FOF_FUA_NV)
+               ioflag |= IO_FUA_NV | IO_SYNC;
+       if (flags & FOF_FUA)
+               ioflag |= IO_FUA | IO_SYNC;
+=======
        if (fp->f_flag & FDIRECT)
                ioflag |= IO_DIRECT;
+>>>>>>> 1.125
        mp = NULL;
        if (vp->v_type != VCHR &&
            (error = vn_start_write(vp, &mp, V_WAIT | V_PCATCH)) != 0)
Index: sys/miscfs/genfs/genfs_vnops.c
===================================================================
RCS file: /cvsroot/src/sys/miscfs/genfs/genfs_vnops.c,v
retrieving revision 1.130
diff -p -u -r1.130 genfs_vnops.c
--- sys/miscfs/genfs/genfs_vnops.c      5 Oct 2006 14:48:32 -0000       1.130
+++ sys/miscfs/genfs/genfs_vnops.c      1 Nov 2006 23:18:53 -0000
@@ -1539,8 +1539,15 @@ genfs_do_io(struct vnode *vp, off_t off,
        mbp->b_bufsize = len;
        mbp->b_data = (void *)kva;
        mbp->b_resid = mbp->b_bcount = bytes;
+<<<<<<< genfs_vnops.c
+       mbp->b_flags = B_BUSY|B_WRITE|B_AGE| (async ? (B_CALL|B_ASYNC) : 0);
+       if (flags & PGO_FUA)
+               mbp->b_flags |= B_FUA;
+       mbp->b_iodone = uvm_aio_biodone;
+=======
        mbp->b_flags = B_BUSY | brw | B_AGE | (async ? (B_CALL | B_ASYNC) : 0);
        mbp->b_iodone = iodone;
+>>>>>>> 1.130
        mbp->b_vp = vp;
        if (curproc == uvm.pagedaemon_proc)
                BIO_SETPRIO(mbp, BPRIO_TIMELIMITED);
Index: sys/miscfs/specfs/spec_vnops.c
===================================================================
RCS file: /cvsroot/src/sys/miscfs/specfs/spec_vnops.c,v
retrieving revision 1.92
diff -p -u -r1.92 spec_vnops.c
--- sys/miscfs/specfs/spec_vnops.c      30 Sep 2006 21:00:13 -0000      1.92
+++ sys/miscfs/specfs/spec_vnops.c      1 Nov 2006 23:18:53 -0000
@@ -303,6 +303,9 @@ spec_open(v)
                return error;
        if (!(*d_ioctl)(vp->v_rdev, DIOCGPART, (caddr_t)&pi, FREAD, curlwp))
                vp->v_size = (voff_t)pi.disklab->d_secsize * pi.part->p_size;
+       vp->v_spec_cap = 0;
+       (*d_ioctl)(vp->v_rdev, DIOCGCAPS, (caddr_t)&vp->v_spec_cap, FREAD,
+               curlwp);
        return 0;
 }
 
@@ -426,6 +429,9 @@ spec_write(v)
        switch (vp->v_type) {
 
        case VCHR:
+               if ((ap->a_ioflag & (IO_FUA | IO_FUA_NV))
+                   && (~vp->v_spec_cap & V_CAP_CAN_FUA))
+                       return (ENODEV);
                VOP_UNLOCK(vp, 0);
                cdev = cdevsw_lookup(vp->v_rdev);
                if (cdev != NULL)
@@ -440,6 +446,9 @@ spec_write(v)
                        return (0);
                if (uio->uio_offset < 0)
                        return (EINVAL);
+               if ((ap->a_ioflag & (IO_FUA | IO_FUA_NV))
+                   && (~vp->v_spec_cap & V_CAP_CAN_FUA))
+                       return (ENODEV);
                bsize = BLKDEV_IOSIZE;
                bdev = bdevsw_lookup(vp->v_rdev);
                if (bdev != NULL &&
@@ -468,7 +477,14 @@ spec_write(v)
                        if (error)
                                brelse(bp);
                        else {
-                               if (n + on == bsize)
+                               if (ap->a_ioflag & (IO_FUA | IO_FUA_NV)) {
+                                       if (ap->a_ioflag & IO_FUA)
+                                               SET(bp->b_flags, B_FUA);
+                                       if (ap->a_ioflag & IO_FUA_NV)
+                                               SET(bp->b_flags, B_FUA_NV);
+                                       SET(bp->b_flags, B_SYNC);
+                                       bwrite(bp);
+                               } else if (n + on == bsize)
                                        bawrite(bp);
                                else
                                        bdwrite(bp);
@@ -662,6 +678,13 @@ spec_strategy(v)
 
        error = 0;
        bp->b_dev = vp->v_rdev;
+       if ((bp->b_flags & (B_FUA | B_FUA_NV))
+           && (~vp->v_spec_cap & V_CAP_CAN_FUA)) {
+               /* Reject B_FUA if device can't do it */
+               error = ENODEV;
+               goto error_out;
+       }
+
        if (!(bp->b_flags & B_READ) &&
            (LIST_FIRST(&bp->b_dep)) != NULL && bioops.io_start)
                (*bioops.io_start)(bp);
@@ -686,16 +709,18 @@ spec_strategy(v)
                SPEC_COW_UNLOCK(vp->v_specinfo, s);
        }
 
-       if (error) {
-               bp->b_error = error;
-               bp->b_flags |= B_ERROR;
-               biodone(bp);
-               return (error);
-       }
+       if (error)
+               goto error_out;
 
        DEV_STRATEGY(bp);
 
        return (0);
+
+error_out:
+       bp->b_error = error;
+       bp->b_flags |= B_ERROR;
+       biodone(bp);
+       return (error);
 }
 
 int
Index: sys/miscfs/specfs/specdev.h
===================================================================
RCS file: /cvsroot/src/sys/miscfs/specfs/specdev.h,v
retrieving revision 1.30
diff -p -u -r1.30 specdev.h
--- sys/miscfs/specfs/specdev.h 14 May 2006 21:32:21 -0000      1.30
+++ sys/miscfs/specfs/specdev.h 1 Nov 2006 23:18:53 -0000
@@ -49,6 +49,7 @@ struct specinfo {
        struct  vnode *si_specnext;
        struct  mount *si_mountpoint;
        dev_t   si_rdev;
+       int     si_cap;
        struct  lockf *si_lockf;
        struct simplelock si_cow_slock;
        SLIST_HEAD(, spec_cow_entry) si_cow_head;
@@ -63,6 +64,7 @@ struct specinfo {
 #define v_specnext     v_specinfo->si_specnext
 #define v_speclockf    v_specinfo->si_lockf
 #define v_specmountpoint v_specinfo->si_mountpoint
+#define v_spec_cap      v_specinfo->si_cap
 #define v_spec_cow_slock v_specinfo->si_cow_slock
 #define v_spec_cow_head        v_specinfo->si_cow_head
 #define v_spec_cow_req v_specinfo->si_cow_req
@@ -81,6 +83,14 @@ struct specinfo {
        } while (/*CONSTCOND*/0)
 
 /*
+ * Device capabilities
+ *
+ * spec_open clears capabilities then calls the device's ioctl routine
+ * (via DIOCGCAPS) to fill in supported/required capabilities.
+ */
+#define V_CAP_CAN_FUA          0x0001  /* Device honors FUA flag */
+
+/*
  * Special device management
  */
 #define        SPECHSZ 64
Index: sys/sys/buf.h
===================================================================
RCS file: /cvsroot/src/sys/sys/buf.h,v
retrieving revision 1.89
diff -p -u -r1.89 buf.h
--- sys/sys/buf.h       10 Sep 2006 06:35:42 -0000      1.89
+++ sys/sys/buf.h       1 Nov 2006 23:18:54 -0000
@@ -213,11 +213,14 @@ do {                                                      \
 #define        B_WRITE         0x00000000      /* Write buffer (pseudo flag). */
 #define        B_XXX           0x02000000      /* Debugging flag. */
 #define        B_VFLUSH        0x04000000      /* Buffer is being synced. */
+#define        B_FUA           0x08000000      /* Force Unit Access on i/o. */
+#define        B_FUA_NV        0x10000000      /* FUA_NV: NV cache or medium. */
 
 #define BUF_FLAGBITS \
     "\20\1AGE\3ASYNC\4BAD\5BUSY\6SCANNED\7CALL\10DELWRI" \
     "\11DIRTY\12DONE\14ERROR\15GATHERED\16INVAL\17LOCKED\20NOCACHE" \
-    "\22CACHE\23PHYS\24RAW\25READ\26TAPE\30WANTED\32XXX\33VFLUSH"
+    "\22CACHE\23PHYS\24RAW\25READ\26TAPE\30WANTED\32XXX\33VFLUSH" \
+    "\34FUA\35FUA_NV"
 
 
 /*
Index: sys/sys/dkio.h
===================================================================
RCS file: /cvsroot/src/sys/sys/dkio.h,v
retrieving revision 1.12
diff -p -u -r1.12 dkio.h
--- sys/sys/dkio.h      26 Dec 2005 10:36:47 -0000      1.12
+++ sys/sys/dkio.h      1 Nov 2006 23:18:54 -0000
@@ -98,4 +98,7 @@
 #define        DIOCGSTRATEGY   _IOR('d', 125, struct disk_strategy)
 #define        DIOCSSTRATEGY   _IOW('d', 126, struct disk_strategy)
 
+               /* device capabilities, enumerated in specdev.h */
+#define        DIOCGCAPS       _IOR('d', 127, int)
+
 #endif /* _SYS_DKIO_H_ */
Index: sys/sys/file.h
===================================================================
RCS file: /cvsroot/src/sys/sys/file.h,v
retrieving revision 1.56
diff -p -u -r1.56 file.h
--- sys/sys/file.h      14 May 2006 21:38:18 -0000      1.56
+++ sys/sys/file.h      1 Nov 2006 23:18:55 -0000
@@ -149,6 +149,8 @@ do {                                                        \
  * Flags for fo_read and fo_write.
  */
 #define        FOF_UPDATE_OFFSET       0x01    /* update the file offset */
+#define        FOF_FUA                 0x02    /* operation req FUA */
+#define        FOF_FUA_NV              0x04    /* operation req FUA_NV */
 
 LIST_HEAD(filelist, file);
 extern struct filelist filehead;       /* head of list of open files */
Index: sys/sys/syscall.h
===================================================================
RCS file: /cvsroot/src/sys/sys/syscall.h,v
retrieving revision 1.174
diff -p -u -r1.174 syscall.h
--- sys/sys/syscall.h   1 Sep 2006 21:04:45 -0000       1.174
+++ sys/sys/syscall.h   1 Nov 2006 23:18:55 -0000
@@ -1,4 +1,4 @@
-/* $NetBSD: syscall.h,v 1.174 2006/09/01 21:04:45 matt Exp $ */
+/* $NetBSD$ */
 
 /*
  * System call numbers.
@@ -1100,6 +1100,12 @@
 /* syscall: "__fhstat40" ret: "int" args: "const void *" "size_t" "struct stat *" */
 #define        SYS___fhstat40  398
 
-#define        SYS_MAXSYSCALL  399
+/* syscall: "pwritex" ret: "ssize_t" args: "int" "int" "const struct iovec *" "int" "off_t" */
+#define        SYS_pwritex     399
+
+/* syscall: "preadx" ret: "ssize_t" args: "int" "int" "const struct iovec *" "int" "off_t" */
+#define        SYS_preadx      400
+
+#define        SYS_MAXSYSCALL  401
 #define        SYS_NSYSENT     512
 #endif /* _SYS_SYSCALL_H_ */
Index: sys/sys/syscallargs.h
===================================================================
RCS file: /cvsroot/src/sys/sys/syscallargs.h,v
retrieving revision 1.156
diff -p -u -r1.156 syscallargs.h
--- sys/sys/syscallargs.h       1 Sep 2006 21:04:45 -0000       1.156
+++ sys/sys/syscallargs.h       1 Nov 2006 23:18:57 -0000
@@ -1,4 +1,4 @@
-/* $NetBSD: syscallargs.h,v 1.156 2006/09/01 21:04:45 matt Exp $ */
+/* $NetBSD$ */
 
 /*
  * System call argument lists.
@@ -1734,6 +1734,22 @@ struct sys___fhstat40_args {
        syscallarg(struct stat *) sb;
 };
 
+struct sys_pwritex_args {
+       syscallarg(int) fd;
+       syscallarg(int) flags;
+       syscallarg(const struct iovec *) iovp;
+       syscallarg(int) iovcnt;
+       syscallarg(off_t) offset;
+};
+
+struct sys_preadx_args {
+       syscallarg(int) fd;
+       syscallarg(int) flags;
+       syscallarg(const struct iovec *) iovp;
+       syscallarg(int) iovcnt;
+       syscallarg(off_t) offset;
+};
+
 /*
  * System call prototypes.
  */
@@ -2445,4 +2461,8 @@ int       sys___fhstatvfs140(struct lwp *, voi
 
 int    sys___fhstat40(struct lwp *, void *, register_t *);
 
+int    sys_pwritex(struct lwp *, void *, register_t *);
+
+int    sys_preadx(struct lwp *, void *, register_t *);
+
 #endif /* _SYS_SYSCALLARGS_H_ */
Index: sys/sys/uio.h
===================================================================
RCS file: /cvsroot/src/sys/sys/uio.h,v
retrieving revision 1.34
diff -p -u -r1.34 uio.h
--- sys/sys/uio.h       1 Mar 2006 12:38:32 -0000       1.34
+++ sys/sys/uio.h       1 Nov 2006 23:18:57 -0000
@@ -111,7 +111,9 @@ void uio_setup_sysspace(struct uio *);
 __BEGIN_DECLS
 #if defined(_NETBSD_SOURCE)
 ssize_t preadv(int, const struct iovec *, int, off_t);
+ssize_t preadx(int, int, const struct iovec *, int, off_t);
 ssize_t pwritev(int, const struct iovec *, int, off_t);
+ssize_t pwritex(int, int, const struct iovec *, int, off_t);
 #endif /* _NETBSD_SOURCE */
 ssize_t        readv(int, const struct iovec *, int);
 ssize_t        writev(int, const struct iovec *, int);
@@ -120,4 +122,10 @@ __END_DECLS
 int ureadc(int, struct uio *);
 #endif /* !_KERNEL */
 
+#define PXIO_FUA               0x0001  /* Assert FUA for i/o */
+#define PXIO_FUA_NV            0x0002  /* Assert FUA_NV for i/o */
+#define PXIO_FPOINTER          0x0004  /* Use file pointer, not offset */
+/* Following not implemented yet */
+#define PXIO_DIRECT            0x0008  /* Assert O_DIRECT for i/o */
+
 #endif /* !_SYS_UIO_H_ */
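
For anyone who wants to experiment with the calling convention from
userland, here is a rough sketch. The PXIO_* values are copied from the
uio.h hunk above. Everything else is an assumption for illustration:
pwritex_compat() and write_record() are hypothetical names that are not
part of the patch, and the fallback emulates PXIO_FUA with pwritev()
followed by fsync(), which is stronger and slower than a per-request FUA.

```c
#define _DEFAULT_SOURCE	/* exposes pwritev() on glibc */
#include <sys/types.h>
#include <sys/uio.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Flag values as defined in the uio.h hunk of the patch. */
#ifndef PXIO_FUA
#define PXIO_FUA	0x0001	/* Assert FUA for i/o */
#define PXIO_FUA_NV	0x0002	/* Assert FUA_NV for i/o */
#define PXIO_FPOINTER	0x0004	/* Use file pointer, not offset */
#endif

/*
 * HYPOTHETICAL stand-in taking the same arguments as pwritex().
 * It is not the patch's implementation: FUA is approximated by
 * flushing the whole file with fsync() after the write.
 */
static ssize_t
pwritex_compat(int fd, int flags, const struct iovec *iov, int iovcnt,
    off_t offset)
{
	ssize_t n;

	if (flags & PXIO_FPOINTER)
		n = writev(fd, iov, iovcnt);	/* offset is ignored */
	else
		n = pwritev(fd, iov, iovcnt, offset);
	if (n < 0)
		return -1;
	if ((flags & (PXIO_FUA | PXIO_FUA_NV)) != 0 && fsync(fd) == -1)
		return -1;
	return n;
}

/* Write a two-part record at offset 0; returns 0 on success. */
static int
write_record(const char *path)
{
	char hdr[] = "HDR:";
	char body[] = "payload";
	struct iovec iov[2];
	ssize_t want, n;
	int fd;

	iov[0].iov_base = hdr;
	iov[0].iov_len = strlen(hdr);
	iov[1].iov_base = body;
	iov[1].iov_len = strlen(body);
	want = (ssize_t)(iov[0].iov_len + iov[1].iov_len);

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd == -1)
		return -1;
	n = pwritex_compat(fd, PXIO_FUA, iov, 2, 0);
	close(fd);
	return (n == want) ? 0 : -1;
}
```

On a kernel with the patch applied, the call inside write_record() would
presumably become pwritex(fd, PXIO_FUA, iov, 2, 0) with no separate flush.
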
Index: sys/sys/vnode.h
===================================================================
RCS file: /cvsroot/src/sys/sys/vnode.h,v
retrieving revision 1.156
diff -p -u -r1.156 vnode.h
--- sys/sys/vnode.h     5 Oct 2006 14:48:33 -0000       1.156
+++ sys/sys/vnode.h     1 Nov 2006 23:18:58 -0000
@@ -234,7 +234,12 @@ struct vattr {
 #define        IO_ALTSEMANTICS 0x00400         /* use alternate i/o semantics */
 #define        IO_NORMAL       0x00800         /* operate on regular data */
 #define        IO_EXT          0x01000         /* operate on extended attributes */
+<<<<<<< vnode.h
+#define        IO_FUA          0x02000         /* Set FUA on i/o */
+#define        IO_FUA_NV       0x04000         /* Set FUA_NV on i/o */
+=======
 #define        IO_DIRECT       0x02000         /* direct I/O hint */
+>>>>>>> 1.156
 #define        IO_ADV_MASK     0x00003         /* access pattern hint */
 
 #define        IO_ADV_SHIFT    0
Index: sys/ufs/ext2fs/ext2fs_readwrite.c
===================================================================
RCS file: /cvsroot/src/sys/ufs/ext2fs/ext2fs_readwrite.c,v
retrieving revision 1.43
diff -p -u -r1.43 ext2fs_readwrite.c
--- sys/ufs/ext2fs/ext2fs_readwrite.c   14 May 2006 21:32:21 -0000      1.43
+++ sys/ufs/ext2fs/ext2fs_readwrite.c   1 Nov 2006 23:18:58 -0000
@@ -330,17 +330,20 @@ ext2fs_write(void *v)
                         * XXXUBC simplistic async flushing.
                         */
 
-                       if (!async && oldoff >> 16 != uio->uio_offset >> 16) {
+                       if (!async && oldoff >> 16 != uio->uio_offset >> 16
+                           && ((ioflag & IO_FUA) == 0)) {
                                simple_lock(&vp->v_interlock);
                                error = VOP_PUTPAGES(vp, (oldoff >> 16) << 16,
                                    (uio->uio_offset >> 16) << 16, PGO_CLEANIT);
                        }
                }
                if (error == 0 && ioflag & IO_SYNC) {
+                       int f;
+                       f = (ioflag & IO_FUA) ? PGO_FUA : 0;
+                       f |= PGO_CLEANIT | PGO_SYNCIO;
                        simple_lock(&vp->v_interlock);
                        error = VOP_PUTPAGES(vp, trunc_page(oldoff),
-                           round_page(blkroundup(fs, uio->uio_offset)),
-                           PGO_CLEANIT | PGO_SYNCIO);
+                           round_page(blkroundup(fs, uio->uio_offset)), f);
                }
 
                goto out;
@@ -376,9 +379,12 @@ ext2fs_write(void *v)
                        extended = 1;
                }
 
+               if (ioflag & IO_FUA)
+                       bp->b_flags |= B_FUA;
                if (ioflag & IO_SYNC)
                        (void)bwrite(bp);
-               else if (xfersize + blkoffset == fs->e2fs_bsize)
+               else if ((xfersize + blkoffset == fs->e2fs_bsize)
+                   || (ioflag & IO_FUA))
                        bawrite(bp);
                else
                        bdwrite(bp);
Index: sys/ufs/lfs/lfs_vfsops.c
===================================================================
RCS file: /cvsroot/src/sys/ufs/lfs/lfs_vfsops.c,v
retrieving revision 1.222
diff -p -u -r1.222 lfs_vfsops.c
--- sys/ufs/lfs/lfs_vfsops.c    4 Oct 2006 15:56:46 -0000       1.222
+++ sys/ufs/lfs/lfs_vfsops.c    1 Nov 2006 23:19:00 -0000
@@ -1651,6 +1651,8 @@ lfs_gop_write(struct vnode *vp, struct v
        mbp->b_data = (void *)kva;
        mbp->b_resid = mbp->b_bcount = bytes;
        mbp->b_flags = B_BUSY|B_WRITE|B_AGE|B_CALL;
+       if (flags & PGO_FUA)
+               mbp->b_flags |= B_FUA;
        mbp->b_iodone = uvm_aio_biodone;
        mbp->b_vp = vp;
 
@@ -1715,6 +1717,8 @@ lfs_gop_write(struct vnode *vp, struct v
                            (vaddr_t)(offset - pg->offset);
                        bp->b_resid = bp->b_bcount = iobytes;
                        bp->b_flags = B_BUSY|B_WRITE|B_CALL;
+                       if (flags & PGO_FUA)
+                               bp->b_flags |= B_FUA;
                        bp->b_iodone = uvm_aio_biodone1;
                }
 
Index: sys/ufs/ufs/ufs_readwrite.c
===================================================================
RCS file: /cvsroot/src/sys/ufs/ufs/ufs_readwrite.c,v
retrieving revision 1.70
diff -p -u -r1.70 ufs_readwrite.c
--- sys/ufs/ufs/ufs_readwrite.c 5 Oct 2006 14:48:33 -0000       1.70
+++ sys/ufs/ufs/ufs_readwrite.c 1 Nov 2006 23:19:00 -0000
@@ -277,7 +277,7 @@ WRITE(void *v)
                return (0);
 
        flags = ioflag & IO_SYNC ? B_SYNC : 0;
-       async = vp->v_mount->mnt_flag & MNT_ASYNC;
+       async = (vp->v_mount->mnt_flag & MNT_ASYNC);
        origoff = uio->uio_offset;
        resid = uio->uio_resid;
        osize = ip->i_size;
@@ -400,8 +400,9 @@ WRITE(void *v)
                 * XXXUBC simplistic async flushing.
                 */
 
 #ifndef LFS_READWRITE
-               if (!async && oldoff >> 16 != uio->uio_offset >> 16) {
+               if (!async && oldoff >> 16 != uio->uio_offset >> 16
+                   && ((ioflag & IO_FUA) == 0)) {
                        simple_lock(&vp->v_interlock);
                        error = VOP_PUTPAGES(vp, (oldoff >> 16) << 16,
                            (uio->uio_offset >> 16) << 16, PGO_CLEANIT);
@@ -411,10 +416,12 @@ WRITE(void *v)
 #endif
        }
        if (error == 0 && ioflag & IO_SYNC) {
+               int     f;
+               f = (ioflag & IO_FUA) ? PGO_FUA : 0;
+               f |= PGO_CLEANIT | PGO_SYNCIO;
                simple_lock(&vp->v_interlock);
                error = VOP_PUTPAGES(vp, trunc_page(origoff & fs->fs_bmask),
-                   round_page(blkroundup(fs, uio->uio_offset)),
-                   PGO_CLEANIT | PGO_SYNCIO);
+                   round_page(blkroundup(fs, uio->uio_offset)), f);
        }
        goto out;
 
@@ -465,15 +472,18 @@ WRITE(void *v)
                        brelse(bp);
                        break;
                }
+               if (ioflag & IO_FUA)
+                       bp->b_flags |= B_FUA;
 #ifdef LFS_READWRITE
                (void)VOP_BWRITE(bp);
                lfs_reserve(fs, vp, NULL,
                    -btofsb(fs, (NIADDR + 1) << fs->lfs_bshift));
                need_unreserve = FALSE;
 #else
-               if (ioflag & IO_SYNC)
+               if (ioflag & IO_SYNC) {
                        (void)bwrite(bp);
-               else if (xfersize + blkoffset == fs->fs_bsize)
+               } else if ((xfersize + blkoffset == fs->fs_bsize)
+                   || (ioflag & IO_FUA))
                        bawrite(bp);
                else
                        bdwrite(bp);
Index: sys/uvm/uvm_pager.h
===================================================================
RCS file: /cvsroot/src/sys/uvm/uvm_pager.h,v
retrieving revision 1.34
diff -p -u -r1.34 uvm_pager.h
--- sys/uvm/uvm_pager.h 22 Feb 2006 22:28:18 -0000      1.34
+++ sys/uvm/uvm_pager.h 1 Nov 2006 23:19:00 -0000
@@ -162,6 +162,8 @@ struct uvm_pagerops {
 #define PGO_PASTEOF    0x400   /* allow allocation of pages past EOF */
 #define PGO_NOBLOCKALLOC 0x800 /* backing block allocation is not needed */
 #define PGO_NOTIMESTAMP 0x1000 /* don't mark object accessed/modified */
+#define PGO_FUA                0x2000  /* Pass FUA down on I/O calls */
+#define PGO_FUA_NV     0x4000  /* Pass FUA_NV down on I/O calls */
 
 /* page we are not interested in getting */
 #define PGO_DONTCARE ((struct vm_page *) -1L)  /* [get only] */
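
To make the flag plumbing in the hunks above easier to follow, here is a
small standalone sketch of how a request might travel from the PXIO_*
flags down to the pager flags and the choice of buffer write call. The
PXIO_*, IO_FUA* and PGO_FUA* values are the ones in the patch (IO_FUA is
the value from the vnode.h hunk, which still sits inside an unresolved
conflict with IO_DIRECT). The values for IO_SYNC, PGO_CLEANIT and
PGO_SYNCIO are placeholders, and pxio_to_ioflag() is a guess at what the
syscall layer does, since that code is not in the hunks shown here.
putpages_flags() and pick_write() are hypothetical names that mirror the
ext2fs/ufs write paths.

```c
/* Values taken from the patch. */
#define PXIO_FUA	0x0001
#define PXIO_FUA_NV	0x0002
#define IO_FUA		0x02000
#define IO_FUA_NV	0x04000
#define PGO_FUA		0x2000
#define PGO_FUA_NV	0x4000

/* PLACEHOLDER values for this sketch only; not taken from the patch. */
#define IO_SYNC		0x00040
#define PGO_CLEANIT	0x001
#define PGO_SYNCIO	0x002

enum wr_how { WR_SYNC, WR_ASYNC, WR_DELAYED };

/* ASSUMED syscall-layer mapping from user flags to vnode i/o flags. */
static int
pxio_to_ioflag(int pxflags)
{
	int ioflag = 0;

	if (pxflags & PXIO_FUA)
		ioflag |= IO_FUA;
	if (pxflags & PXIO_FUA_NV)
		ioflag |= IO_FUA_NV;
	return ioflag;
}

/* Mirrors the "f" computation before VOP_PUTPAGES() in the IO_SYNC path. */
static int
putpages_flags(int ioflag)
{
	int f;

	f = (ioflag & IO_FUA) ? PGO_FUA : 0;
	f |= PGO_CLEANIT | PGO_SYNCIO;
	return f;
}

/*
 * Mirrors the bwrite()/bawrite()/bdwrite() selection: a FUA write is
 * started immediately instead of being left as a delayed write.
 */
static enum wr_how
pick_write(int ioflag, int full_block)
{
	if (ioflag & IO_SYNC)
		return WR_SYNC;		/* bwrite() */
	if (full_block || (ioflag & IO_FUA))
		return WR_ASYNC;	/* bawrite() */
	return WR_DELAYED;		/* bdwrite() */
}
```

Note that the hunks shown only pass FUA (not FUA_NV) down to the pager, so
the FUA_NV handling in pxio_to_ioflag() is an assumption.
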
