On Sun, Nov 02, 2008 at 01:50:25PM -0500, Thor Lancelot Simon wrote:
> On Sun, Nov 02, 2008 at 07:57:37AM -0800, Bill Stouder-Studenmund wrote:
> >
> > So Thor, why are you so entrenched in this? If you're going to add a bit,
> > just add FUA. It does exactly what you want. It was designed to do what
> > you want.
>
> We can't enqueue new commands while waiting for a cache flush to complete.

For SCSI systems, SBC does not indicate that this is a requirement. We
should be able to continue enqueueing new commands while the cache is
flushing.

> We *can* enqueue new commands while waiting for an ordered tag (and all
> its prior simple tags) to complete. This seems like a huge advantage to
> me. I also see other uses for disk write barriers beyond just solving
> WAPBL's immediate problem.

It depends. How much i/o are we waiting for and how much is blocked
waiting?

> The other thing is that if you look at other journalled filesystems
> (particularly XFS) they manage to use write barriers to not do every
> journal write synchronously (okay, pseudo-synchronously). Given that we
> don't support an external journal this seems particularly interesting to
> me.

You're still using the wrong tool for things.

Another thing to look at would be to see how drives have supported FUA
over time. Both tagged queuing and FUA came in with SCSI-2. All the tags
are, as best I can tell from reading the original spec, required if you
support tagged queuing. They were optional if you didn't support tagged
queuing. FUA (and DPO) were also optional. But I think tagged queuing was
supported sooner since it was a notable performance improvement.

The point I'm trying to circle is that things like XFS may well have been
written back at a time when tagged queuing was much more likely to be
supported (and correctly supported) than FUA. My experience as a target
maker was that Windows uses FUA but we never saw ORDERED tags out of it.
> Also, it is clear to me how to implement B_BARRIER -- I know exactly what
> to modify and how, except for the disk sorting code, which I'm reasonably
> confident I can figure out. I do not know how to implement FUA for each
> kind of disk in the system -- if you do, by all means, be my guest!

This I understand! To be honest, I think the things you're describing are
what we should do if FUA doesn't work on a drive.

I'm attaching diffs (from 2006, sorry!) that implement FUA and also the
writex() command I talked about back at that time. Sadly, there also are
some conflicts in the code. But they should be easy to sort out.

So my suggestion is do this, then if a device doesn't support FUA, fall
back to either the synchronize cache or B_BARRIER things you're
describing.

Take care,

Bill
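A minimal sketch of the fallback order suggested above; it is not part of the attached diffs. `V_CAP_CAN_FUA` is the capability bit from the patch, while `V_CAP_CAN_ORDERED`, `enum stable_write_method`, and `choose_stable_write()` are hypothetical names made up for illustration. Trying the barrier before the cache flush is also an assumption, since the message leaves that order open.

```c
/* Capability bit taken from the attached diffs (specdev.h). */
#define V_CAP_CAN_FUA      0x0001  /* device honors the FUA flag */

/* Hypothetical bit, not in the diffs: ordered tags usable as a barrier. */
#define V_CAP_CAN_ORDERED  0x0002

/* Hypothetical names for the three ways to force a write to the media. */
enum stable_write_method {
    STABLE_WRITE_FUA,        /* set FUA on the write command itself */
    STABLE_WRITE_BARRIER,    /* B_BARRIER, i.e. an ORDERED tag */
    STABLE_WRITE_SYNC_CACHE  /* plain write, then SYNCHRONIZE CACHE */
};

/*
 * Pick a method from the capability word a device reported.
 * FUA is preferred; the order of the two fallbacks is an assumption.
 */
enum stable_write_method
choose_stable_write(int caps)
{
    if (caps & V_CAP_CAN_FUA)
        return STABLE_WRITE_FUA;
    if (caps & V_CAP_CAN_ORDERED)
        return STABLE_WRITE_BARRIER;
    return STABLE_WRITE_SYNC_CACHE;
}
```

A caller would pass in whatever the device returned for DIOCGCAPS, which spec_open() in the diffs stores in v_spec_cap.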
? .BUFPATHS.swp
? BUFPATHS
? TODO
? TheTools
? cvslog
? diffie.20060928
? lib/libc/sys/pwritex.c
? sys/arch/i386/compile/GENERIC
Index: sys/dev/ld.c
===================================================================
RCS file: /cvsroot/src/sys/dev/ld.c,v
retrieving revision 1.40
diff -p -u -r1.40 ld.c
--- sys/dev/ld.c 28 Mar 2006 17:38:29 -0000 1.40
+++ sys/dev/ld.c 1 Nov 2006 23:18:38 -0000
@@ -66,6 +66,7 @@ __KERNEL_RCSID(0, "$NetBSD: ld.c,v 1.40
 #if NRND > 0
 #include <sys/rnd.h>
 #endif
+#include <miscfs/specfs/specdev.h>
 
 #include <dev/ldvar.h>
 
@@ -361,16 +362,28 @@ ldclose(dev_t dev, int flags, int fmt, s
 static int
 ldread(dev_t dev, struct uio *uio, int ioflag)
 {
+        int f = B_READ;
 
-        return (physio(ldstrategy, NULL, dev, B_READ, ldminphys, uio));
+        if (ioflag & IO_FUA)
+                f |= B_FUA;
+        if (ioflag & IO_FUA_NV)
+                f |= B_FUA_NV;
+
+        return (physio(ldstrategy, NULL, dev, f, ldminphys, uio));
 }
 
 /* ARGSUSED */
 static int
 ldwrite(dev_t dev, struct uio *uio, int ioflag)
 {
+        int f = B_WRITE;
+
+        if (ioflag & IO_FUA)
+                f |= B_FUA;
+        if (ioflag & IO_FUA_NV)
+                f |= B_FUA_NV;
 
-        return (physio(ldstrategy, NULL, dev, B_WRITE, ldminphys, uio));
+        return (physio(ldstrategy, NULL, dev, f, ldminphys, uio));
 }
 
 /* ARGSUSED */
@@ -523,6 +536,11 @@ ldioctl(dev_t dev, u_long cmd, caddr_t a
                 return (dkwedge_list(&sc->sc_dk, dkwl, l));
             }
 
+        case DIOCGCAPS:
+                if (sc->sc_flags & LDF_CAN_FUA)
+                        *(int *) addr = V_CAP_CAN_FUA;
+                break;
+
         default:
                 error = ENOTTY;
                 break;
Index: sys/dev/ldvar.h
===================================================================
RCS file: /cvsroot/src/sys/dev/ldvar.h,v
retrieving revision 1.11
diff -p -u -r1.11 ldvar.h
--- sys/dev/ldvar.h 11 Dec 2005 12:20:53 -0000 1.11
+++ sys/dev/ldvar.h 1 Nov 2006 23:18:38 -0000
@@ -72,6 +72,7 @@ struct ld_softc {
 #define LDF_DETACH      0x040   /* detach pending */
 #define LDF_KLABEL      0x080   /* keep label on close */
 #define LDF_VLABEL      0x100   /* label is valid */
+#define LDF_CAN_FUA     0x200   /* Device handles FUA */
 
 int     ldadjqparam(struct ld_softc *, int);
 void    ldattach(struct ld_softc *);
Index: sys/dev/scsipi/sd.c
===================================================================
RCS file: /cvsroot/src/sys/dev/scsipi/sd.c,v
retrieving revision 1.250
diff -p -u -r1.250 sd.c
--- sys/dev/scsipi/sd.c 14 Sep 2006 17:54:34 -0000 1.250
+++ sys/dev/scsipi/sd.c 1 Nov 2006 23:18:42 -0000
@@ -80,6 +80,7 @@ __KERNEL_RCSID(0, "$NetBSD: sd.c,v 1.250
 #if NRND > 0
 #include <sys/rnd.h>
 #endif
+#include <miscfs/specfs/specdev.h>
 
 #include <dev/scsipi/scsi_spc.h>
 #include <dev/scsipi/scsipi_all.h>
@@ -828,11 +829,13 @@ sdstart(struct scsipi_periph *periph)
                 /*
                  * Fill out the scsi command. Use the smallest CDB possible
-                 * (6-byte, 10-byte, or 16-byte).
+                 * (6-byte, 10-byte, or 16-byte). FUA commands have to
+                 * use 10-byte or 16-byte.
                  */
                 if (((bp->b_rawblkno & 0x1fffff) == bp->b_rawblkno) &&
                     ((nblks & 0xff) == nblks) &&
-                    !(periph->periph_quirks & PQUIRK_ONLYBIG)) {
+                    !(periph->periph_quirks & PQUIRK_ONLYBIG) &&
+                    !(bp->b_flags & (B_FUA | B_FUA_NV))) {
                         /* 6-byte CDB */
                         memset(&cmd_small, 0, sizeof(cmd_small));
                         cmd_small.opcode = (bp->b_flags & B_READ) ?
@@ -846,6 +849,10 @@ sdstart(struct scsipi_periph *periph)
                         memset(&cmd_big, 0, sizeof(cmd_big));
                         cmd_big.opcode = (bp->b_flags & B_READ) ?
                             READ_10 : WRITE_10;
+                        if (bp->b_flags & B_FUA)
+                                cmd_big.byte2 |= SRWB_FUA;
+                        if (bp->b_flags & B_FUA_NV)
+                                cmd_big.byte2 |= SRWB_FUA_NV;
                         _lto4b(bp->b_rawblkno, cmd_big.addr);
                         _lto2b(nblks, cmd_big.length);
                         cmdlen = sizeof(cmd_big);
@@ -855,6 +862,10 @@ sdstart(struct scsipi_periph *periph)
                         memset(&cmd16, 0, sizeof(cmd16));
                         cmd16.opcode = (bp->b_flags & B_READ) ?
                             READ_16 : WRITE_16;
+                        if (bp->b_flags & B_FUA)
+                                cmd16.byte2 |= SRWB_FUA;
+                        if (bp->b_flags & B_FUA_NV)
+                                cmd16.byte2 |= SRWB_FUA_NV;
                         _lto8b(bp->b_rawblkno, cmd16.addr);
                         _lto4b(nblks, cmd16.length);
                         cmdlen = sizeof(cmd16);
@@ -983,15 +994,27 @@ sdminphys(struct buf *bp)
 static int
 sdread(dev_t dev, struct uio *uio, int ioflag)
 {
+        int f = B_READ;
 
-        return (physio(sdstrategy, NULL, dev, B_READ, sdminphys, uio));
+        if (ioflag & IO_FUA)
+                f |= B_FUA;
+        if (ioflag & IO_FUA_NV)
+                f |= B_FUA_NV;
+
+        return (physio(sdstrategy, NULL, dev, f, sdminphys, uio));
 }
 
 static int
 sdwrite(dev_t dev, struct uio *uio, int ioflag)
 {
+        int f = B_WRITE;
+
+        if (ioflag & IO_FUA)
+                f |= B_FUA;
+        if (ioflag & IO_FUA_NV)
+                f |= B_FUA_NV;
 
-        return (physio(sdstrategy, NULL, dev, B_WRITE, sdminphys, uio));
+        return (physio(sdstrategy, NULL, dev, f, sdminphys, uio));
 }
 
 /*
@@ -1239,6 +1262,11 @@ bad:
                 return (dkwedge_list(&sd->sc_dk, dkwl, l));
             }
 
+        case DIOCGCAPS:
+                /* Report capabilities */
+                *(int *) addr = V_CAP_CAN_FUA;
+                return 0;
+
         default:
                 if (part != RAW_PART)
                         return (ENOTTY);
Index: sys/kern/init_sysent.c
===================================================================
RCS file: /cvsroot/src/sys/kern/init_sysent.c,v
retrieving revision 1.181
diff -p -u -r1.181 init_sysent.c
--- sys/kern/init_sysent.c 1 Sep 2006 21:04:45 -0000 1.181
+++ sys/kern/init_sysent.c 1 Nov 2006 23:18:45 -0000
@@ -1,4 +1,4 @@
-/* $NetBSD: init_sysent.c,v 1.181 2006/09/01 21:04:45 matt Exp $ */
+/* $NetBSD$ */
 
 /*
  * System call switch table.
@@ -8,7 +8,7 @@
  */
 
 #include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: init_sysent.c,v 1.181 2006/09/01 21:04:45 matt Exp $");
+__KERNEL_RCSID(0, "$NetBSD$");
 
 #include "opt_ktrace.h"
 #include "opt_nfsserver.h"
@@ -1048,10 +1048,10 @@ struct sysent sysent[] = {
             sys___fhstatvfs140 },               /* 397 = __fhstatvfs140 */
         { 3, s(struct sys___fhstat40_args), 0,
             sys___fhstat40 },                   /* 398 = __fhstat40 */
-        { 0, 0, 0,
-            sys_nosys },                        /* 399 = filler */
-        { 0, 0, 0,
-            sys_nosys },                        /* 400 = filler */
+        { 5, s(struct sys_pwritex_args), 0,
+            sys_pwritex },                      /* 399 = pwritex */
+        { 5, s(struct sys_preadx_args), 0,
+            sys_preadx },                       /* 400 = preadx */
         { 0, 0, 0,
             sys_nosys },                        /* 401 = filler */
         { 0, 0, 0,
Index: sys/kern/kern_physio.c
===================================================================
RCS file: /cvsroot/src/sys/kern/kern_physio.c,v
retrieving revision 1.74
diff -p -u -r1.74 kern_physio.c
--- sys/kern/kern_physio.c 5 Oct 2006 14:48:32 -0000 1.74
+++ sys/kern/kern_physio.c 1 Nov 2006 23:18:45 -0000
@@ -287,7 +287,7 @@ physio(void (*strategy)(struct buf *), s
         DPRINTF(("%s: called: off=%" PRIu64 ", resid=%zu\n",
             __func__, uio->uio_offset, uio->uio_resid));
 
-        flags &= B_READ | B_WRITE;
+        flags &= B_READ | B_WRITE | B_FUA | B_FUA_NV;
 
         /* Make sure we have a buffer, creating one if necessary. */
         if (obp != NULL) {
Index: sys/kern/syscalls.c
===================================================================
RCS file: /cvsroot/src/sys/kern/syscalls.c,v
retrieving revision 1.177
diff -p -u -r1.177 syscalls.c
--- sys/kern/syscalls.c 1 Sep 2006 22:23:18 -0000 1.177
+++ sys/kern/syscalls.c 1 Nov 2006 23:18:45 -0000
@@ -1,4 +1,4 @@
-/* $NetBSD: syscalls.c,v 1.177 2006/09/01 22:23:18 matt Exp $ */
+/* $NetBSD$ */
 
 /*
  * System call names.
@@ -8,7 +8,7 @@
  */
 
 #include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: syscalls.c,v 1.177 2006/09/01 22:23:18 matt Exp $");
+__KERNEL_RCSID(0, "$NetBSD$");
 
 #if defined(_KERNEL_OPT)
 #include "opt_ktrace.h"
@@ -534,4 +534,6 @@ const char *const syscallnames[] = {
         "__fhopen40",                   /* 396 = __fhopen40 */
         "__fhstatvfs140",               /* 397 = __fhstatvfs140 */
         "__fhstat40",                   /* 398 = __fhstat40 */
+        "pwritex",                      /* 399 = pwritex */
+        "preadx",                       /* 400 = preadx */
 };
Index: sys/kern/syscalls.master
===================================================================
RCS file: /cvsroot/src/sys/kern/syscalls.master,v
retrieving revision 1.159
diff -p -u -r1.159 syscalls.master
--- sys/kern/syscalls.master 1 Sep 2006 20:58:18 -0000 1.159
+++ sys/kern/syscalls.master 1 Nov 2006 23:18:46 -0000
@@ -797,3 +797,9 @@
                             size_t fh_size, struct statvfs *buf, int flags); }
 398     STD             { int sys___fhstat40(const void *fhp, \
                             size_t fh_size, struct stat *sb); }
+399     STD             { ssize_t sys_pwritex(int fd, int flags, \
+                            const struct iovec *iovp, int iovcnt, \
+                            off_t offset); }
+400     STD             { ssize_t sys_preadx(int fd, int flags, \
+                            const struct iovec *iovp, int iovcnt, \
+                            off_t offset); }
Index: sys/kern/vfs_bio.c
===================================================================
RCS file: /cvsroot/src/sys/kern/vfs_bio.c,v
retrieving revision 1.163
diff -p -u -r1.163 vfs_bio.c
--- sys/kern/vfs_bio.c 10 Sep 2006 06:35:42 -0000 1.163
+++ sys/kern/vfs_bio.c 1 Nov 2006 23:18:47 -0000
@@ -706,11 +706,13 @@ bwrite(struct buf *bp)
          * Remember buffer type, to switch on it later. If the write was
          * synchronous, but the file system was mounted with MNT_ASYNC,
          * convert it to a delayed write.
+         * FUA writes remain synchronous.
          * XXX note that this relies on delayed tape writes being converted
          * to async, not sync writes (which is safe, but ugly).
          */
         sync = !ISSET(bp->b_flags, B_ASYNC);
-        if (sync && mp != NULL && ISSET(mp->mnt_flag, MNT_ASYNC)) {
+        if (sync && mp != NULL && ISSET(mp->mnt_flag, MNT_ASYNC)
+            && !ISSET(bp->b_flags, B_FUA | B_FUA_NV)) {
                 bdwrite(bp);
                 return (0);
         }
@@ -1090,6 +1092,7 @@ start:
                 allocbuf(bp, size, preserve);
         }
         BIO_SETPRIO(bp, BPRIO_DEFAULT);
+        CLR(bp->b_flags, B_FUA | B_FUA_NV);
         return (bp);
 }
 
@@ -1830,6 +1833,10 @@ nestiobuf_setup(struct buf *mbp, struct
         KASSERT(mbp->b_bcount >= offset + size);
         bp->b_vp = vp;
         bp->b_flags = B_BUSY | B_CALL | B_ASYNC | b_read;
+        if (mbp->b_flags & B_FUA)
+                bp->b_flags |= B_FUA;
+        if (mbp->b_flags & B_FUA_NV)
+                bp->b_flags |= B_FUA_NV;
         bp->b_iodone = nestiobuf_iodone;
         bp->b_data = mbp->b_data + offset;
         bp->b_resid = bp->b_bcount = size;
Index: sys/kern/vfs_syscalls.c
===================================================================
RCS file: /cvsroot/src/sys/kern/vfs_syscalls.c,v
retrieving revision 1.270
diff -p -u -r1.270 vfs_syscalls.c
--- sys/kern/vfs_syscalls.c 13 Sep 2006 10:07:42 -0000 1.270
+++ sys/kern/vfs_syscalls.c 1 Nov 2006 23:18:50 -0000
@@ -2265,6 +2265,83 @@ sys_preadv(struct lwp *l, void *v, regis
 }
 
 /*
+ * Positional scatter read system call.
+ */
+int
+sys_preadx(struct lwp *l, void *v, register_t *retval)
+{
+        struct sys_preadx_args /* {
+                syscallarg(int) fd;
+                syscallarg(int) flags;
+                syscallarg(const struct iovec *) iovp;
+                syscallarg(int) iovcnt;
+                syscallarg(off_t) offset;
+        } */ *uap = v;
+        struct proc *p = l->l_proc;
+        struct filedesc *fdp = p->p_fd;
+        struct file *fp;
+        struct vnode *vp;
+        off_t offset, *ofp;
+        int error, fd, flags, f1;
+
+        fd = SCARG(uap, fd);
+        flags = SCARG(uap, flags);
+        f1 = 0;
+
+        if ((fp = fd_getfile(fdp, fd)) == NULL)
+                return (EBADF);
+
+        if ((fp->f_flag & FREAD) == 0) {
+                simple_unlock(&fp->f_slock);
+                return (EBADF);
+        }
+
+        FILE_USE(fp);
+
+        vp = (struct vnode *)fp->f_data;
+        if (fp->f_type != DTYPE_VNODE || vp->v_type == VFIFO) {
+                error = ESPIPE;
+                goto out;
+        }
+
+        offset = SCARG(uap, offset);
+
+        if (flags & PXIO_FPOINTER) {
+                /* Ok, someone wants to update the file pointer. Oh well. */
+                ofp = &fp->f_offset;
+                f1 = FOF_UPDATE_OFFSET;
+        } else {
+                offset = SCARG(uap, offset);
+                ofp = &offset;
+                f1 = 0;
+                /*
+                 * XXX This works because no file systems actually
+                 * XXX take any action on the seek operation.
+                 */
+                if ((error = VOP_SEEK(vp, fp->f_offset, offset, fp->f_cred)))
+                        goto out;
+        }
+
+        if (flags & PXIO_FUA)
+                f1 |= FOF_FUA;
+        if (flags & PXIO_FUA_NV)
+                f1 |= FOF_FUA_NV;
+
+#if 0 /* Not yet */
+        if (flags & PXIO_DIRECT)
+                f1 |= XXX;
+#endif
+
+        /* dofilereadv() will unuse the descriptor for us */
+        return (dofilereadv(l, fd, fp, SCARG(uap, iovp), SCARG(uap, iovcnt),
+            ofp, f1, retval));
+
+ out:
+        FILE_UNUSE(fp, l);
+        return (error);
+}
+
+/*
  * Positional write system call.
  */
 int
@@ -2371,6 +2448,81 @@ sys_pwritev(struct lwp *l, void *v, regi
 }
 
 /*
+ * Positional gather write with flags system call.
+ */
+int
+sys_pwritex(struct lwp *l, void *v, register_t *retval)
+{
+        struct sys_pwritex_args /* {
+                syscallarg(int) fd;
+                syscallarg(int) flags;
+                syscallarg(const struct iovec *) iovp;
+                syscallarg(int) iovcnt;
+                syscallarg(off_t) offset;
+        } */ *uap = v;
+        struct proc *p = l->l_proc;
+        struct filedesc *fdp = p->p_fd;
+        struct file *fp;
+        struct vnode *vp;
+        off_t offset, *ofp;
+        int error, fd, flags, f1;
+
+        fd = SCARG(uap, fd);
+        flags = SCARG(uap, flags);
+        f1 = 0;
+
+        if ((fp = fd_getfile(fdp, fd)) == NULL)
+                return (EBADF);
+
+        if ((fp->f_flag & FWRITE) == 0) {
+                simple_unlock(&fp->f_slock);
+                return (EBADF);
+        }
+
+        FILE_USE(fp);
+
+        vp = (struct vnode *)fp->f_data;
+        if (fp->f_type != DTYPE_VNODE || vp->v_type == VFIFO) {
+                error = ESPIPE;
+                goto out;
+        }
+
+        if (flags & PXIO_FPOINTER) {
+                /* Ok, someone wants to update the file pointer. Oh well. */
+                ofp = &fp->f_offset;
+                f1 = FOF_UPDATE_OFFSET;
+        } else {
+                offset = SCARG(uap, offset);
+                ofp = &offset;
+                f1 = 0;
+                /*
+                 * XXX This works because no file systems actually
+                 * XXX take any action on the seek operation.
+                 */
+                if ((error = VOP_SEEK(vp, fp->f_offset, offset, fp->f_cred)))
+                        goto out;
+        }
+
+        if (flags & PXIO_FUA)
+                f1 |= FOF_FUA;
+        if (flags & PXIO_FUA_NV)
+                f1 |= FOF_FUA_NV;
+
+#if 0 /* Not yet */
+        if (flags & PXIO_DIRECT)
+                f1 |= XXX;
+#endif
+
+        /* dofilewritev() will unuse the descriptor for us */
+        return (dofilewritev(l, fd, fp, SCARG(uap, iovp), SCARG(uap, iovcnt),
+            ofp, f1, retval));
+
+ out:
+        FILE_UNUSE(fp, l);
+        return (error);
+}
+
+/*
  * Check access permissions.
  */
 int
Index: sys/kern/vfs_vnops.c
===================================================================
RCS file: /cvsroot/src/sys/kern/vfs_vnops.c,v
retrieving revision 1.125
diff -p -u -r1.125 vfs_vnops.c
--- sys/kern/vfs_vnops.c 5 Oct 2006 14:48:32 -0000 1.125
+++ sys/kern/vfs_vnops.c 1 Nov 2006 23:18:50 -0000
@@ -491,14 +491,29 @@ vn_read(struct file *fp, off_t *offset,
                 ioflag |= IO_SYNC;
         if (fp->f_flag & FALTIO)
                 ioflag |= IO_ALTSEMANTICS;
+<<<<<<< vfs_vnops.c
+        if (flags & FOF_FUA_NV)
+                ioflag |= IO_FUA_NV | IO_SYNC;
+        if (flags & FOF_FUA)
+                ioflag |= IO_FUA | IO_SYNC;
+=======
         if (fp->f_flag & FDIRECT)
                 ioflag |= IO_DIRECT;
+>>>>>>> 1.125
         vn_lock(vp, LK_SHARED | LK_RETRY);
         uio->uio_offset = *offset;
         count = uio->uio_resid;
+        if (flags & FOF_FUA) {
+                /* Purge any existing pages from the uvm cache */
+                error = VOP_PUTPAGES(vp, *offset, *offset + count,
+                    PGO_FREE | PGO_SYNCIO | PGO_CLEANIT);
+                if (error)
+                        goto out;
+        }
         error = VOP_READ(vp, uio, ioflag, cred);
         if (flags & FOF_UPDATE_OFFSET)
                 *offset += count - uio->uio_resid;
+out:
         VOP_UNLOCK(vp, 0);
         return (error);
 }
@@ -526,8 +541,15 @@ vn_write(struct file *fp, off_t *offset,
                 ioflag |= IO_DSYNC;
         if (fp->f_flag & FALTIO)
                 ioflag |= IO_ALTSEMANTICS;
+<<<<<<< vfs_vnops.c
+        if (flags & FOF_FUA_NV)
+                ioflag |= IO_FUA_NV | IO_SYNC;
+        if (flags & FOF_FUA)
+                ioflag |= IO_FUA | IO_SYNC;
+=======
         if (fp->f_flag & FDIRECT)
                 ioflag |= IO_DIRECT;
+>>>>>>> 1.125
         mp = NULL;
         if (vp->v_type != VCHR &&
             (error = vn_start_write(vp, &mp, V_WAIT | V_PCATCH)) != 0)
Index: sys/miscfs/genfs/genfs_vnops.c
===================================================================
RCS file: /cvsroot/src/sys/miscfs/genfs/genfs_vnops.c,v
retrieving revision 1.130
diff -p -u -r1.130 genfs_vnops.c
--- sys/miscfs/genfs/genfs_vnops.c 5 Oct 2006 14:48:32 -0000 1.130
+++ sys/miscfs/genfs/genfs_vnops.c 1 Nov 2006 23:18:53 -0000
@@ -1539,8 +1539,15 @@ genfs_do_io(struct vnode *vp, off_t off,
         mbp->b_bufsize = len;
         mbp->b_data = (void *)kva;
         mbp->b_resid = mbp->b_bcount = bytes;
+<<<<<<< genfs_vnops.c
+        mbp->b_flags = B_BUSY|B_WRITE|B_AGE| (async ? (B_CALL|B_ASYNC) : 0);
+        if (flags & PGO_FUA)
+                mbp->b_flags |= B_FUA;
+        mbp->b_iodone = uvm_aio_biodone;
+=======
         mbp->b_flags = B_BUSY | brw | B_AGE | (async ? (B_CALL | B_ASYNC) : 0);
         mbp->b_iodone = iodone;
+>>>>>>> 1.130
         mbp->b_vp = vp;
         if (curproc == uvm.pagedaemon_proc)
                 BIO_SETPRIO(mbp, BPRIO_TIMELIMITED);
Index: sys/miscfs/specfs/spec_vnops.c
===================================================================
RCS file: /cvsroot/src/sys/miscfs/specfs/spec_vnops.c,v
retrieving revision 1.92
diff -p -u -r1.92 spec_vnops.c
--- sys/miscfs/specfs/spec_vnops.c 30 Sep 2006 21:00:13 -0000 1.92
+++ sys/miscfs/specfs/spec_vnops.c 1 Nov 2006 23:18:53 -0000
@@ -303,6 +303,9 @@ spec_open(v)
                         return error;
                 if (!(*d_ioctl)(vp->v_rdev, DIOCGPART, (caddr_t)&pi, FREAD, curlwp))
                         vp->v_size = (voff_t)pi.disklab->d_secsize * pi.part->p_size;
+                vp->v_spec_cap = 0;
+                (*d_ioctl)(vp->v_rdev, DIOCGCAPS, (caddr_t)&vp->v_spec_cap, FREAD,
+                    curlwp);
                 return 0;
         }
@@ -426,6 +429,9 @@ spec_write(v)
         switch (vp->v_type) {
 
         case VCHR:
+                if ((ap->a_ioflag & (IO_FUA | IO_FUA_NV))
+                    && (~vp->v_spec_cap & V_CAP_CAN_FUA))
+                        return (ENODEV);
                 VOP_UNLOCK(vp, 0);
                 cdev = cdevsw_lookup(vp->v_rdev);
                 if (cdev != NULL)
@@ -440,6 +446,9 @@ spec_write(v)
                         return (0);
                 if (uio->uio_offset < 0)
                         return (EINVAL);
+                if ((ap->a_ioflag & (IO_FUA | IO_FUA_NV))
+                    && (~vp->v_spec_cap & V_CAP_CAN_FUA))
+                        return (ENODEV);
                 bsize = BLKDEV_IOSIZE;
                 bdev = bdevsw_lookup(vp->v_rdev);
                 if (bdev != NULL &&
@@ -468,7 +477,14 @@ spec_write(v)
                         if (error)
                                 brelse(bp);
                         else {
-                                if (n + on == bsize)
+                                if (ap->a_ioflag & (IO_FUA | IO_FUA_NV)) {
+                                        if (ap->a_ioflag & IO_FUA)
+                                                SET(bp->b_flags, B_FUA);
+                                        if (ap->a_ioflag & IO_FUA_NV)
+                                                SET(bp->b_flags, B_FUA_NV);
+                                        SET(bp->b_flags, B_SYNC);
+                                        bwrite(bp);
+                                } else if (n + on == bsize)
                                         bawrite(bp);
                                 else
                                         bdwrite(bp);
@@ -662,6 +678,13 @@ spec_strategy(v)
 
         error = 0;
         bp->b_dev = vp->v_rdev;
+        if ((bp->b_flags & (B_FUA | B_FUA_NV))
+            && (~vp->v_spec_cap & V_CAP_CAN_FUA)) {
+                /* Reject B_FUA if device can't do it */
+                error = ENODEV;
+                goto error_out;
+        }
+
         if (!(bp->b_flags & B_READ) &&
             (LIST_FIRST(&bp->b_dep)) != NULL && bioops.io_start)
                 (*bioops.io_start)(bp);
@@ -686,16 +709,18 @@ spec_strategy(v)
                 SPEC_COW_UNLOCK(vp->v_specinfo, s);
         }
 
-        if (error) {
-                bp->b_error = error;
-                bp->b_flags |= B_ERROR;
-                biodone(bp);
-                return (error);
-        }
+        if (error)
+                goto error_out;
 
         DEV_STRATEGY(bp);
 
         return (0);
+
+error_out:
+        bp->b_error = error;
+        bp->b_flags |= B_ERROR;
+        biodone(bp);
+        return (error);
 }
 
 int
Index: sys/miscfs/specfs/specdev.h
===================================================================
RCS file: /cvsroot/src/sys/miscfs/specfs/specdev.h,v
retrieving revision 1.30
diff -p -u -r1.30 specdev.h
--- sys/miscfs/specfs/specdev.h 14 May 2006 21:32:21 -0000 1.30
+++ sys/miscfs/specfs/specdev.h 1 Nov 2006 23:18:53 -0000
@@ -49,6 +49,7 @@ struct specinfo {
         struct  vnode *si_specnext;
         struct  mount *si_mountpoint;
         dev_t   si_rdev;
+        int     si_cap;
         struct  lockf *si_lockf;
         struct simplelock si_cow_slock;
         SLIST_HEAD(, spec_cow_entry) si_cow_head;
@@ -63,6 +64,7 @@ struct specinfo {
 #define v_specnext      v_specinfo->si_specnext
 #define v_speclockf     v_specinfo->si_lockf
 #define v_specmountpoint v_specinfo->si_mountpoint
+#define v_spec_cap      v_specinfo->si_cap
 #define v_spec_cow_slock v_specinfo->si_cow_slock
 #define v_spec_cow_head v_specinfo->si_cow_head
 #define v_spec_cow_req  v_specinfo->si_cow_req
@@ -81,6 +83,14 @@ struct specinfo {
 } while (/*CONSTCOND*/0)
 
 /*
+ * Device capabilities
+ *
+ * spec_open clears capabilities then calls the device's ioctl routine
+ * (via DIOCGCAPS) to fill in supported/required capabilities.
+ */
+#define V_CAP_CAN_FUA   0x0001  /* Device honors FUA flag */
+
+/*
  * Special device management
  */
 #define SPECHSZ 64
Index: sys/sys/buf.h
===================================================================
RCS file: /cvsroot/src/sys/sys/buf.h,v
retrieving revision 1.89
diff -p -u -r1.89 buf.h
--- sys/sys/buf.h 10 Sep 2006 06:35:42 -0000 1.89
+++ sys/sys/buf.h 1 Nov 2006 23:18:54 -0000
@@ -213,11 +213,14 @@ do { \
 #define B_WRITE         0x00000000      /* Write buffer (pseudo flag). */
 #define B_XXX           0x02000000      /* Debugging flag. */
 #define B_VFLUSH        0x04000000      /* Buffer is being synced. */
+#define B_FUA           0x08000000      /* Force Unit Access on i/o. */
+#define B_FUA_NV        0x10000000      /* Force Unit Access on i/o. */
 
 #define BUF_FLAGBITS \
     "\20\1AGE\3ASYNC\4BAD\5BUSY\6SCANNED\7CALL\10DELWRI" \
     "\11DIRTY\12DONE\14ERROR\15GATHERED\16INVAL\17LOCKED\20NOCACHE" \
-    "\22CACHE\23PHYS\24RAW\25READ\26TAPE\30WANTED\32XXX\33VFLUSH"
+    "\22CACHE\23PHYS\24RAW\25READ\26TAPE\30WANTED\32XXX\33VFLUSH" \
+    "\34FUA\35FUA_NV"
 
 
 /*
Index: sys/sys/dkio.h
===================================================================
RCS file: /cvsroot/src/sys/sys/dkio.h,v
retrieving revision 1.12
diff -p -u -r1.12 dkio.h
--- sys/sys/dkio.h 26 Dec 2005 10:36:47 -0000 1.12
+++ sys/sys/dkio.h 1 Nov 2006 23:18:54 -0000
@@ -98,4 +98,7 @@
 #define DIOCGSTRATEGY   _IOR('d', 125, struct disk_strategy)
 #define DIOCSSTRATEGY   _IOW('d', 126, struct disk_strategy)
 
+        /* device capabilities, enumerated in specdev.h */
+#define DIOCGCAPS       _IOR('d', 127, int)
+
 #endif /* _SYS_DKIO_H_ */
Index: sys/sys/file.h
===================================================================
RCS file: /cvsroot/src/sys/sys/file.h,v
retrieving revision 1.56
diff -p -u -r1.56 file.h
--- sys/sys/file.h 14 May 2006 21:38:18 -0000 1.56
+++ sys/sys/file.h 1 Nov 2006 23:18:55 -0000
@@ -149,6 +149,8 @@ do { \
  * Flags for fo_read and fo_write.
  */
 #define FOF_UPDATE_OFFSET       0x01    /* update the file offset */
+#define FOF_FUA                 0x02    /* operation req FUA */
+#define FOF_FUA_NV              0x04    /* operation req FUA */
 
 LIST_HEAD(filelist, file);
 extern struct filelist  filehead;       /* head of list of open files */
Index: sys/sys/syscall.h
===================================================================
RCS file: /cvsroot/src/sys/sys/syscall.h,v
retrieving revision 1.174
diff -p -u -r1.174 syscall.h
--- sys/sys/syscall.h 1 Sep 2006 21:04:45 -0000 1.174
+++ sys/sys/syscall.h 1 Nov 2006 23:18:55 -0000
@@ -1,4 +1,4 @@
-/* $NetBSD: syscall.h,v 1.174 2006/09/01 21:04:45 matt Exp $ */
+/* $NetBSD$ */
 
 /*
  * System call numbers.
@@ -1100,6 +1100,12 @@
 /* syscall: "__fhstat40" ret: "int" args: "const void *" "size_t" "struct stat *" */
 #define SYS___fhstat40  398
 
-#define SYS_MAXSYSCALL  399
+/* syscall: "pwritex" ret: "ssize_t" args: "int" "int" "const struct iovec *" "int" "off_t" */
+#define SYS_pwritex     399
+
+/* syscall: "preadx" ret: "ssize_t" args: "int" "int" "const struct iovec *" "int" "off_t" */
+#define SYS_preadx      400
+
+#define SYS_MAXSYSCALL  401
 #define SYS_NSYSENT     512
 #endif /* _SYS_SYSCALL_H_ */
Index: sys/sys/syscallargs.h
===================================================================
RCS file: /cvsroot/src/sys/sys/syscallargs.h,v
retrieving revision 1.156
diff -p -u -r1.156 syscallargs.h
--- sys/sys/syscallargs.h 1 Sep 2006 21:04:45 -0000 1.156
+++ sys/sys/syscallargs.h 1 Nov 2006 23:18:57 -0000
@@ -1,4 +1,4 @@
-/* $NetBSD: syscallargs.h,v 1.156 2006/09/01 21:04:45 matt Exp $ */
+/* $NetBSD$ */
 
 /*
  * System call argument lists.
@@ -1734,6 +1734,22 @@ struct sys___fhstat40_args {
         syscallarg(struct stat *) sb;
 };
 
+struct sys_pwritex_args {
+        syscallarg(int) fd;
+        syscallarg(int) flags;
+        syscallarg(const struct iovec *) iovp;
+        syscallarg(int) iovcnt;
+        syscallarg(off_t) offset;
+};
+
+struct sys_preadx_args {
+        syscallarg(int) fd;
+        syscallarg(int) flags;
+        syscallarg(const struct iovec *) iovp;
+        syscallarg(int) iovcnt;
+        syscallarg(off_t) offset;
+};
+
 /*
  * System call prototypes.
  */
@@ -2445,4 +2461,8 @@ int sys___fhstatvfs140(struct lwp *, voi
 
 int     sys___fhstat40(struct lwp *, void *, register_t *);
 
+int     sys_pwritex(struct lwp *, void *, register_t *);
+
+int     sys_preadx(struct lwp *, void *, register_t *);
+
 #endif /* _SYS_SYSCALLARGS_H_ */
Index: sys/sys/uio.h
===================================================================
RCS file: /cvsroot/src/sys/sys/uio.h,v
retrieving revision 1.34
diff -p -u -r1.34 uio.h
--- sys/sys/uio.h 1 Mar 2006 12:38:32 -0000 1.34
+++ sys/sys/uio.h 1 Nov 2006 23:18:57 -0000
@@ -111,7 +111,9 @@ void uio_setup_sysspace(struct uio *);
 __BEGIN_DECLS
 #if defined(_NETBSD_SOURCE)
 ssize_t preadv(int, const struct iovec *, int, off_t);
+ssize_t preadx(int, int, const struct iovec *, int, off_t);
 ssize_t pwritev(int, const struct iovec *, int, off_t);
+ssize_t pwritex(int, int, const struct iovec *, int, off_t);
 #endif /* _NETBSD_SOURCE */
 ssize_t readv(int, const struct iovec *, int);
 ssize_t writev(int, const struct iovec *, int);
@@ -120,4 +122,10 @@ __END_DECLS
 int ureadc(int, struct uio *);
 #endif /* !_KERNEL */
 
+#define PXIO_FUA        0x0001  /* Assert FUA for i/o */
+#define PXIO_FUA_NV     0x0002  /* Assert FUA_NV for i/o */
+#define PXIO_FPOINTER   0x0004  /* Use file pointer, not offset */
+/* Following not implemented yet */
+#define PXIO_DIRECT     0x0008  /* Assert O_DIRECT for i/o */
+
 #endif /* !_SYS_UIO_H_ */
Index: sys/sys/vnode.h
===================================================================
RCS file: /cvsroot/src/sys/sys/vnode.h,v
retrieving revision 1.156
diff -p -u -r1.156 vnode.h
--- sys/sys/vnode.h 5 Oct 2006 14:48:33 -0000 1.156
+++ sys/sys/vnode.h 1 Nov 2006 23:18:58 -0000
@@ -234,7 +234,12 @@ struct vattr {
 #define IO_ALTSEMANTICS 0x00400         /* use alternate i/o semantics */
 #define IO_NORMAL       0x00800         /* operate on regular data */
 #define IO_EXT          0x01000         /* operate on extended attributes */
+<<<<<<< vnode.h
+#define IO_FUA          0x02000         /* Set FUA on i/o */
+#define IO_FUA_NV       0x04000         /* Set FUA_NV on i/o */
+=======
 #define IO_DIRECT       0x02000         /* direct I/O hint */
+>>>>>>> 1.156
 #define IO_ADV_MASK     0x00003         /* access pattern hint */
 
 #define IO_ADV_SHIFT    0
Index: sys/ufs/ext2fs/ext2fs_readwrite.c
===================================================================
RCS file: /cvsroot/src/sys/ufs/ext2fs/ext2fs_readwrite.c,v
retrieving revision 1.43
diff -p -u -r1.43 ext2fs_readwrite.c
--- sys/ufs/ext2fs/ext2fs_readwrite.c 14 May 2006 21:32:21 -0000 1.43
+++ sys/ufs/ext2fs/ext2fs_readwrite.c 1 Nov 2006 23:18:58 -0000
@@ -330,17 +330,20 @@ ext2fs_write(void *v)
                  * XXXUBC simplistic async flushing.
                  */
 
-                if (!async && oldoff >> 16 != uio->uio_offset >> 16) {
+                if (!async && oldoff >> 16 != uio->uio_offset >> 16
+                    && ((ioflag & IO_FUA) == 0)) {
                         simple_lock(&vp->v_interlock);
                         error = VOP_PUTPAGES(vp, (oldoff >> 16) << 16,
                             (uio->uio_offset >> 16) << 16, PGO_CLEANIT);
                 }
         }
         if (error == 0 && ioflag & IO_SYNC) {
+                int f;
+                f = (ioflag & IO_FUA) ? PGO_FUA : 0;
+                f |= PGO_CLEANIT | PGO_SYNCIO;
                 simple_lock(&vp->v_interlock);
                 error = VOP_PUTPAGES(vp, trunc_page(oldoff),
-                    round_page(blkroundup(fs, uio->uio_offset)),
-                    PGO_CLEANIT | PGO_SYNCIO);
+                    round_page(blkroundup(fs, uio->uio_offset)), f);
         }
 
         goto out;
@@ -376,9 +379,12 @@ ext2fs_write(void *v)
                         extended = 1;
                 }
 
+                if (ioflag & IO_FUA)
+                        bp->b_flags |= B_FUA;
                 if (ioflag & IO_SYNC)
                         (void)bwrite(bp);
-                else if (xfersize + blkoffset == fs->e2fs_bsize)
+                else if ((xfersize + blkoffset == fs->e2fs_bsize)
+                    || (ioflag & IO_FUA))
                         bawrite(bp);
                 else
                         bdwrite(bp);
Index: sys/ufs/lfs/lfs_vfsops.c
===================================================================
RCS file: /cvsroot/src/sys/ufs/lfs/lfs_vfsops.c,v
retrieving revision 1.222
diff -p -u -r1.222 lfs_vfsops.c
--- sys/ufs/lfs/lfs_vfsops.c 4 Oct 2006 15:56:46 -0000 1.222
+++ sys/ufs/lfs/lfs_vfsops.c 1 Nov 2006 23:19:00 -0000
@@ -1651,6 +1651,8 @@ lfs_gop_write(struct vnode *vp, struct v
         mbp->b_data = (void *)kva;
         mbp->b_resid = mbp->b_bcount = bytes;
         mbp->b_flags = B_BUSY|B_WRITE|B_AGE|B_CALL;
+        if (flags & PGO_FUA)
+                mbp->b_flags |= B_FUA;
         mbp->b_iodone = uvm_aio_biodone;
         mbp->b_vp = vp;
 
@@ -1715,6 +1717,8 @@ lfs_gop_write(struct vnode *vp, struct v
                             (vaddr_t)(offset - pg->offset);
                         bp->b_resid = bp->b_bcount = iobytes;
                         bp->b_flags = B_BUSY|B_WRITE|B_CALL;
+                        if (flags & PGO_FUA)
+                                bp->b_flags |= B_FUA;
                         bp->b_iodone = uvm_aio_biodone1;
                 }
 
Index: sys/ufs/ufs/ufs_readwrite.c
===================================================================
RCS file: /cvsroot/src/sys/ufs/ufs/ufs_readwrite.c,v
retrieving revision 1.70
diff -p -u -r1.70 ufs_readwrite.c
--- sys/ufs/ufs/ufs_readwrite.c 5 Oct 2006 14:48:33 -0000 1.70
+++ sys/ufs/ufs/ufs_readwrite.c 1 Nov 2006 23:19:00 -0000
@@ -277,7 +277,7 @@ WRITE(void *v)
                 return (0);
 
         flags = ioflag & IO_SYNC ? B_SYNC : 0;
-        async = vp->v_mount->mnt_flag & MNT_ASYNC;
+        async = (vp->v_mount->mnt_flag & MNT_ASYNC);
         origoff = uio->uio_offset;
         resid = uio->uio_resid;
         osize = ip->i_size;
@@ -400,8 +400,13 @@ WRITE(void *v)
                  * XXXUBC simplistic async flushing.
                  */
 
+<<<<<<< ufs_readwrite.c
+                if (!async && oldoff >> 16 != uio->uio_offset >> 16
+                    && ((ioflag & IO_FUA) == 0)) {
+=======
 #ifndef LFS_READWRITE
                 if (!async && oldoff >> 16 != uio->uio_offset >> 16) {
+>>>>>>> 1.70
                         simple_lock(&vp->v_interlock);
                         error = VOP_PUTPAGES(vp, (oldoff >> 16) << 16,
                             (uio->uio_offset >> 16) << 16, PGO_CLEANIT);
@@ -411,10 +416,12 @@ WRITE(void *v)
 #endif
         }
         if (error == 0 && ioflag & IO_SYNC) {
+                int f;
+                f = (ioflag & IO_FUA) ? PGO_FUA : 0;
+                f |= PGO_CLEANIT | PGO_SYNCIO;
                 simple_lock(&vp->v_interlock);
                 error = VOP_PUTPAGES(vp, trunc_page(origoff & fs->fs_bmask),
-                    round_page(blkroundup(fs, uio->uio_offset)),
-                    PGO_CLEANIT | PGO_SYNCIO);
+                    round_page(blkroundup(fs, uio->uio_offset)), f);
         }
 
         goto out;
@@ -465,15 +472,18 @@ WRITE(void *v)
                         brelse(bp);
                         break;
                 }
+                if (ioflag & IO_FUA)
+                        bp->b_flags |= B_FUA;
 #ifdef LFS_READWRITE
                 (void)VOP_BWRITE(bp);
                 lfs_reserve(fs, vp, NULL,
                     -btofsb(fs, (NIADDR + 1) << fs->lfs_bshift));
                 need_unreserve = FALSE;
 #else
-                if (ioflag & IO_SYNC)
+                if (ioflag & IO_SYNC) {
                         (void)bwrite(bp);
-                else if (xfersize + blkoffset == fs->fs_bsize)
+                } else if ((xfersize + blkoffset == fs->fs_bsize)
+                    || (ioflag & IO_FUA))
                         bawrite(bp);
                 else
                         bdwrite(bp);
Index: sys/uvm/uvm_pager.h
===================================================================
RCS file: /cvsroot/src/sys/uvm/uvm_pager.h,v
retrieving revision 1.34
diff -p -u -r1.34 uvm_pager.h
--- sys/uvm/uvm_pager.h 22 Feb 2006 22:28:18 -0000 1.34
+++ sys/uvm/uvm_pager.h 1 Nov 2006 23:19:00 -0000
@@ -162,6 +162,8 @@ struct uvm_pagerops {
 #define PGO_PASTEOF     0x400   /* allow allocation of pages past EOF */
 #define PGO_NOBLOCKALLOC 0x800  /* backing block allocation is not needed */
 #define PGO_NOTIMESTAMP 0x1000  /* don't mark object accessed/modified */
+#define PGO_FUA         0x2000  /* Pass FUA down on I/O calls */
+#define PGO_FUA_NV      0x4000  /* Pass FUA_NV down on I/O calls */
 
 /* page we are not interested in getting */
 #define PGO_DONTCARE ((struct vm_page *) -1L)   /* [get only] */
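
The diffs above carry a FUA request through four flag namespaces: PXIO_* at the system call, FOF_* at the file layer, IO_* at the vnode layer, and B_* on the buffer. A condensed sketch of that chain follows, as a reading aid rather than kernel code. The flag values are copied from the diffs, except IO_SYNC, whose value here is only a placeholder. The three helper functions are hypothetical names that compress logic the patch spreads across sys_pwritex(), vn_write(), and sdwrite().

```c
/* Flag values copied from the diffs. */
#define PXIO_FUA          0x0001
#define PXIO_FUA_NV       0x0002
#define PXIO_FPOINTER     0x0004
#define FOF_UPDATE_OFFSET 0x01
#define FOF_FUA           0x02
#define FOF_FUA_NV        0x04
#define IO_FUA            0x02000
#define IO_FUA_NV         0x04000
#define B_WRITE           0x00000000
#define B_FUA             0x08000000
#define B_FUA_NV          0x10000000

/* Placeholder for this sketch only; the real value is in <sys/vnode.h>. */
#define IO_SYNC           0x00040

/* System call layer: mirrors the flag handling in sys_pwritex(). */
int
pxio_to_fof(int pxio)
{
    int fof = 0;

    if (pxio & PXIO_FPOINTER)
        fof |= FOF_UPDATE_OFFSET;
    if (pxio & PXIO_FUA)
        fof |= FOF_FUA;
    if (pxio & PXIO_FUA_NV)
        fof |= FOF_FUA_NV;
    return fof;
}

/* File layer: mirrors vn_write(), where either FUA flavor implies IO_SYNC. */
int
fof_to_ioflag(int fof)
{
    int ioflag = 0;

    if (fof & FOF_FUA_NV)
        ioflag |= IO_FUA_NV | IO_SYNC;
    if (fof & FOF_FUA)
        ioflag |= IO_FUA | IO_SYNC;
    return ioflag;
}

/* Driver entry: mirrors sdwrite()/ldwrite(), which start from B_WRITE. */
int
ioflag_to_bflags(int base, int ioflag)
{
    int f = base;

    if (ioflag & IO_FUA)
        f |= B_FUA;
    if (ioflag & IO_FUA_NV)
        f |= B_FUA_NV;
    return f;
}
```

For example, `ioflag_to_bflags(B_WRITE, fof_to_ioflag(pxio_to_fof(PXIO_FUA)))` yields `B_FUA`. Note that the conflict markers in vnode.h show IO_FUA and IO_DIRECT both claiming 0x02000, so one of them needs a new value when the conflicts are sorted out.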