Source-Changes-HG archive


[src/tls-maxphys]: src Initial snapshot of work to eliminate 64K MAXPHYS. Ba...



details:   https://anonhg.NetBSD.org/src/rev/ad9ed21278de
branches:  tls-maxphys
changeset: 852911:ad9ed21278de
user:      tls <tls%NetBSD.org@localhost>
date:      Wed Sep 12 06:15:31 2012 +0000

description:
Initial snapshot of work to eliminate 64K MAXPHYS.  Basically works for
physio (I/O to raw devices); needs more doing to get it going with the
filesystems, but it shouldn't damage data.

All work's been done on amd64 so far.  Not hard to add support to other
ports.  If others want to pitch in, one very helpful thing would be to
sort out when and how IDE disks can do 128K or larger transfers, and
adjust the various PCI IDE (or at least ahcisata) drivers and wd.c
accordingly -- it would make testing much easier.  Another very helpful
thing would be to implement a smart minphys() for RAIDframe along the
lines detailed in the MAXPHYS-NOTES file.
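
As a rough illustration of the RAIDframe suggestion (the notes ask for a
per-unit minphys that sums the maxphys values of the data, not parity,
components), here is a minimal standalone sketch.  It is not RAIDframe
code: struct component, its fields, and raid_unit_maxphys() are
hypothetical names invented for this example, and treating parity as a
fixed set of components is a simplification.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one RAID component; not a RAIDframe type. */
struct component {
	size_t	maxphys;	/* largest transfer the underlying disk accepts */
	int	is_parity;	/* nonzero if this component holds parity */
};

/*
 * Sum the transfer limits of the data components only, so that one
 * full-size request to the set can hand each data disk a transfer of
 * its own maximum size.
 */
static size_t
raid_unit_maxphys(const struct component *c, size_t n)
{
	size_t total = 0;

	assert(c != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (!c[i].is_parity)
			total += c[i].maxphys;
	}
	return total;
}
```

With eight data components each limited to 64K, this returns 512K, which
is the figure the notes give for feeding 64K to each of 8 data disks.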

diffstat:

 MAXPHYS-NOTES                          |  76 ++++++++++++++++++++++++++++++++++
 sys/arch/amd64/include/param.h         |   6 +-
 sys/arch/i386/pnpbios/fdc_pnpbios.c    |   7 ++-
 sys/arch/i386/pnpbios/lpt_pnpbios.c    |   8 ++-
 sys/arch/i386/pnpbios/pciide_pnpbios.c |   7 ++-
 sys/arch/i386/pnpbios/pnpbios.c        |   7 ++-
 sys/arch/x68k/include/cdefs.h          |   2 +-
 sys/arch/x68k/include/cpufunc.h        |   2 +-
 sys/arch/x68k/include/ieeefp.h         |   2 +-
 sys/arch/x68k/include/profile.h        |   2 +-
 sys/arch/x68k/include/setjmp.h         |   2 +-
 sys/dev/acpi/acpi.c                    |   7 ++-
 sys/dev/ic/mpt_netbsd.c                |  18 +++----
 sys/dev/ic/mpt_netbsd.h                |   8 +++-
 sys/dev/isa/isa.c                      |   7 ++-
 sys/dev/pci/amr.c                      |  11 +++-
 sys/dev/pci/mlyvar.h                   |   4 +-
 sys/dev/pci/mpt_pci.c                  |   6 +-
 sys/dev/pci/pci.c                      |   7 ++-
 sys/dev/pci/pciide.c                   |   7 ++-
 sys/dev/scsipi/cd.c                    |  12 ++++-
 sys/dev/scsipi/sd.c                    |  12 ++++-
 sys/dev/scsipi/ss.c                    |  13 ++++-
 sys/kern/kern_physio.c                 |  12 ++--
 sys/kern/subr_autoconf.c               |  12 +++-
 sys/kern/subr_disk.c                   |  41 +++++++++++++++++-
 sys/kern/sys_descrip.c                 |   8 ++-
 sys/kern/vfs_vnops.c                   |  22 ++++++++-
 sys/kern/vfs_wapbl.c                   |  10 ++-
 sys/miscfs/genfs/genfs_io.c            |  27 +++++++++--
 sys/sys/device.h                       |  13 +++++-
 sys/sys/disk.h                         |   4 +-
 sys/sys/mount.h                        |   5 +-
 sys/ufs/ffs/ffs_vfsops.c               |  16 ++++++-
 sys/uvm/uvm_io.c                       |   7 +-
 sys/uvm/uvm_map.c                      |  14 ++++-
 sys/uvm/uvm_readahead.c                |  76 ++++++++++++++++-----------------
 sys/uvm/uvm_readahead.h                |  23 ++++++++-
 38 files changed, 390 insertions(+), 133 deletions(-)

diffs (truncated from 1386 to 300 lines):

diff -r f306f12b08cb -r ad9ed21278de MAXPHYS-NOTES
--- /dev/null   Thu Jan 01 00:00:00 1970 +0000
+++ b/MAXPHYS-NOTES     Wed Sep 12 06:15:31 2012 +0000
@@ -0,0 +1,76 @@
+Notes on eliminating fixed (usually 64K) MAXPHYS, for more efficient
+operation both with single disk drives/SSDs (transfers in the 128K-256K
+range of sizes are advantageous for many workloads), and particularly with
+RAID sets (consider a typical 12-disk chassis of 2.5" SAS drives, set up
+as an entirely ordinary P+Q parity RAID array with a single hot spare.  To
+feed 64K transfers to each of the resulting 8 data disks requires 512K
+transfers fed to the RAID controller -- is it any wonder NetBSD performs
+so poorly with such hardware for many workloads?).
+
+The basic approach taken here:
+
+       1) Propagate maximum-transfer size down the device tree at
+          autoconf time.  Drivers take the min of their own
+          transfer-size limitations and their parents' limitations,
+          apply that in their minphys() routines (if they are disk
+          drivers) and propagate it down to their children.
+
+       2) This is just about sufficient, for physio, since once you've
+          got the disk, you can find its minphys routine, and *that*
+          can get access to the device-instance's softc which has the
+          size determined by autoconf.
+
+       3) For filesystem I/O, however, we need to be able to find that
+          maximum transfer size starting not with a device_t but with
+          a disk driver name (or major number) and unit number.
+
+          The "disk" interface within the kernel is extended to
+          let us fish out the dkdevice's minphys routine starting
+          with the data we've got.  We then feed a fake, huge buffer
+          to that minphys and see what we get back.
+
+          This is stashed in the mount point's datastructure and is
+          then available to the filesystem and pager code via
+          vp->v_mount any time you've got a filesystem-backed vnode.
+
+The rest is a "simple" matter of making the necessary MD adjustments
+and figuring out where the rest of the hidden 64K bottlenecks are....
+
+MAXPHYS is retained and is used as a default.  A new MACHINE_MAXPHYS
+must be defined, and is the actual largest transfer any hardware for
+a given port can do, or which the portmaster considers appropriate.
+
+MACHINE_MAXPHYS is used to size some on-stack arrays in the pager code
+so don't go too crazy with it.
+
+==== STATUS ====
+
+All work done on amd64.  Not hard to get it going on other ports.  Every
+top-level bus attachment will need code to clamp transfer sizes
+appropriately; see the PCI or ISA code here, or for an unfortunate
+example of when you have to clamp more than you'd like, the pnpbios code.
+
+Access through physio: done?  Disk drivers other than sd, cd, wd
+will need their minphys functions adjusted like those were, and
+will be limited to MAXPHYS per transfer until they do.
+
+       A notable exception is RAIDframe.  It could benefit immediately
+       but needs something a little more sophisticated done to its
+       minphys -- per-unit, it needs to sum up the maxphyses of the unit's
+       data (not parity!) components and return that value.
+
+Access through filesystems - for read, controlled by uvm readahead code.
+We can stash the ra max size in the ra ctx -- we can get it from v_mount
+in the vnode (the uobj!) *if* we put it into struct mount.  Then we only
+have to do the awful walk-the-device-list crap at mount time.  This likely
+wins!
+
+       Unfortunately, there is still a bottleneck, probably from
+       the pager code (genfs I/O code).  The genfs read/getpages
+       code is repellent and huge.  Haven't even started on it yet.
+
+I have attacked the genfs write path already, but though my printfs
+show the appropriate maxpages value propagates down, the resulting
+stream of I/O requests is 64K.  This needs further investigation:
+with maxcontig now gone from the FFS code, where on earth are we
+still clamping the I/O size?
diff -r f306f12b08cb -r ad9ed21278de sys/arch/amd64/include/param.h
--- a/sys/arch/amd64/include/param.h    Wed Sep 12 02:00:51 2012 +0000
+++ b/sys/arch/amd64/include/param.h    Wed Sep 12 06:15:31 2012 +0000
@@ -1,4 +1,4 @@
-/*     $NetBSD: param.h,v 1.18 2012/04/20 22:23:24 rmind Exp $ */
+/*     $NetBSD: param.h,v 1.18.2.1 2012/09/12 06:15:31 tls Exp $       */
 
 #ifdef __x86_64__
 
@@ -45,9 +45,11 @@
 #define        DEV_BSIZE       (1 << DEV_BSHIFT)
 #define        BLKDEV_IOSIZE   2048
 #ifndef        MAXPHYS
-#define        MAXPHYS         (64 * 1024)     /* max raw I/O transfer size */
+#define        MAXPHYS         (64 * 1024)     /* default I/O transfer size max */
 #endif
 
+#define        MACHINE_MAXPHYS (1024 * 1024)   /* absolute I/O transfer size max */
+
 #define        SSIZE           1               /* initial stack size/NBPG */
 #define        SINCR           1               /* increment of stack/NBPG */
 #ifdef DIAGNOSTIC
diff -r f306f12b08cb -r ad9ed21278de sys/arch/i386/pnpbios/fdc_pnpbios.c
--- a/sys/arch/i386/pnpbios/fdc_pnpbios.c       Wed Sep 12 02:00:51 2012 +0000
+++ b/sys/arch/i386/pnpbios/fdc_pnpbios.c       Wed Sep 12 06:15:31 2012 +0000
@@ -1,4 +1,4 @@
-/*     $NetBSD: fdc_pnpbios.c,v 1.17 2012/02/02 19:42:59 tls Exp $     */
+/*     $NetBSD: fdc_pnpbios.c,v 1.17.6.1 2012/09/12 06:15:32 tls Exp $ */
 
 /*-
  * Copyright (c) 2000 The NetBSD Foundation, Inc.
@@ -34,7 +34,7 @@
  */
 
 #include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: fdc_pnpbios.c,v 1.17 2012/02/02 19:42:59 tls Exp $");
+__KERNEL_RCSID(0, "$NetBSD: fdc_pnpbios.c,v 1.17.6.1 2012/09/12 06:15:32 tls Exp $");
 
 
 
@@ -93,6 +93,9 @@
        fdc->sc_dev = self;
        fdc->sc_ic = aa->ic;
 
+       /* This is really ISA DMA under the covers: clamp max transfer size */
+       self->dv_maxphys = MIN(parent->dv_maxphys, 64 * 1024);
+
        if (pnpbios_io_map(aa->pbt, aa->resc, 0, &fdc->sc_iot,
             &pdc->sc_baseioh)) {
                aprint_error_dev(self, "unable to map I/O space\n");
diff -r f306f12b08cb -r ad9ed21278de sys/arch/i386/pnpbios/lpt_pnpbios.c
--- a/sys/arch/i386/pnpbios/lpt_pnpbios.c       Wed Sep 12 02:00:51 2012 +0000
+++ b/sys/arch/i386/pnpbios/lpt_pnpbios.c       Wed Sep 12 06:15:31 2012 +0000
@@ -1,4 +1,4 @@
-/* $NetBSD: lpt_pnpbios.c,v 1.12 2011/07/01 18:14:15 dyoung Exp $ */
+/* $NetBSD: lpt_pnpbios.c,v 1.12.12.1 2012/09/12 06:15:32 tls Exp $ */
 /*
  * Copyright (c) 1999
  *     Matthias Drochner.  All rights reserved.
@@ -26,7 +26,7 @@
  */
 
 #include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: lpt_pnpbios.c,v 1.12 2011/07/01 18:14:15 dyoung Exp $");
+__KERNEL_RCSID(0, "$NetBSD: lpt_pnpbios.c,v 1.12.12.1 2012/09/12 06:15:32 tls Exp $");
 
 #include <sys/param.h>
 #include <sys/systm.h>
@@ -77,6 +77,10 @@
 
        sc->sc_dev = self;
 
+       /* Lest someone attach a parallel-port SCSI adapter etc:
+         this is really ISA DMA under the covers: clamp max transfer size */
+        self->dv_maxphys = MIN(parent->dv_maxphys, 64 * 1024);
+
        if (pnpbios_io_map(aa->pbt, aa->resc, 0, &sc->sc_iot, &sc->sc_ioh)) {   
                printf(": can't map i/o space\n");
                return;
diff -r f306f12b08cb -r ad9ed21278de sys/arch/i386/pnpbios/pciide_pnpbios.c
--- a/sys/arch/i386/pnpbios/pciide_pnpbios.c    Wed Sep 12 02:00:51 2012 +0000
+++ b/sys/arch/i386/pnpbios/pciide_pnpbios.c    Wed Sep 12 06:15:31 2012 +0000
@@ -1,4 +1,4 @@
-/*     $NetBSD: pciide_pnpbios.c,v 1.30 2012/07/31 15:50:32 bouyer Exp $       */
+/*     $NetBSD: pciide_pnpbios.c,v 1.30.2.1 2012/09/12 06:15:32 tls Exp $      */
 
 /*
  * Copyright (c) 1999 Soren S. Jorvang.  All rights reserved.
@@ -30,7 +30,7 @@
  */
 
 #include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: pciide_pnpbios.c,v 1.30 2012/07/31 15:50:32 bouyer Exp $");
+__KERNEL_RCSID(0, "$NetBSD: pciide_pnpbios.c,v 1.30.2.1 2012/09/12 06:15:32 tls Exp $");
 
 #include <sys/param.h>
 #include <sys/systm.h>
@@ -88,6 +88,9 @@
        int i, drive, size;
        uint8_t idedma_ctl;
 
+       /* Clamp max transfer size - XXX how to do 128K on pciide? */
+       self->dv_maxphys = MIN(parent->dv_maxphys, IDEDMA_BYTE_COUNT_MAX);
+
        sc->sc_wdcdev.sc_atac.atac_dev = self;
 
        aprint_naive(": disk controller\n");
diff -r f306f12b08cb -r ad9ed21278de sys/arch/i386/pnpbios/pnpbios.c
--- a/sys/arch/i386/pnpbios/pnpbios.c   Wed Sep 12 02:00:51 2012 +0000
+++ b/sys/arch/i386/pnpbios/pnpbios.c   Wed Sep 12 06:15:31 2012 +0000
@@ -1,4 +1,4 @@
-/* $NetBSD: pnpbios.c,v 1.71 2011/06/30 20:09:31 wiz Exp $ */
+/* $NetBSD: pnpbios.c,v 1.71.12.1 2012/09/12 06:15:32 tls Exp $ */
 
 /*
  * Copyright (c) 2000 Jason R. Thorpe.  All rights reserved.
@@ -41,7 +41,7 @@
  */
 
 #include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: pnpbios.c,v 1.71 2011/06/30 20:09:31 wiz Exp $");
+__KERNEL_RCSID(0, "$NetBSD: pnpbios.c,v 1.71.12.1 2012/09/12 06:15:32 tls Exp $");
 
 #include <sys/param.h>
 #include <sys/systm.h>
@@ -289,6 +289,9 @@
        aprint_naive("\n");
 
        pnpbios_softc = sc;
+
+       /* We *don't* clamp xfer size here as both PCI and ISA devs
+          may attach beneath us */
        sc->sc_dev = self;
        sc->sc_ic = paa->paa_ic;
 
diff -r f306f12b08cb -r ad9ed21278de sys/arch/x68k/include/cdefs.h
--- a/sys/arch/x68k/include/cdefs.h     Wed Sep 12 02:00:51 2012 +0000
+++ b/sys/arch/x68k/include/cdefs.h     Wed Sep 12 06:15:31 2012 +0000
@@ -1,4 +1,4 @@
-/*     $NetBSD: cdefs.h,v 1.1 1996/05/05 12:17:15 oki Exp $    */
+/*     $NetBSD: cdefs.h,v 1.1.1.1 1996/05/05 12:17:03 oki Exp $        */
 
 #ifndef _MACHINE_CDEFS_H_
 #define _MACHINE_CDEFS_H_
diff -r f306f12b08cb -r ad9ed21278de sys/arch/x68k/include/cpufunc.h
--- a/sys/arch/x68k/include/cpufunc.h   Wed Sep 12 02:00:51 2012 +0000
+++ b/sys/arch/x68k/include/cpufunc.h   Wed Sep 12 06:15:31 2012 +0000
@@ -1,4 +1,4 @@
-/*     $NetBSD: cpufunc.h,v 1.1 1996/05/05 12:17:15 oki Exp $  */
+/*     $NetBSD: cpufunc.h,v 1.1.1.1 1996/05/05 12:17:03 oki Exp $      */
 
 /*
  * Functions to provide access to special cpu instructions.
diff -r f306f12b08cb -r ad9ed21278de sys/arch/x68k/include/ieeefp.h
--- a/sys/arch/x68k/include/ieeefp.h    Wed Sep 12 02:00:51 2012 +0000
+++ b/sys/arch/x68k/include/ieeefp.h    Wed Sep 12 06:15:31 2012 +0000
@@ -1,4 +1,4 @@
-/*     $NetBSD: ieeefp.h,v 1.1 1996/05/05 12:17:14 oki Exp $   */
+/*     $NetBSD: ieeefp.h,v 1.1.1.1 1996/05/05 12:17:03 oki Exp $       */
 
 /* Just use the common m68k definition */
 #include <m68k/ieeefp.h>
diff -r f306f12b08cb -r ad9ed21278de sys/arch/x68k/include/profile.h
--- a/sys/arch/x68k/include/profile.h   Wed Sep 12 02:00:51 2012 +0000
+++ b/sys/arch/x68k/include/profile.h   Wed Sep 12 06:15:31 2012 +0000
@@ -1,3 +1,3 @@
-/*     $NetBSD: profile.h,v 1.1 1996/05/05 12:17:14 oki Exp $  */
+/*     $NetBSD: profile.h,v 1.1.1.1 1996/05/05 12:17:03 oki Exp $      */
 
 #include <m68k/profile.h>
diff -r f306f12b08cb -r ad9ed21278de sys/arch/x68k/include/setjmp.h
--- a/sys/arch/x68k/include/setjmp.h    Wed Sep 12 02:00:51 2012 +0000
+++ b/sys/arch/x68k/include/setjmp.h    Wed Sep 12 06:15:31 2012 +0000
@@ -1,3 +1,3 @@
-/*     $NetBSD: setjmp.h,v 1.1 1996/05/05 12:17:15 oki Exp $   */
+/*     $NetBSD: setjmp.h,v 1.1.1.1 1996/05/05 12:17:03 oki Exp $       */
 
 #include <m68k/setjmp.h>
diff -r f306f12b08cb -r ad9ed21278de sys/dev/acpi/acpi.c
--- a/sys/dev/acpi/acpi.c       Wed Sep 12 02:00:51 2012 +0000
+++ b/sys/dev/acpi/acpi.c       Wed Sep 12 06:15:31 2012 +0000
@@ -1,4 +1,4 @@
-/*     $NetBSD: acpi.c,v 1.254 2012/08/14 14:38:02 jruoho Exp $        */
+/*     $NetBSD: acpi.c,v 1.254.2.1 2012/09/12 06:15:32 tls Exp $       */
 
 /*-
  * Copyright (c) 2003, 2007 The NetBSD Foundation, Inc.
@@ -100,7 +100,7 @@
  */
 
 #include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: acpi.c,v 1.254 2012/08/14 14:38:02 jruoho Exp $");
+__KERNEL_RCSID(0, "$NetBSD: acpi.c,v 1.254.2.1 2012/09/12 06:15:32 tls Exp $");
 
 #include "opt_acpi.h"
 #include "opt_pcifixup.h"
@@ -438,6 +438,9 @@
 
        acpi_unmap_rsdt(rsdt);
 
+       /* Clamp the max transfer size - assume LPC devs may be beneath us. */
+       self->dv_maxphys = MIN(parent->dv_maxphys, 64 * 1024);
+
        sc->sc_dev = self;
        sc->sc_root = NULL;
 
diff -r f306f12b08cb -r ad9ed21278de sys/dev/ic/mpt_netbsd.c
--- a/sys/dev/ic/mpt_netbsd.c   Wed Sep 12 02:00:51 2012 +0000
+++ b/sys/dev/ic/mpt_netbsd.c   Wed Sep 12 06:15:31 2012 +0000
@@ -1,4 +1,4 @@
-/*     $NetBSD: mpt_netbsd.c,v 1.18 2012/03/18 21:05:21 martin Exp $   */
+/*     $NetBSD: mpt_netbsd.c,v 1.18.2.1 2012/09/12 06:15:32 tls Exp $  */
 
 /*
  * Copyright (c) 2003 Wasabi Systems, Inc.
@@ -77,7 +77,7 @@
  */
 
 #include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: mpt_netbsd.c,v 1.18 2012/03/18 21:05:21 martin Exp $");

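The clamping pattern repeated in the hunks above
(self->dv_maxphys = MIN(parent->dv_maxphys, ...)) and the "feed a fake,
huge buffer to minphys" probe described in MAXPHYS-NOTES can be modeled
in isolation.  The following is only a sketch under stated assumptions:
apart from the dv_maxphys field name, every type and function here
(sketch_dev, sketch_buf, sketch_attach, sketch_minphys,
sketch_probe_maxphys) is hypothetical and does not correspond to the
kernel's real device_t, struct buf, or minphys interfaces.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MIN(a, b)	((a) < (b) ? (a) : (b))

/* Hypothetical miniature device node; only dv_maxphys mirrors the diff. */
struct sketch_dev {
	const struct sketch_dev	*parent;
	size_t			 dv_maxphys;
};

/* Hypothetical buffer: just the byte count a minphys routine trims. */
struct sketch_buf {
	size_t	b_bcount;
};

/* Attach-time step: inherit the parent's limit, clamped to our own. */
static void
sketch_attach(struct sketch_dev *self, const struct sketch_dev *parent,
    size_t own_limit)
{
	assert(self != NULL && parent != NULL);

	self->parent = parent;
	self->dv_maxphys = SKETCH_MIN(parent->dv_maxphys, own_limit);
}

/* Disk-style minphys: trim a request to the limit found at attach time. */
static void
sketch_minphys(const struct sketch_dev *dev, struct sketch_buf *bp)
{
	if (bp->b_bcount > dev->dv_maxphys)
		bp->b_bcount = dev->dv_maxphys;
}

/* Probe: offer an oversized fake request and report what survives. */
static size_t
sketch_probe_maxphys(const struct sketch_dev *dev)
{
	struct sketch_buf fake = { .b_bcount = (size_t)-1 };

	sketch_minphys(dev, &fake);
	return fake.b_bcount;
}
```

In this model a disk with a 256K limit under a 1M bus probes as 256K,
while any device attached beneath a 64K-clamped bus probes as 64K no
matter what its own limit is.  That is why the pnpbios root in the diff
deliberately leaves its value unclamped.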

