tech-kern archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: Resolving open/close, attach/detach races



New draft changes to resolve open/close/attach/detach races.

There's one snag in here that I haven't been able to resolve -- a race
in concurrent revoke and detach; maybe hannken@ can help?  The snag is
that vdevgone needs to revoke all existing device nodes _and_ wait for
the revocation to complete -- even if it is happening concurrently via
the revoke(2) system call.

But spec_lookup_by_dev will just skip vnodes currently being revoked,
and possibly return before they have finished being revoked.  I tried
making spec_lookup_by_dev wait with vdead_check(vp, 0) instead of
vdead_check(vp, VDEAD_NOWAIT) in some circumstances, but that didn't
work -- it led to deadlock or livelock (and then kernel lock spinout),
depending on how I did it.

My attempts to make this work may have failed because vdead_check is
forbidden if the caller doesn't hold a vnode reference, and it's not
clear taking a vnode reference is allowed at this point.  Generally,
taking a new reference to a vnode being reclaimed should not be
allowed.

(I don't entirely understand ad@'s recent(ish) changes to allow it in
some cases, which strikes me as a regression from the system we had
before where VOP_INACTIVE's decision is final -- a huge improvement
over the piles of bug-ridden incoherent gobbledegook we used to have
to deal with vnodes being revived in the middle of reclamation.)


Here's the main changes (opendetach-v7.patch / opendetach-v7.diff),
including driver responsibilities (which should eventually be
documented in driver(9)):

- bdev/cdev_open now blocks concurrent devsw_detach, so modules that
  use devsw_detach won't finish unloading until bdev/cdev_open have
  finished.  devsw_detach also returns void now so callers don't need
  to handle errors.

  Caller's responsibility:

  . Ensure .d_open will fail by the time it calls devsw_detach, e.g.
    by doing config_cfdriver_detach (maybe via config_fini_component)
    first and making .d_open fail if there is no autoconf unit.

  . Ensure all other .d_* calls have completed, e.g. by calling
    vdevgone in .ca_detach to revoke the device node and wait for all
    I/O operations on it to complete.

  bdev/cdev_* operations _other than_ open do not block concurrent
  devsw_detach.  This is not a problem as long as callers follow the
  rules.

- New device_lookup_acquire/device_release gives a device_t with a
  reference that does not go away until device_release.  They block if
  .ca_detach is already in progress, but they don't prevent new calls
  to .ca_detach from starting (which is necessary for ttys because
  opening a tty may hang indefinitely, so .ca_detach for, e.g.,
  ucom(4) needs to interrupt pending opens).

  Most of the time you don't need to call these yourself: For any
  bdevsw/cdevsw driver that opts in by setting the new .d_cfdriver and
  .d_devtounit, bdev/cdev_open will take a reference to the autoconf
  device while it runs (or fails if there is none).  So .d_open is
  guaranteed device_lookup and device_lookup_private will not fail and
  the softc will not go away before it returns.

  => Why only bdev/cdev_open and not the other operations?  The other
     ones like bdev_strategy or cdev_ioctl sometimes need to run in
     contexts where blocking is not allowed, so device_lookup_acquire
     can't wait for .ca_detach to complete -- but these are also used
     only in contexts where the device can't have gone away yet,
     bracketed by open and close, so no acquire/release is needed.

  The attached patch to audio(4) (audioopen.patch) shows examples both
  of using device_lookup_acquire directly (for audiobellopen, which
  doesn't go through any devsw) and of using it indirectly via
  .d_cfdriver/.d_devotunit (for audioopen, of /dev/audio).

- specfs guarantees that spec_node_revoke will not return until all
  concurrent devsw operations have completed and no new ones can
  start.  Thus, except for the snag with spec_lookup_by_dev, vdevgone
  will not return until all instances have been closed and all .d_*
  operations have returned.

  Driver's responsibility:

  . Ensure .d_close will interrupt any concurrent I/O operations --
    including possibly .d_open! -- so they return promptly.  For
    example, a tty driver must call ttyclose so that any concurrent
    ttyopen waiting for carrier will fail with ERESTART.

    Caveat: Often a driver's .d_close function also needs to wait for
    I/O operations in order to free resources they are using, like
    buffers, USB pipes, &c.  Currently this requires driver-specific
    logic to track and wait for pending I/O operations.

    => I also drafted a patch to add a new .d_cancel to bdevsw/cdevsw
       to make this easier: when closing or revoking a device, specfs
       will call .d_cancel first to interrupt pending I/O operations,
       and then wait for them to complete, and then call .d_close once
       they are guaranteed to be completed (except that new .d_open is
       still possible but a driver can make it block).

       The .d_cancel mechanism will help us to simultaneously
       eliminate a large class of bugs and a large swaths of driver-
       specific I/O reference counting, making drivers easier to
       understand, audit, and maintain.  I'm including an example of
       these changes for ucom(4) (ucom.patch) which removes 6% of the
       lines in the file, including a lot of dead branches -- needs
       more testing but it's getting there.

  In the course of this change I fixed a number of use-after-free bugs
  in specfs, deleted some unnecessary code, sprinkled in some new
  assertions, and refactored to make the logic a little clearer.

  The mechanism that specfs uses to wait for I/O operations to drain
  is currently a reference count called sn_iocnt in struct specnode
  managed with atomics.  We could use psref or localcount instead if
  this becomes a bottleneck -- probably not a big deal.

  XXX OOPS: While writing this I realized it might be necessary to
  make it struct specdev::sd_iocnt, not struct specnode::sn_iocnt, and
  make spec_node_revoke only drain I/O operations if it is the last
  remaining specnode.  I've only done testing with a single /dev so
  far so each specdev has only one specnode.

  XXX It might also be a good idea to have specfs make further calls
  to .d_open block while spec_close is running; currently, and with
  the patch series, it is the driver's responsibility to deal with
  this possibility, but reasoning about this is difficult.  chuq tried
  a reader/writer lock around open/close, but that doesn't quite work
  because close has to be able to interrupt open -- but maybe open can
  just wait for any concurrent close to finish, without blocking a new
  close from interrupting.


In a separate patch (cancel.patch), since it more substantively
changes the NetBSD device driver API:

- New bdevsw/cdevsw operation .d_cancel to interrupt all pending I/O
  operations and make them return promptly.

  If provided, then when an open device vnode (or a device vnode in
  the middle of .d_open even if it hasn't finished yet) is closed (for
  the last time, if opened several times), specfs will first call
  .d_cancel, and then wait for all concurrent devsw I/O operations on
  that vnode to drain before calling .d_close.

  specfs already has logic to drain the I/O operations, in
  spec_node_revoke -- it just uses the same logic to drain them a
  little earlier in spec_close for drivers with .d_cancel functions.

  This way .d_close is guaranteed that there are no pending I/O
  operations and can free resources that would be used by them.  For
  instance, a USB driver might do something like this:

	struct uvwxyz_softc {
		...
		kmutex_t sc_lock;
		kcondvar_t sc_cv;
		bool sc_dying;
		bool sc_closing;
		...
	};

	uvwxyz_open(...)
	{
		struct uvwxyz_softc *sc = device_lookup(...);

		mutex_enter(&sc->sc_lock);
		if (sc->sc_dying) {
			mutex_exit(&sc->sc_lock);
			return ENXIO;
		}
		if (sc->sc_closing) {
			mutex_exit(&sc->sc_lock);
			return EIO;
		}
		... if (firstopen) usbd_open_pipe(..., &sc->sc_pipe); ...
		... if (firstopen) sc->sc_buf = kmem_alloc(...); ...
	}

	uvwxyz_read/write/ioctl(...)
	{
		...
		mutex_enter(&sc->sc_lock);
		if (sc->sc_closing) {
			error = EIO;
			goto out;
                }
		... usbd_transfer(... sc->sc_buf ...) ...
		/* sleep for xfer completion unless closing */
		while (!sc->sc_io_ready) {
			if (sc->sc_closing) {
				error = EIO;
				goto out;
			}
			... cv_wait_sig(&sc->sc_cv, &sc->sc_lock) ...
		}
		...
	}

	uvwxyz_cancel(...)
	{

		/* wake anyone sleeping in wait for xfer completion */
		mutex_enter(&sc->sc_lock);
		sc->sc_closing = true;
		cv_broadcast(&sc->sc_cv);
		mutex_exit(&sc->sc_lock);

		/* cancel all xfers */
		usbd_abort_pipe(sc->sc_pipe);
	}

	uvwxyz_close(...)
	{

		/* no xfers in flight, no use of buffer -- safe to free */
		usbd_close_pipe(sc->sc_pipe);
		kmem_free(sc->sc_buf);

		/* allow new opens to proceed again */
		mutex_enter(&sc->sc_lock);
		sc->sc_closing = false;
		cv_broadcast(&sc->sc_cv);
		mutex_exit(&sc->sc_lock);
	}

	uvwxyz_detach(...)
	{

		/* make new opens fail immediately */
                mutex_enter(&sc->sc_lock);
		sc->sc_dying = true;
                mutex_exit(&sc->sc_lock);

		/*
		 * Close all existing instances and wait for all
		 * pending I/O operations to drain.
		 */
		vdevgone(maj, mn, mn, VCHR);

		/*
		 * At this point, there are no devsw operations in
		 * flight except posssibly uvwxyz_open, which will
		 * fail immediately because we set sc_dying, and
		 * uvwxyz_close will have closed any pipes.  So we
		 * usually only need to free resources that were
		 * allocated by attach at this point like sc_lock.
		 */
		...
	}

  As is, every correct driver has to have its own way to delimit I/O
  operations in uvwxyz_read/write/ioctl and wait for them to complete
  in uvwxyz_close.  This is what usb_detach_waitold, sc_detachcv, and
  various driver-specific reference-counting mechanisms are for; they
  are no longer necessary in the new regime.

  (In fact I suspect most drivers today are just broken in this area;
  my goal for netbsd-10 is to establish the mechanism we need to fix
  _and_ simplify a lot of drivers at the same time, so we can pull
  those fixes up to the branch.)


Comments welcome!  Especially requests for clarification or `this
seems too complicated, can you give me an example of how a driver
meets its responsibility?' or `I don't believe xyz mechanism in the
patch series is necessary, can you justify it?'.

Testing wouldn't hurt either!  Ideally, these changes shouldn't make
anything worse than it already is.  I've done some stress-testing with
open/close/revoke/detach on a Thinkpad T480 with a couple USB drivers
(uhid, ucom) modified to take advantage of the new .d_cancel, but I'm
stuck on the snag with spec_lookup_by_dev.
>From 780a065ee2763774a4712da9352bb6c415234478 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Wed, 12 Jan 2022 00:46:40 +0000
Subject: [PATCH 01/32] driver(9): devsw_detach never fails.  Make it return
 void.

Prune a whole lotta dead branches as a result of this.  (Some logic
calling this is also wrong for other reasons; devsw_detach is final
-- you should never have any reason to decide to roll it back.  To be
cleaned up in subsequent commits...)
---
 external/cddl/osnet/dev/dtrace/dtrace_modevent.c  |  4 +---
 external/cddl/osnet/dev/fbt/fbt.c                 |  3 ++-
 external/cddl/osnet/dev/sdt/sdt.c                 |  3 ++-
 .../cddl/osnet/dist/uts/common/fs/zfs/zfs_ioctl.c |  2 +-
 share/man/man9/devsw_attach.9                     | 15 +++++++++------
 sys/coda/coda_psdev.c                             |  2 +-
 sys/dev/ccd.c                                     |  2 +-
 sys/dev/clockctl.c                                |  8 +++-----
 sys/dev/hdaudio/hdaudio.c                         |  6 +-----
 sys/dev/i2c/i2c.c                                 |  7 ++-----
 sys/dev/pad/pad.c                                 |  4 +---
 sys/dev/raidframe/rf_netbsdkintf.c                | 11 +----------
 sys/dev/sysmon/sysmon.c                           |  2 +-
 sys/dev/tprof/tprof.c                             |  8 +-------
 sys/dist/pf/net/pf_ioctl.c                        |  3 ++-
 sys/external/bsd/ipf/netinet/ip_fil_netbsd.c      |  2 +-
 sys/fs/autofs/autofs_vfsops.c                     |  4 +---
 sys/kern/kern_drvctl.c                            |  9 ++-------
 sys/kern/subr_devsw.c                             |  3 +--
 sys/modules/examples/pollpal/pollpal.c            |  3 ++-
 sys/net/if_tap.c                                  |  5 +----
 sys/net/if_tun.c                                  |  9 +--------
 sys/rump/dev/lib/libbpf/bpf_component.c           |  3 +--
 sys/rump/dev/lib/libdrvctl/drvctl_component.c     |  4 +---
 sys/sys/conf.h                                    |  2 +-
 25 files changed, 41 insertions(+), 83 deletions(-)

diff --git a/external/cddl/osnet/dev/dtrace/dtrace_modevent.c b/external/cddl/osnet/dev/dtrace/dtrace_modevent.c
index cc0f8103fb98..f1338840125a 100644
--- a/external/cddl/osnet/dev/dtrace/dtrace_modevent.c
+++ b/external/cddl/osnet/dev/dtrace/dtrace_modevent.c
@@ -42,9 +42,7 @@ dtrace_modcmd(modcmd_t cmd, void *data)
 		return error;
 
 	case MODULE_CMD_FINI:
-		error = devsw_detach(NULL, &dtrace_cdevsw);
-		if (error != 0)
-			return error;
+		devsw_detach(NULL, &dtrace_cdevsw);
 
 		error = dtrace_unload();
 		if (error != 0) {
diff --git a/external/cddl/osnet/dev/fbt/fbt.c b/external/cddl/osnet/dev/fbt/fbt.c
index b367c2155292..46dd7c1f7f06 100644
--- a/external/cddl/osnet/dev/fbt/fbt.c
+++ b/external/cddl/osnet/dev/fbt/fbt.c
@@ -1329,7 +1329,8 @@ dtrace_fbt_modcmd(modcmd_t cmd, void *data)
 		error = fbt_unload();
 		if (error != 0)
 			return error;
-		return devsw_detach(NULL, &fbt_cdevsw);
+		devsw_detach(NULL, &fbt_cdevsw);
+		return 0;
 	case MODULE_CMD_AUTOUNLOAD:
 		return EBUSY;
 	default:
diff --git a/external/cddl/osnet/dev/sdt/sdt.c b/external/cddl/osnet/dev/sdt/sdt.c
index c3ad129f8284..5a41270a2917 100644
--- a/external/cddl/osnet/dev/sdt/sdt.c
+++ b/external/cddl/osnet/dev/sdt/sdt.c
@@ -562,7 +562,8 @@ dtrace_sdt_modcmd(modcmd_t cmd, void *data)
 		error = sdt_unload();
 		if (error != 0)
 			return error;
-		return devsw_detach(NULL, &sdt_cdevsw);
+		devsw_detach(NULL, &sdt_cdevsw);
+		return 0;
 	case MODULE_CMD_AUTOUNLOAD:
 		return EBUSY;
 	default:
diff --git a/external/cddl/osnet/dist/uts/common/fs/zfs/zfs_ioctl.c b/external/cddl/osnet/dist/uts/common/fs/zfs/zfs_ioctl.c
index 9e19cd1dc0c3..d74d8c71e54d 100644
--- a/external/cddl/osnet/dist/uts/common/fs/zfs/zfs_ioctl.c
+++ b/external/cddl/osnet/dist/uts/common/fs/zfs/zfs_ioctl.c
@@ -7231,7 +7231,7 @@ zfs_modcmd(modcmd_t cmd, void *arg)
 		if (error)
 			return error;
 
-		(void) devsw_detach(&zfs_bdevsw, &zfs_cdevsw);
+		devsw_detach(&zfs_bdevsw, &zfs_cdevsw);
 
 attacherr:
 		zfs_sysctl_fini();
diff --git a/share/man/man9/devsw_attach.9 b/share/man/man9/devsw_attach.9
index cf862be5846a..6ffc51957a3f 100644
--- a/share/man/man9/devsw_attach.9
+++ b/share/man/man9/devsw_attach.9
@@ -27,7 +27,7 @@
 .\" ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 .\" POSSIBILITY OF SUCH DAMAGE.
 .\"
-.Dd April 30, 2017
+.Dd January 11, 2022
 .Dt DEVSW 9
 .Os
 .Sh NAME
@@ -49,7 +49,7 @@
 .Fa "const struct cdevsw *cdev"
 .Fa "devmajor_t *cmajor"
 .Fc
-.Ft int
+.Ft void
 .Fo devsw_detach
 .Fa "const struct bdevsw *bdev"
 .Fa "const struct cdevsw *cdev"
@@ -130,6 +130,11 @@ and
 structures.
 .Fn devsw_detach
 should be called before a loaded device driver is unloaded.
+The caller must ensure that there are no open instances of the device,
+and that the device's
+.Fn d_open
+function will fail, before calling.
+Fn devsw_detach .
 .Pp
 The
 .Fn bdevsw_lookup
@@ -155,10 +160,8 @@ or
 .Sh RETURN VALUES
 Upon successful completion,
 .Fn devsw_attach
-and
-.Fn devsw_detach
-return 0.
-Otherwise they return an error value.
+returns 0.
+Otherwise it returns an error value.
 .Pp
 In case of failure,
 .Fn bdevsw_lookup
diff --git a/sys/coda/coda_psdev.c b/sys/coda/coda_psdev.c
index cede16da3f53..7f531f03fe56 100644
--- a/sys/coda/coda_psdev.c
+++ b/sys/coda/coda_psdev.c
@@ -758,7 +758,7 @@ vcoda_modcmd(modcmd_t cmd, void *arg)
 				if (VC_OPEN(vcp))
 					return EBUSY;
 			}
-			return devsw_detach(NULL, &vcoda_cdevsw);
+			devsw_detach(NULL, &vcoda_cdevsw);
 		}
 #endif
 		break;
diff --git a/sys/dev/ccd.c b/sys/dev/ccd.c
index 05945f9a67ba..2283bc0346da 100644
--- a/sys/dev/ccd.c
+++ b/sys/dev/ccd.c
@@ -1710,7 +1710,7 @@ ccd_modcmd(modcmd_t cmd, void *arg)
 			error = EBUSY;
 		} else {
 			mutex_exit(&ccd_lock);
-			error = devsw_detach(&ccd_bdevsw, &ccd_cdevsw);
+			devsw_detach(&ccd_bdevsw, &ccd_cdevsw);
 			ccddetach();
 		}
 #endif
diff --git a/sys/dev/clockctl.c b/sys/dev/clockctl.c
index 0da5e7765fe8..9685c0f129f6 100644
--- a/sys/dev/clockctl.c
+++ b/sys/dev/clockctl.c
@@ -182,14 +182,12 @@ clockctl_modcmd(modcmd_t cmd, void *data)
 			return EBUSY;
 		}
 #ifdef _MODULE
-		error = devsw_detach(NULL, &clockctl_cdevsw);
+		devsw_detach(NULL, &clockctl_cdevsw);
 #endif
 		mutex_exit(&clockctl_mtx);
 
-		if (error == 0) {
-			kauth_unlisten_scope(clockctl_listener);
-			mutex_destroy(&clockctl_mtx);
-		}
+		kauth_unlisten_scope(clockctl_listener);
+		mutex_destroy(&clockctl_mtx);
 		break;
 
 	default:
diff --git a/sys/dev/hdaudio/hdaudio.c b/sys/dev/hdaudio/hdaudio.c
index d39ff2db6cde..5c7874778e22 100644
--- a/sys/dev/hdaudio/hdaudio.c
+++ b/sys/dev/hdaudio/hdaudio.c
@@ -1636,11 +1636,7 @@ hdaudio_modcmd(modcmd_t cmd, void *opaque)
 		error = config_cfdriver_detach(&hdaudio_cd);
 		if (error)
 			break;
-		error = devsw_detach(NULL, &hdaudio_cdevsw);
-		if (error) {
-			config_cfdriver_attach(&hdaudio_cd);
-			break;
-		}
+		devsw_detach(NULL, &hdaudio_cdevsw);
 #endif
 		break;
 	default:
diff --git a/sys/dev/i2c/i2c.c b/sys/dev/i2c/i2c.c
index 6f2c0c6a9698..36a3e87d5316 100644
--- a/sys/dev/i2c/i2c.c
+++ b/sys/dev/i2c/i2c.c
@@ -942,7 +942,7 @@ iic_modcmd(modcmd_t cmd, void *opaque)
 		if (error) {
 			aprint_error("%s: unable to init component\n",
 			    iic_cd.cd_name);
-			(void)devsw_detach(NULL, &iic_cdevsw);
+			devsw_detach(NULL, &iic_cdevsw);
 		}
 		mutex_exit(&iic_mtx);
 #endif
@@ -960,10 +960,7 @@ iic_modcmd(modcmd_t cmd, void *opaque)
 			mutex_exit(&iic_mtx);
 			break;
 		}
-		error = devsw_detach(NULL, &iic_cdevsw);
-		if (error != 0)
-			config_init_component(cfdriver_ioconf_iic,
-			    cfattach_ioconf_iic, cfdata_ioconf_iic);
+		devsw_detach(NULL, &iic_cdevsw);
 #endif
 		mutex_exit(&iic_mtx);
 		break;
diff --git a/sys/dev/pad/pad.c b/sys/dev/pad/pad.c
index fe0b429cf386..a779f1f71b8d 100644
--- a/sys/dev/pad/pad.c
+++ b/sys/dev/pad/pad.c
@@ -777,9 +777,7 @@ pad_modcmd(modcmd_t cmd, void *arg)
 
 	case MODULE_CMD_FINI:
 #ifdef _MODULE
-		error = devsw_detach(NULL, &pad_cdevsw);
-		if (error)
-			break;
+		devsw_detach(NULL, &pad_cdevsw);
 
 		error = config_fini_component(cfdriver_ioconf_pad,
 		    pad_cfattach, cfdata_ioconf_pad);
diff --git a/sys/dev/raidframe/rf_netbsdkintf.c b/sys/dev/raidframe/rf_netbsdkintf.c
index d1bda3553e03..87439aa70bfb 100644
--- a/sys/dev/raidframe/rf_netbsdkintf.c
+++ b/sys/dev/raidframe/rf_netbsdkintf.c
@@ -4088,16 +4088,7 @@ raid_modcmd_fini(void)
 		return error;
 	}
 #endif
-	error = devsw_detach(&raid_bdevsw, &raid_cdevsw);
-	if (error != 0) {
-		aprint_error("%s: cannot detach devsw\n",__func__);
-#ifdef _MODULE
-		config_cfdriver_attach(&raid_cd);
-#endif
-		config_cfattach_attach(raid_cd.cd_name, &raid_ca);
-		mutex_exit(&raid_lock);
-		return error;
-	}
+	devsw_detach(&raid_bdevsw, &raid_cdevsw);
 	rf_BootRaidframe(false);
 #if (RF_INCLUDE_PARITY_DECLUSTERING_DS > 0)
 	rf_destroy_mutex2(rf_sparet_wait_mutex);
diff --git a/sys/dev/sysmon/sysmon.c b/sys/dev/sysmon/sysmon.c
index 46aedaad7337..fd2993f0c180 100644
--- a/sys/dev/sysmon/sysmon.c
+++ b/sys/dev/sysmon/sysmon.c
@@ -356,7 +356,7 @@ sysmon_fini(void)
 	if (error == 0) {
 		mutex_enter(&sysmon_minor_mtx);
 		sm_is_attached = false;
-		error = devsw_detach(NULL, &sysmon_cdevsw);
+		devsw_detach(NULL, &sysmon_cdevsw);
 		mutex_exit(&sysmon_minor_mtx);
 	}
 #endif
diff --git a/sys/dev/tprof/tprof.c b/sys/dev/tprof/tprof.c
index b069a5b7df5d..136fd190ad14 100644
--- a/sys/dev/tprof/tprof.c
+++ b/sys/dev/tprof/tprof.c
@@ -768,13 +768,7 @@ tprof_modcmd(modcmd_t cmd, void *arg)
 
 	case MODULE_CMD_FINI:
 #if defined(_MODULE)
-		{
-			int error;
-			error = devsw_detach(NULL, &tprof_cdevsw);
-			if (error) {
-				return error;
-			}
-		}
+		devsw_detach(NULL, &tprof_cdevsw);
 #endif /* defined(_MODULE) */
 		tprof_driver_fini();
 		return 0;
diff --git a/sys/dist/pf/net/pf_ioctl.c b/sys/dist/pf/net/pf_ioctl.c
index 94bfb70a411d..e4c13be698f8 100644
--- a/sys/dist/pf/net/pf_ioctl.c
+++ b/sys/dist/pf/net/pf_ioctl.c
@@ -3459,7 +3459,8 @@ pf_modcmd(modcmd_t cmd, void *opaque)
 		} else {
 			pfdetach();
 			pflogdetach();
-			return devsw_detach(NULL, &pf_cdevsw);
+			devsw_detach(NULL, &pf_cdevsw);
+			return 0;
 		}
 	default:
 		return ENOTTY;
diff --git a/sys/external/bsd/ipf/netinet/ip_fil_netbsd.c b/sys/external/bsd/ipf/netinet/ip_fil_netbsd.c
index d0c4ca95097c..bb4e0706cc9b 100644
--- a/sys/external/bsd/ipf/netinet/ip_fil_netbsd.c
+++ b/sys/external/bsd/ipf/netinet/ip_fil_netbsd.c
@@ -2256,7 +2256,7 @@ ipl_fini(void *opaque)
 {
 
 #ifdef _MODULE
-	(void)devsw_detach(NULL, &ipl_cdevsw);
+	devsw_detach(NULL, &ipl_cdevsw);
 #endif
 
 	/*
diff --git a/sys/fs/autofs/autofs_vfsops.c b/sys/fs/autofs/autofs_vfsops.c
index fbd6eafe6532..1204d1f9b6d3 100644
--- a/sys/fs/autofs/autofs_vfsops.c
+++ b/sys/fs/autofs/autofs_vfsops.c
@@ -496,9 +496,7 @@ autofs_modcmd(modcmd_t cmd, void *arg)
 		}
 		mutex_exit(&autofs_softc->sc_lock);
 
-		error = devsw_detach(NULL, &autofs_cdevsw);
-		if (error)
-			break;
+		devsw_detach(NULL, &autofs_cdevsw);
 #endif
 		error = vfs_detach(&autofs_vfsops);
 		if (error)
diff --git a/sys/kern/kern_drvctl.c b/sys/kern/kern_drvctl.c
index 37f4730b2512..8a4156f8a0aa 100644
--- a/sys/kern/kern_drvctl.c
+++ b/sys/kern/kern_drvctl.c
@@ -665,15 +665,10 @@ drvctl_modcmd(modcmd_t cmd, void *arg)
 		devmon_insert_vec = saved_insert_vec;
 		saved_insert_vec = NULL;
 #ifdef _MODULE
-		error = devsw_detach(NULL, &drvctl_cdevsw);
-		if (error != 0) {
-			saved_insert_vec = devmon_insert_vec;
-			devmon_insert_vec = devmon_insert;
-		}
+		devsw_detach(NULL, &drvctl_cdevsw);
 #endif
 		mutex_exit(&drvctl_lock);
-		if (error == 0)
-			drvctl_fini();
+		drvctl_fini();
 
 		break;
 	default:
diff --git a/sys/kern/subr_devsw.c b/sys/kern/subr_devsw.c
index 1a0f721fdd65..b95a09d3e6b6 100644
--- a/sys/kern/subr_devsw.c
+++ b/sys/kern/subr_devsw.c
@@ -343,14 +343,13 @@ devsw_detach_locked(const struct bdevsw *bdev, const struct cdevsw *cdev)
 	}
 }
 
-int
+void
 devsw_detach(const struct bdevsw *bdev, const struct cdevsw *cdev)
 {
 
 	mutex_enter(&device_lock);
 	devsw_detach_locked(bdev, cdev);
 	mutex_exit(&device_lock);
-	return 0;
 }
 
 /*
diff --git a/sys/modules/examples/pollpal/pollpal.c b/sys/modules/examples/pollpal/pollpal.c
index b76e0733699b..d8ddc73450e0 100644
--- a/sys/modules/examples/pollpal/pollpal.c
+++ b/sys/modules/examples/pollpal/pollpal.c
@@ -311,7 +311,8 @@ pollpal_modcmd(modcmd_t cmd, void *arg __unused)
 	case MODULE_CMD_FINI:
 		if (pollpal_nopen != 0)
 			return EBUSY;
-		return devsw_detach(NULL, &pollpal_cdevsw);
+		devsw_detach(NULL, &pollpal_cdevsw);
+		return 0;
 	default:
 		return ENOTTY;
 	} 
diff --git a/sys/net/if_tap.c b/sys/net/if_tap.c
index 0b57ad4a711c..314f4647707c 100644
--- a/sys/net/if_tap.c
+++ b/sys/net/if_tap.c
@@ -256,9 +256,7 @@ tapdetach(void)
 
 	if_clone_detach(&tap_cloners);
 #ifdef _MODULE
-	error = devsw_detach(NULL, &tap_cdevsw);
-	if (error != 0)
-		goto out2;
+	devsw_detach(NULL, &tap_cdevsw);
 #endif
 
 	if (tap_count != 0) {
@@ -277,7 +275,6 @@ tapdetach(void)
  out1:
 #ifdef _MODULE
 	devsw_attach("tap", NULL, &tap_bmajor, &tap_cdevsw, &tap_cmajor);
- out2:
 #endif
 	if_clone_attach(&tap_cloners);
 
diff --git a/sys/net/if_tun.c b/sys/net/if_tun.c
index 4f533a8f08d1..f4e5b6d86d43 100644
--- a/sys/net/if_tun.c
+++ b/sys/net/if_tun.c
@@ -142,17 +142,10 @@ tuninit(void)
 static int
 tundetach(void)
 {
-#ifdef _MODULE
-	int error;
-#endif
 
 	if_clone_detach(&tun_cloner);
 #ifdef _MODULE
-	error = devsw_detach(NULL, &tun_cdevsw);
-	if (error != 0) {
-		if_clone_attach(&tun_cloner);
-		return error;
-	}
+	devsw_detach(NULL, &tun_cdevsw);
 #endif
 
 	if (!LIST_EMPTY(&tun_softc_list) || !LIST_EMPTY(&tunz_softc_list)) {
diff --git a/sys/rump/dev/lib/libbpf/bpf_component.c b/sys/rump/dev/lib/libbpf/bpf_component.c
index 05807d371d40..d41d1987afe8 100644
--- a/sys/rump/dev/lib/libbpf/bpf_component.c
+++ b/sys/rump/dev/lib/libbpf/bpf_component.c
@@ -50,6 +50,5 @@ RUMP_COMPONENT(RUMP_COMPONENT_NET)
 		panic("bpf devsw attach failed: %d", error);
 	if ((error = rump_vfs_makeonedevnode(S_IFCHR, "/dev/bpf", cmaj, 0)) !=0)
 		panic("cannot create bpf device nodes: %d", error);
-	if ((error = devsw_detach(NULL, &bpf_cdevsw)) != 0)
-		panic("cannot detach bpf devsw: %d", error);
+	devsw_detach(NULL, &bpf_cdevsw);
 }
diff --git a/sys/rump/dev/lib/libdrvctl/drvctl_component.c b/sys/rump/dev/lib/libdrvctl/drvctl_component.c
index e2e79f45f9de..ac4e103fdb9c 100644
--- a/sys/rump/dev/lib/libdrvctl/drvctl_component.c
+++ b/sys/rump/dev/lib/libdrvctl/drvctl_component.c
@@ -51,7 +51,5 @@ RUMP_COMPONENT(RUMP_COMPONENT_DEV)
 	if ( error !=0)
 		panic("cannot create drvctl device node: %d", error);
 
-	error = devsw_detach(NULL, &drvctl_cdevsw);
-	if (error != 0)
-		panic("cannot detach drvctl devsw: %d", error);
+	devsw_detach(NULL, &drvctl_cdevsw);
 }
diff --git a/sys/sys/conf.h b/sys/sys/conf.h
index 081631d2111f..ba46976a924b 100644
--- a/sys/sys/conf.h
+++ b/sys/sys/conf.h
@@ -104,7 +104,7 @@ extern kmutex_t device_lock;
 
 int devsw_attach(const char *, const struct bdevsw *, devmajor_t *,
 		 const struct cdevsw *, devmajor_t *);
-int devsw_detach(const struct bdevsw *, const struct cdevsw *);
+void devsw_detach(const struct bdevsw *, const struct cdevsw *);
 const struct bdevsw *bdevsw_lookup(dev_t);
 const struct cdevsw *cdevsw_lookup(dev_t);
 devmajor_t bdevsw_lookup_major(const struct bdevsw *);

>From bb260b45d7f37803a2ee535155a332448926f8bf Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Wed, 12 Jan 2022 01:05:24 +0000
Subject: [PATCH 02/32] driver(9): Fix synchronization of
 devsw_attach/lookup/detach.

(`dev' means either `bdev' or `cdev' for brevity here, e.g. in
`devsw_lookup' (bdevsw_lookup, cdevsw_lookup), `dev_open' (bdev_open,
cdev_open), `maxdevsws', &c., except for `devsw_attach' and
`devsw_detach' which are taken literally.)

- Use atomic_store_release and atomic_load_consume for devsw and
  tables and their entries, which are read unlocked and thus require
  memory barriers to ensure ordering between initialization in
  devsw_attach and use in dev_lookup.

- Use pserialize(9) and localcount(9) to synchronize dev_open and
  devsw_detach.

  => Driver must ensure d_open fails and all open instances have been
     closed by the time it calls devsw_detach.

  => Bonus: dev_open is no longer globally serialized through
     device_lock.

- Use atomic_store_release and atomic_load_acquire for max_devsws,
  which is used in conditionals in the new devsw_lookup_acquire.

  => It is safe to use atomic_load_relaxed in devsw_lookup because
     the caller must guarantee the entry is stable, so any increase
     of max_devsws must have already happened.

  => devsw_lookup and devsw_lookup_acquire assume that max_devsws
     never goes down.  If you change this you must find some way to
     adapt the users, preferably without adding much overhead so that
     devsw operations are cheap.

This change introduces an auxiliary table devswref mapping device
majors to localcounts of opens in progress.  The auxiliary table only
occupies one pointer's worth of memory in a monolithic kernel, and is
allocated on the fly for dynamically loaded modules.  We could ask
the module itself to reserve storage for it, but I don't see much
value in that, and it would require some changes to the ABI and to
config(8).
---
 sys/kern/subr_devsw.c | 300 ++++++++++++++++++++++++++++++++++--------
 1 file changed, 246 insertions(+), 54 deletions(-)

diff --git a/sys/kern/subr_devsw.c b/sys/kern/subr_devsw.c
index b95a09d3e6b6..9b19b89b3538 100644
--- a/sys/kern/subr_devsw.c
+++ b/sys/kern/subr_devsw.c
@@ -85,6 +85,9 @@ __KERNEL_RCSID(0, "$NetBSD: subr_devsw.c,v 1.38 2017/11/07 18:35:57 christos Exp
 #include <sys/buf.h>
 #include <sys/reboot.h>
 #include <sys/sdt.h>
+#include <sys/atomic.h>
+#include <sys/localcount.h>
+#include <sys/pserialize.h>
 
 #ifdef DEVSW_DEBUG
 #define	DPRINTF(x)	printf x
@@ -97,12 +100,23 @@ __KERNEL_RCSID(0, "$NetBSD: subr_devsw.c,v 1.38 2017/11/07 18:35:57 christos Exp
 #define	CDEVSW_SIZE	(sizeof(struct cdevsw *))
 #define	DEVSWCONV_SIZE	(sizeof(struct devsw_conv))
 
+struct devswref {
+	struct localcount	*dr_lc;
+	bool			dr_dynamic;
+};
+
+/* XXX bdevsw, cdevsw, max_bdevsws, and max_cdevsws should be volatile */
 extern const struct bdevsw **bdevsw, *bdevsw0[];
 extern const struct cdevsw **cdevsw, *cdevsw0[];
 extern struct devsw_conv *devsw_conv, devsw_conv0[];
 extern const int sys_bdevsws, sys_cdevsws;
 extern int max_bdevsws, max_cdevsws, max_devsw_convs;
 
+static struct devswref *cdevswref;
+static struct devswref *bdevswref;
+static struct pserialize *devsw_psz;
+static kcondvar_t devsw_cv;
+
 static int bdevsw_attach(const struct bdevsw *, devmajor_t *);
 static int cdevsw_attach(const struct cdevsw *, devmajor_t *);
 static void devsw_detach_locked(const struct bdevsw *, const struct cdevsw *);
@@ -118,6 +132,9 @@ devsw_init(void)
 	KASSERT(sys_bdevsws < MAXDEVSW - 1);
 	KASSERT(sys_cdevsws < MAXDEVSW - 1);
 	mutex_init(&device_lock, MUTEX_DEFAULT, IPL_NONE);
+
+	devsw_psz = pserialize_create();
+	cv_init(&devsw_cv, "devsw");
 }
 
 int
@@ -158,15 +175,12 @@ devsw_attach(const char *devname,
 			error = EEXIST;
 			goto fail;
 		}
-
-		if (bdev != NULL)
-			bdevsw[*bmajor] = bdev;
-		cdevsw[*cmajor] = cdev;
-
-		mutex_exit(&device_lock);
-		return (0);
 	}
 
+	/*
+	 * XXX This should allocate what it needs up front so we never
+	 * need to flail around trying to unwind.
+	 */
 	error = bdevsw_attach(bdev, bmajor);
 	if (error != 0) 
 		goto fail;
@@ -176,6 +190,13 @@ devsw_attach(const char *devname,
 		goto fail;
 	}
 
+	/*
+	 * If we already found a conv, we're done.  Otherwise, find an
+	 * empty slot or extend the table.
+	 */
+	if (i == max_devsw_convs)
+		goto fail;
+
 	for (i = 0 ; i < max_devsw_convs ; i++) {
 		if (devsw_conv[i].d_name == NULL)
 			break;
@@ -224,7 +245,9 @@ devsw_attach(const char *devname,
 static int
 bdevsw_attach(const struct bdevsw *devsw, devmajor_t *devmajor)
 {
-	const struct bdevsw **newptr;
+	const struct bdevsw **newbdevsw = NULL;
+	struct devswref *newbdevswref = NULL;
+	struct localcount *lc;
 	devmajor_t bmajor;
 	int i;
 
@@ -253,20 +276,35 @@ bdevsw_attach(const struct bdevsw *devsw, devmajor_t *devmajor)
 		return (ENOMEM);
 	}
 
+	if (bdevswref == NULL) {
+		newbdevswref = kmem_zalloc(MAXDEVSW * sizeof(newbdevswref[0]),
+		    KM_NOSLEEP);
+		if (newbdevswref == NULL)
+			return ENOMEM;
+		atomic_store_release(&bdevswref, newbdevswref);
+	}
+
 	if (*devmajor >= max_bdevsws) {
 		KASSERT(bdevsw == bdevsw0);
-		newptr = kmem_zalloc(MAXDEVSW * BDEVSW_SIZE, KM_NOSLEEP);
-		if (newptr == NULL)
-			return (ENOMEM);
-		memcpy(newptr, bdevsw, max_bdevsws * BDEVSW_SIZE);
-		bdevsw = newptr;
-		max_bdevsws = MAXDEVSW;
+		newbdevsw = kmem_zalloc(MAXDEVSW * sizeof(newbdevsw[0]),
+		    KM_NOSLEEP);
+                if (newbdevsw == NULL)
+                        return ENOMEM;
+		memcpy(newbdevsw, bdevsw, max_bdevsws * sizeof(bdevsw[0]));
+		atomic_store_release(&bdevsw, newbdevsw);
+		atomic_store_release(&max_bdevsws, MAXDEVSW);
 	}
 
 	if (bdevsw[*devmajor] != NULL)
 		return (EEXIST);
 
-	bdevsw[*devmajor] = devsw;
+	KASSERT(bdevswref[*devmajor].dr_lc == NULL);
+	lc = kmem_zalloc(sizeof(*lc), KM_SLEEP);
+	localcount_init(lc);
+	bdevswref[*devmajor].dr_lc = lc;
+	bdevswref[*devmajor].dr_dynamic = true;
+
+	atomic_store_release(&bdevsw[*devmajor], devsw);
 
 	return (0);
 }
@@ -274,7 +312,9 @@ bdevsw_attach(const struct bdevsw *devsw, devmajor_t *devmajor)
 static int
 cdevsw_attach(const struct cdevsw *devsw, devmajor_t *devmajor)
 {
-	const struct cdevsw **newptr;
+	const struct cdevsw **newcdevsw = NULL;
+	struct devswref *newcdevswref = NULL;
+	struct localcount *lc;
 	devmajor_t cmajor;
 	int i;
 
@@ -300,20 +340,35 @@ cdevsw_attach(const struct cdevsw *devsw, devmajor_t *devmajor)
 		return (ENOMEM);
 	}
 
+	if (cdevswref == NULL) {
+		newcdevswref = kmem_zalloc(MAXDEVSW * sizeof(newcdevswref[0]),
+		    KM_NOSLEEP);
+		if (newcdevswref == NULL)
+			return ENOMEM;
+		atomic_store_release(&cdevswref, newcdevswref);
+	}
+
 	if (*devmajor >= max_cdevsws) {
 		KASSERT(cdevsw == cdevsw0);
-		newptr = kmem_zalloc(MAXDEVSW * CDEVSW_SIZE, KM_NOSLEEP);
-		if (newptr == NULL)
-			return (ENOMEM);
-		memcpy(newptr, cdevsw, max_cdevsws * CDEVSW_SIZE);
-		cdevsw = newptr;
-		max_cdevsws = MAXDEVSW;
+		newcdevsw = kmem_zalloc(MAXDEVSW * sizeof(newcdevsw[0]),
+		    KM_NOSLEEP);
+                if (newcdevsw == NULL)
+                        return ENOMEM;
+		memcpy(newcdevsw, cdevsw, max_cdevsws * sizeof(cdevsw[0]));
+		atomic_store_release(&cdevsw, newcdevsw);
+		atomic_store_release(&max_cdevsws, MAXDEVSW);
 	}
 
 	if (cdevsw[*devmajor] != NULL)
 		return (EEXIST);
 
-	cdevsw[*devmajor] = devsw;
+	KASSERT(cdevswref[*devmajor].dr_lc == NULL);
+	lc = kmem_zalloc(sizeof(*lc), KM_SLEEP);
+	localcount_init(lc);
+	cdevswref[*devmajor].dr_lc = lc;
+	cdevswref[*devmajor].dr_dynamic = true;
+
+	atomic_store_release(&cdevsw[*devmajor], devsw);
 
 	return (0);
 }
@@ -321,25 +376,62 @@ cdevsw_attach(const struct cdevsw *devsw, devmajor_t *devmajor)
 static void
 devsw_detach_locked(const struct bdevsw *bdev, const struct cdevsw *cdev)
 {
-	int i;
+	int bi, ci = -1/*XXXGCC*/;
 
 	KASSERT(mutex_owned(&device_lock));
 
+	/* Prevent new references.  */
 	if (bdev != NULL) {
-		for (i = 0 ; i < max_bdevsws ; i++) {
-			if (bdevsw[i] != bdev)
+		for (bi = 0; bi < max_bdevsws; bi++) {
+			if (bdevsw[bi] != bdev)
 				continue;
-			bdevsw[i] = NULL;
+			atomic_store_relaxed(&bdevsw[bi], NULL);
 			break;
 		}
+		KASSERT(bi < max_bdevsws);
 	}
 	if (cdev != NULL) {
-		for (i = 0 ; i < max_cdevsws ; i++) {
-			if (cdevsw[i] != cdev)
+		for (ci = 0; ci < max_cdevsws; ci++) {
+			if (cdevsw[ci] != cdev)
 				continue;
-			cdevsw[i] = NULL;
+			atomic_store_relaxed(&cdevsw[ci], NULL);
 			break;
 		}
+		KASSERT(ci < max_cdevsws);
+	}
+
+	if (bdev == NULL && cdev == NULL) /* XXX possible? */
+		return;
+
+	/*
+	 * Wait for all bdevsw_lookup_acquire, cdevsw_lookup_acquire
+	 * calls to notice that the devsw is gone.
+	 */
+	pserialize_perform(devsw_psz);
+
+	/*
+	 * Wait for all references to drain.  It is the caller's
+	 * responsibility to ensure that at this point, there are no
+	 * extant open instances and all new d_open calls will fail.
+	 *
+	 * Note that localcount_drain may release and reacquire
+	 * device_lock.
+	 */
+	if (bdev != NULL) {
+		localcount_drain(bdevswref[bi].dr_lc,
+		    &devsw_cv, &device_lock);
+		localcount_fini(bdevswref[bi].dr_lc);
+		kmem_free(bdevswref[bi].dr_lc, sizeof(*bdevswref[bi].dr_lc));
+		bdevswref[bi].dr_lc = NULL;
+		bdevswref[bi].dr_dynamic = false;
+	}
+	if (cdev != NULL) {
+		localcount_drain(cdevswref[ci].dr_lc,
+		    &devsw_cv, &device_lock);
+		localcount_fini(cdevswref[ci].dr_lc);
+		kmem_free(cdevswref[ci].dr_lc, sizeof(*cdevswref[ci].dr_lc));
+		cdevswref[ci].dr_lc = NULL;
+		cdevswref[ci].dr_dynamic = false;
 	}
 }
 
@@ -365,10 +457,60 @@ bdevsw_lookup(dev_t dev)
 	if (dev == NODEV)
 		return (NULL);
 	bmajor = major(dev);
-	if (bmajor < 0 || bmajor >= max_bdevsws)
+	if (bmajor < 0 || bmajor >= atomic_load_relaxed(&max_bdevsws))
 		return (NULL);
 
-	return (bdevsw[bmajor]);
+	return atomic_load_consume(&bdevsw)[bmajor];
+}
+
+static const struct bdevsw *
+bdevsw_lookup_acquire(dev_t dev, struct localcount **lcp)
+{
+	devmajor_t bmajor;
+	const struct bdevsw *bdev = NULL, *const *curbdevsw;
+	struct devswref *curbdevswref;
+	int s;
+
+	if (dev == NODEV)
+		return NULL;
+	bmajor = major(dev);
+	if (bmajor < 0)
+		return NULL;
+
+	s = pserialize_read_enter();
+
+	/*
+	 * max_bdevsws never goes down, so it is safe to rely on this
+	 * condition without any locking for the array access below.
+	 * Test sys_bdevsws first so we can avoid the memory barrier in
+	 * that case.
+	 */
+	if (bmajor >= sys_bdevsws &&
+	    bmajor >= atomic_load_acquire(&max_bdevsws))
+		goto out;
+	curbdevsw = atomic_load_consume(&bdevsw);
+	if ((bdev = atomic_load_consume(&curbdevsw[bmajor])) == NULL)
+		goto out;
+
+	curbdevswref = atomic_load_consume(&bdevswref);
+	if (curbdevswref == NULL || !curbdevswref[bmajor].dr_dynamic) {
+		*lcp = NULL;
+	} else {
+		*lcp = curbdevswref[bmajor].dr_lc;
+		localcount_acquire(*lcp);
+	}
+
+out:	pserialize_read_exit(s);
+	return bdev;
+}
+
+static void
+bdevsw_release(const struct bdevsw *bdev, struct localcount *lc)
+{
+
+	if (lc == NULL)
+		return;
+	localcount_release(lc, &devsw_cv, &device_lock);
 }
 
 /*
@@ -384,10 +526,60 @@ cdevsw_lookup(dev_t dev)
 	if (dev == NODEV)
 		return (NULL);
 	cmajor = major(dev);
-	if (cmajor < 0 || cmajor >= max_cdevsws)
+	if (cmajor < 0 || cmajor >= atomic_load_relaxed(&max_cdevsws))
 		return (NULL);
 
-	return (cdevsw[cmajor]);
+	return atomic_load_consume(&cdevsw)[cmajor];
+}
+
+static const struct cdevsw *
+cdevsw_lookup_acquire(dev_t dev, struct localcount **lcp)
+{
+	devmajor_t cmajor;
+	const struct cdevsw *cdev = NULL, *const *curcdevsw;
+	struct devswref *curcdevswref;
+	int s;
+
+	if (dev == NODEV)
+		return NULL;
+	cmajor = major(dev);
+	if (cmajor < 0)
+		return NULL;
+
+	s = pserialize_read_enter();
+
+	/*
+	 * max_cdevsws never goes down, so it is safe to rely on this
+	 * condition without any locking for the array access below.
+	 * Test sys_cdevsws first so we can avoid the memory barrier in
+	 * that case.
+	 */
+	if (cmajor >= sys_cdevsws &&
+	    cmajor >= atomic_load_acquire(&max_cdevsws))
+		goto out;
+	curcdevsw = atomic_load_consume(&cdevsw);
+	if ((cdev = atomic_load_consume(&curcdevsw[cmajor])) == NULL)
+		goto out;
+
+	curcdevswref = atomic_load_consume(&cdevswref);
+	if (curcdevswref == NULL || !curcdevswref[cmajor].dr_dynamic) {
+		*lcp = NULL;
+	} else {
+		*lcp = curcdevswref[cmajor].dr_lc;
+		localcount_acquire(*lcp);
+	}
+
+out:	pserialize_read_exit(s);
+	return cdev;
+}
+
+static void
+cdevsw_release(const struct cdevsw *cdev, struct localcount *lc)
+{
+
+	if (lc == NULL)
+		return;
+	localcount_release(lc, &devsw_cv, &device_lock);
 }
 
 /*
@@ -399,10 +591,13 @@ cdevsw_lookup(dev_t dev)
 devmajor_t
 bdevsw_lookup_major(const struct bdevsw *bdev)
 {
-	devmajor_t bmajor;
+	const struct bdevsw *const *curbdevsw;
+	devmajor_t bmajor, bmax;
 
-	for (bmajor = 0 ; bmajor < max_bdevsws ; bmajor++) {
-		if (bdevsw[bmajor] == bdev)
+	bmax = atomic_load_acquire(&max_bdevsws);
+	curbdevsw = atomic_load_consume(&bdevsw);
+	for (bmajor = 0; bmajor < bmax; bmajor++) {
+		if (atomic_load_relaxed(&curbdevsw[bmajor]) == bdev)
 			return (bmajor);
 	}
 
@@ -418,10 +613,13 @@ bdevsw_lookup_major(const struct bdevsw *bdev)
 devmajor_t
 cdevsw_lookup_major(const struct cdevsw *cdev)
 {
-	devmajor_t cmajor;
+	const struct cdevsw *const *curcdevsw;
+	devmajor_t cmajor, cmax;
 
-	for (cmajor = 0 ; cmajor < max_cdevsws ; cmajor++) {
-		if (cdevsw[cmajor] == cdev)
+	cmax = atomic_load_acquire(&max_cdevsws);
+	curcdevsw = atomic_load_consume(&cdevsw);
+	for (cmajor = 0; cmajor < cmax; cmajor++) {
+		if (atomic_load_relaxed(&curcdevsw[cmajor]) == cdev)
 			return (cmajor);
 	}
 
@@ -696,15 +894,10 @@ int
 bdev_open(dev_t dev, int flag, int devtype, lwp_t *l)
 {
 	const struct bdevsw *d;
+	struct localcount *lc;
 	int rv, mpflag;
 
-	/*
-	 * For open we need to lock, in order to synchronize
-	 * with attach/detach.
-	 */
-	mutex_enter(&device_lock);
-	d = bdevsw_lookup(dev);
-	mutex_exit(&device_lock);
+	d = bdevsw_lookup_acquire(dev, &lc);
 	if (d == NULL)
 		return ENXIO;
 
@@ -712,6 +905,8 @@ bdev_open(dev_t dev, int flag, int devtype, lwp_t *l)
 	rv = (*d->d_open)(dev, flag, devtype, l);
 	DEV_UNLOCK(d);
 
+	bdevsw_release(d, lc);
+
 	return rv;
 }
 
@@ -854,15 +1049,10 @@ int
 cdev_open(dev_t dev, int flag, int devtype, lwp_t *l)
 {
 	const struct cdevsw *d;
+	struct localcount *lc;
 	int rv, mpflag;
 
-	/*
-	 * For open we need to lock, in order to synchronize
-	 * with attach/detach.
-	 */
-	mutex_enter(&device_lock);
-	d = cdevsw_lookup(dev);
-	mutex_exit(&device_lock);
+	d = cdevsw_lookup_acquire(dev, &lc);
 	if (d == NULL)
 		return ENXIO;
 
@@ -870,6 +1060,8 @@ cdev_open(dev_t dev, int flag, int devtype, lwp_t *l)
 	rv = (*d->d_open)(dev, flag, devtype, l);
 	DEV_UNLOCK(d);
 
+	cdevsw_release(d, lc);
+
 	return rv;
 }
 

>From a15e7ffa1a1f91a54e62b0ebd576fbac94743613 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Sat, 15 Jan 2022 20:01:53 +0000
Subject: [PATCH 03/32] driver(9): Use xcall(9) rather than pserialize(9) for
 devsw detach.

devsw_init is called too early for the kmem(9) in pserialize_create,
but pserialize_create doesn't do anything useful anyway.
---
 sys/kern/subr_devsw.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/sys/kern/subr_devsw.c b/sys/kern/subr_devsw.c
index 9b19b89b3538..ee2bb1cb1009 100644
--- a/sys/kern/subr_devsw.c
+++ b/sys/kern/subr_devsw.c
@@ -88,6 +88,7 @@ __KERNEL_RCSID(0, "$NetBSD: subr_devsw.c,v 1.38 2017/11/07 18:35:57 christos Exp
 #include <sys/atomic.h>
 #include <sys/localcount.h>
 #include <sys/pserialize.h>
+#include <sys/xcall.h>
 
 #ifdef DEVSW_DEBUG
 #define	DPRINTF(x)	printf x
@@ -114,7 +115,6 @@ extern int max_bdevsws, max_cdevsws, max_devsw_convs;
 
 static struct devswref *cdevswref;
 static struct devswref *bdevswref;
-static struct pserialize *devsw_psz;
 static kcondvar_t devsw_cv;
 
 static int bdevsw_attach(const struct bdevsw *, devmajor_t *);
@@ -133,7 +133,6 @@ devsw_init(void)
 	KASSERT(sys_cdevsws < MAXDEVSW - 1);
 	mutex_init(&device_lock, MUTEX_DEFAULT, IPL_NONE);
 
-	devsw_psz = pserialize_create();
 	cv_init(&devsw_cv, "devsw");
 }
 
@@ -406,8 +405,11 @@ devsw_detach_locked(const struct bdevsw *bdev, const struct cdevsw *cdev)
 	/*
 	 * Wait for all bdevsw_lookup_acquire, cdevsw_lookup_acquire
 	 * calls to notice that the devsw is gone.
+	 *
+	 * XXX Can't use pserialize_perform here because devsw_init is
+	 * too early for pserialize_create().
 	 */
-	pserialize_perform(devsw_psz);
+	xc_barrier(0);
 
 	/*
 	 * Wait for all references to drain.  It is the caller's

>From e41262e7897297d247dc1441e0b47bb01512a49d Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Wed, 12 Jan 2022 14:07:35 +0000
Subject: [PATCH 04/32] autoconf(9): New localcount-based device instance
 references.

device_lookup_acquire looks up an autoconf device instance, if found,
and acquires a reference the caller must release with device_release.
If attach or detach is still in progress, device_lookup_acquire waits
until it completes.  While references are held, the device's softc
will not be freed or reused until the last reference is released.

The reference is meant to be held while opening a device in the short
term, and then to be passed off to a longer-term reference that can
be broken explicitly by detach -- usually a device special vnode,
which is broken by vdevgone in the driver's *_detach function.

Sleeping while holding a reference is allowed, e.g. waiting to open a
tty.  A driver must arrange that its *_detach function will interrupt
any threads sleeping while holding references and cause them to back
out so that detach can complete promptly.

Subsequent changes to subr_devsw.c will make bdev_open and cdev_open
automatically take a reference to an autoconf instance for drivers
that opt into this, so there will be no logic changes needed in most
drivers other than to connect the autoconf cfdriver to the
bdevsw/cdevsw I/O operation tables.  The effect will be that *_detach
may run while d_open is in progress, but no new d_open can begin
until *_detach has backed out from or committed to detaching.
---
 sys/kern/subr_autoconf.c | 175 ++++++++++++++++++++++++++++++++++++---
 sys/sys/device.h         |   6 ++
 2 files changed, 169 insertions(+), 12 deletions(-)

diff --git a/sys/kern/subr_autoconf.c b/sys/kern/subr_autoconf.c
index 8439ae8c6a00..8c0b66556e0f 100644
--- a/sys/kern/subr_autoconf.c
+++ b/sys/kern/subr_autoconf.c
@@ -108,6 +108,7 @@ __KERNEL_RCSID(0, "$NetBSD: subr_autoconf.c,v 1.291 2021/12/31 14:19:57 riastrad
 #include <sys/cpu.h>
 #include <sys/sysctl.h>
 #include <sys/stdarg.h>
+#include <sys/localcount.h>
 
 #include <sys/disk.h>
 
@@ -1453,6 +1454,9 @@ config_devdelete(device_t dev)
 	if (dg->dg_devs != NULL)
 		kmem_free(dg->dg_devs, sizeof(device_t) * dg->dg_ndevs);
 
+	localcount_fini(dev->dv_localcount);
+	kmem_free(dev->dv_localcount, sizeof(*dev->dv_localcount));
+
 	cv_destroy(&dvl->dvl_cv);
 	mutex_destroy(&dvl->dvl_mtx);
 
@@ -1556,6 +1560,7 @@ config_devalloc(const device_t parent, const cfdata_t cf,
 	dev->dv_activity_handlers = NULL;
 	dev->dv_private = dev_private;
 	dev->dv_flags = ca->ca_flags;	/* inherit flags from class */
+	dev->dv_attaching = curlwp;
 
 	myunit = config_unit_alloc(dev, cd, cf);
 	if (myunit == -1) {
@@ -1604,6 +1609,10 @@ config_devalloc(const device_t parent, const cfdata_t cf,
 		    "device-parent", device_xname(parent));
 	}
 
+	dev->dv_localcount = kmem_zalloc(sizeof(*dev->dv_localcount),
+	    KM_SLEEP);
+	localcount_init(dev->dv_localcount);
+
 	if (dev->dv_cfdriver->cd_attrs != NULL)
 		config_add_attrib_dict(dev);
 
@@ -1755,8 +1764,29 @@ config_attach_internal(device_t parent, cfdata_t cf, void *aux, cfprint_t print,
 	/* Let userland know */
 	devmon_report_device(dev, true);
 
+	/*
+	 * Prevent detach until the driver's attach function, and all
+	 * deferred actions, have finished.
+	 */
 	config_pending_incr(dev);
+
+	/* Call the driver's attach function.  */
 	(*dev->dv_cfattach->ca_attach)(parent, dev, aux);
+
+	/*
+	 * Allow other threads to acquire references to the device now
+	 * that the driver's attach function is done.
+	 */
+	mutex_enter(&config_misc_lock);
+	KASSERT(dev->dv_attaching == curlwp);
+	dev->dv_attaching = NULL;
+	cv_broadcast(&config_misc_cv);
+	mutex_exit(&config_misc_lock);
+
+	/*
+	 * Synchronous parts of attach are done.  Allow detach, unless
+	 * the driver's attach function scheduled deferred actions.
+	 */
 	config_pending_decr(dev);
 
 	mutex_enter(&config_misc_lock);
@@ -1822,8 +1852,29 @@ config_attach_pseudo(cfdata_t cf)
 	/* Let userland know */
 	devmon_report_device(dev, true);
 
+	/*
+	 * Prevent detach until the driver's attach function, and all
+	 * deferred actions, have finished.
+	 */
 	config_pending_incr(dev);
+
+	/* Call the driver's attach function.  */
 	(*dev->dv_cfattach->ca_attach)(ROOT, dev, NULL);
+
+	/*
+	 * Allow other threads to acquire references to the device now
+	 * that the driver's attach function is done.
+	 */
+	mutex_enter(&config_misc_lock);
+	KASSERT(dev->dv_attaching == curlwp);
+	dev->dv_attaching = NULL;
+	cv_broadcast(&config_misc_cv);
+	mutex_exit(&config_misc_lock);
+
+	/*
+	 * Synchronous parts of attach are done.  Allow detach, unless
+	 * the driver's attach function scheduled deferred actions.
+	 */
 	config_pending_decr(dev);
 
 	config_process_deferred(&deferred_config_queue, dev);
@@ -1872,24 +1923,39 @@ config_dump_garbage(struct devicelist *garbage)
 static int
 config_detach_enter(device_t dev)
 {
-	int error;
+	int error = 0;
 
 	mutex_enter(&config_misc_lock);
-	for (;;) {
-		if (dev->dv_pending == 0 && dev->dv_detaching == NULL) {
-			dev->dv_detaching = curlwp;
-			error = 0;
-			break;
-		}
+
+	/*
+	 * Wait until attach has fully completed, and until any
+	 * concurrent detach (e.g., drvctl racing with USB event
+	 * thread) has completed.
+	 *
+	 * Caller must hold alldevs_nread or alldevs_nwrite (e.g., via
+	 * deviter) to ensure the winner of the race doesn't free the
+	 * device leading the loser of the race into use-after-free.
+	 *
+	 * XXX Not all callers do this!
+	 */
+	while (dev->dv_pending || dev->dv_detaching) {
 		KASSERTMSG(dev->dv_detaching != curlwp,
 		    "recursively detaching %s", device_xname(dev));
 		error = cv_wait_sig(&config_misc_cv, &config_misc_lock);
 		if (error)
-			break;
+			goto out;
 	}
-	KASSERT(error || dev->dv_detaching == curlwp);
-	mutex_exit(&config_misc_lock);
 
+	/*
+	 * Attach has completed, and no other concurrent detach is
+	 * running.  Claim the device for detaching.  This will cause
+	 * all new attempts to acquire references to block.
+	 */
+	KASSERT(dev->dv_attaching == NULL);
+	KASSERT(dev->dv_detaching == NULL);
+	dev->dv_detaching = curlwp;
+
+out:	mutex_exit(&config_misc_lock);
 	return error;
 }
 
@@ -1980,9 +2046,10 @@ config_detach(device_t dev, int flags)
 	 */
 	if (rv == 0)
 		dev->dv_flags &= ~DVF_ACTIVE;
-	else if ((flags & DETACH_FORCE) == 0)
+	else if ((flags & DETACH_FORCE) == 0) {
+		/* Detach failed -- likely EBUSY.  */
 		goto out;
-	else {
+	} else {
 		panic("config_detach: forced detach of %s failed (%d)",
 		    device_xname(dev), rv);
 	}
@@ -1991,6 +2058,19 @@ config_detach(device_t dev, int flags)
 	 * The device has now been successfully detached.
 	 */
 
+	/*
+	 * Wait for all device_lookup_acquire references -- mostly, for
+	 * all attempts to open the device -- to drain.  It is the
+	 * responsibility of .ca_detach to ensure anything with open
+	 * references will be interrupted and release them promptly,
+	 * not block indefinitely.  All new attempts to acquire
+	 * references will block until dv_detaching clears.
+	 */
+	mutex_enter(&config_misc_lock);
+	localcount_drain(dev->dv_localcount,
+	    &config_misc_cv, &config_misc_lock);
+	mutex_exit(&config_misc_lock);
+
 	/* Let userland know */
 	devmon_report_device(dev, false);
 
@@ -2498,6 +2578,14 @@ config_alldevs_exit(struct alldevs_foray *af)
  * device_lookup:
  *
  *	Look up a device instance for a given driver.
+ *
+ *	Caller is responsible for ensuring the device's state is
+ *	stable, either by holding a reference already obtained with
+ *	device_lookup_acquire or by otherwise ensuring the device is
+ *	attached and can't be detached (e.g., holding an open device
+ *	node and ensuring *_detach calls vdevgone).
+ *
+ *	XXX Find a way to assert this.
  */
 device_t
 device_lookup(cfdriver_t cd, int unit)
@@ -2526,6 +2614,69 @@ device_lookup_private(cfdriver_t cd, int unit)
 	return device_private(device_lookup(cd, unit));
 }
 
+/*
+ * device_lookup_acquire:
+ *
+ *	Look up a device instance for a given driver, and return a
+ *	reference to it that must be released by device_release.
+ *
+ *	=> If the device is still attaching, blocks until *_attach has
+ *	   returned.
+ *
+ *	=> If the device is detaching, blocks until *_detach has
+ *	   returned.  May succeed or fail in that case, depending on
+ *	   whether *_detach has backed out (EBUSY) or committed to
+ *	   detaching.
+ */
+device_t
+device_lookup_acquire(cfdriver_t cd, int unit)
+{
+	device_t dv;
+
+	/* XXX This should have a pserialized fast path -- TBD.  */
+	mutex_enter(&config_misc_lock);
+	mutex_enter(&alldevs_lock);
+retry:	if (unit < 0 || unit >= cd->cd_ndevs ||
+	    (dv = cd->cd_devs[unit]) == NULL ||
+	    dv->dv_del_gen != 0) {
+		dv = NULL;
+	} else {
+		/*
+		 * Wait for the device to stabilize, if attaching or
+		 * detaching.  Either way we must wait for *_attach or
+		 * *_detach to complete, and either way we must retry:
+		 * even if detaching, *_detach might fail (EBUSY) so
+		 * the device may still be there.
+		 */
+		if ((dv->dv_attaching != NULL && dv->dv_attaching != curlwp) ||
+		    dv->dv_detaching != NULL) {
+			mutex_exit(&alldevs_lock);
+			cv_wait(&config_misc_cv, &config_misc_lock);
+			mutex_enter(&alldevs_lock);
+			goto retry;
+		}
+		localcount_acquire(dv->dv_localcount);
+	}
+	mutex_exit(&alldevs_lock);
+	mutex_exit(&config_misc_lock);
+
+	return dv;
+}
+
+/*
+ * device_release:
+ *
+ *	Release a reference to a device acquired with
+ *	device_lookup_acquire.
+ */
+void
+device_release(device_t dv)
+{
+
+	localcount_release(dv->dv_localcount,
+	    &config_misc_cv, &config_misc_lock);
+}
+
 /*
  * device_find_by_xname:
  *
diff --git a/sys/sys/device.h b/sys/sys/device.h
index 3bd4a6c3abf7..e685419d4925 100644
--- a/sys/sys/device.h
+++ b/sys/sys/device.h
@@ -274,10 +274,12 @@ struct device {
 	void		*dv_private;	/* this device's private storage */
 	int		*dv_locators;	/* our actual locators (optional) */
 	prop_dictionary_t dv_properties;/* properties dictionary */
+	struct localcount *dv_localcount;/* reference count */
 
 	int		dv_pending;	/* config_pending count */
 	TAILQ_ENTRY(device) dv_pending_list;
 
+	struct lwp	*dv_attaching;	/* thread not yet finished in attach */
 	struct lwp	*dv_detaching;	/* detach lock (config_misc_lock/cv) */
 
 	size_t		dv_activity_count;
@@ -651,6 +653,10 @@ void	null_childdetached(device_t, device_t);
 
 device_t	device_lookup(cfdriver_t, int);
 void		*device_lookup_private(cfdriver_t, int);
+
+device_t	device_lookup_acquire(cfdriver_t, int);
+void		device_release(device_t);
+
 void		device_register(device_t, void *);
 void		device_register_post_config(device_t, void *);
 

>From b41960975941304bfea3886894dc28182bf25d32 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Wed, 12 Jan 2022 14:18:57 +0000
Subject: [PATCH 05/32] driver(9): New devsw members d_cfdriver, d_devtounit.

If set, then bdev_open/cdev_open will use d_devtounit to map the
dev_t to an autoconf instance (e.g., /dev/wd0a -> wd0) and hold a
reference with device_lookup_acquire across the call to d_open.

This guarantees that the autoconf instance cannot be detached while
the devsw's d_open function is trying to open it (and also that the
autoconf instance has finished *_attach before anyone can open it).

Of course, if the underlying hardware has gone away, there will be
I/O errors, but this avoids software synchronization bugs between
open and detach for drivers that opt into it.  It's up to the driver
and bus to figure out how to deal with I/O errors from operations on
hardware that has gone away while the software hasn't finished
notifying everything that it's gone yet.
---
 sys/kern/subr_devsw.c | 49 +++++++++++++++++++++++++++++++++++++++++--
 sys/sys/conf.h        |  4 ++++
 2 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/sys/kern/subr_devsw.c b/sys/kern/subr_devsw.c
index ee2bb1cb1009..5b1c8d042a59 100644
--- a/sys/kern/subr_devsw.c
+++ b/sys/kern/subr_devsw.c
@@ -89,6 +89,7 @@ __KERNEL_RCSID(0, "$NetBSD: subr_devsw.c,v 1.38 2017/11/07 18:35:57 christos Exp
 #include <sys/localcount.h>
 #include <sys/pserialize.h>
 #include <sys/xcall.h>
+#include <sys/device.h>
 
 #ifdef DEVSW_DEBUG
 #define	DPRINTF(x)	printf x
@@ -897,16 +898,38 @@ bdev_open(dev_t dev, int flag, int devtype, lwp_t *l)
 {
 	const struct bdevsw *d;
 	struct localcount *lc;
-	int rv, mpflag;
+	device_t dv = NULL/*XXXGCC*/;
+	int unit, rv, mpflag;
 
 	d = bdevsw_lookup_acquire(dev, &lc);
 	if (d == NULL)
 		return ENXIO;
 
+	if (d->d_devtounit) {
+		/*
+		 * If the device node corresponds to an autoconf device
+		 * instance, acquire a reference to it so that during
+		 * d_open, device_lookup is stable.
+		 *
+		 * XXX This should also arrange to instantiate cloning
+		 * pseudo-devices if appropriate, but that requires
+		 * reviewing them all to find and verify a common
+		 * pattern.
+		 */
+		if ((unit = (*d->d_devtounit)(dev)) == -1)
+			return ENXIO;
+		if ((dv = device_lookup_acquire(d->d_cfdriver, unit)) == NULL)
+			return ENXIO;
+	}
+
 	DEV_LOCK(d);
 	rv = (*d->d_open)(dev, flag, devtype, l);
 	DEV_UNLOCK(d);
 
+	if (d->d_devtounit) {
+		device_release(dv);
+	}
+
 	bdevsw_release(d, lc);
 
 	return rv;
@@ -1052,16 +1075,38 @@ cdev_open(dev_t dev, int flag, int devtype, lwp_t *l)
 {
 	const struct cdevsw *d;
 	struct localcount *lc;
-	int rv, mpflag;
+	device_t dv = NULL/*XXXGCC*/;
+	int unit, rv, mpflag;
 
 	d = cdevsw_lookup_acquire(dev, &lc);
 	if (d == NULL)
 		return ENXIO;
 
+	if (d->d_devtounit) {
+		/*
+		 * If the device node corresponds to an autoconf device
+		 * instance, acquire a reference to it so that during
+		 * d_open, device_lookup is stable.
+		 *
+		 * XXX This should also arrange to instantiate cloning
+		 * pseudo-devices if appropriate, but that requires
+		 * reviewing them all to find and verify a common
+		 * pattern.
+		 */
+		if ((unit = (*d->d_devtounit)(dev)) == -1)
+			return ENXIO;
+		if ((dv = device_lookup_acquire(d->d_cfdriver, unit)) == NULL)
+			return ENXIO;
+	}
+
 	DEV_LOCK(d);
 	rv = (*d->d_open)(dev, flag, devtype, l);
 	DEV_UNLOCK(d);
 
+	if (d->d_devtounit) {
+		device_release(dv);
+	}
+
 	cdevsw_release(d, lc);
 
 	return rv;
diff --git a/sys/sys/conf.h b/sys/sys/conf.h
index ba46976a924b..12cfd3bf89a6 100644
--- a/sys/sys/conf.h
+++ b/sys/sys/conf.h
@@ -76,6 +76,8 @@ struct bdevsw {
 	int		(*d_dump)(dev_t, daddr_t, void *, size_t);
 	int		(*d_psize)(dev_t);
 	int		(*d_discard)(dev_t, off_t, off_t);
+	int		(*d_devtounit)(dev_t);
+	struct cfdriver	*d_cfdriver;
 	int		d_flag;
 };
 
@@ -94,6 +96,8 @@ struct cdevsw {
 	paddr_t		(*d_mmap)(dev_t, off_t, int);
 	int		(*d_kqfilter)(dev_t, struct knote *);
 	int		(*d_discard)(dev_t, off_t, off_t);
+	int		(*d_devtounit)(dev_t);
+	struct cfdriver	*d_cfdriver;
 	int		d_flag;
 };
 

>From 40995286ab92a1bfbd067cac193ebde5ad753d52 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Sat, 15 Jan 2022 19:11:10 +0000
Subject: [PATCH 06/32] disk(9): New function disklabel_dev_unit.

Maps a dev_t like wd3e to an autoconf instance number like 3, with no
partition.  Same as DISKUNIT macro, but is a symbol whose pointer can
be taken.  Meant for use with struct bdevsw, cdevsw::d_devtounit.
---
 sys/kern/subr_disk.c | 7 +++++++
 sys/sys/disklabel.h  | 1 +
 2 files changed, 8 insertions(+)

diff --git a/sys/kern/subr_disk.c b/sys/kern/subr_disk.c
index da664f920382..41218421db57 100644
--- a/sys/kern/subr_disk.c
+++ b/sys/kern/subr_disk.c
@@ -728,3 +728,10 @@ disk_set_info(device_t dev, struct disk *dk, const char *type)
 	if (odisk_info)
 		prop_object_release(odisk_info);
 }
+
+int
+disklabel_dev_unit(dev_t dev)
+{
+
+	return DISKUNIT(dev);
+}
diff --git a/sys/sys/disklabel.h b/sys/sys/disklabel.h
index 4e94b8671332..853cdbe668a3 100644
--- a/sys/sys/disklabel.h
+++ b/sys/sys/disklabel.h
@@ -509,6 +509,7 @@ const char *convertdisklabel(struct disklabel *, void (*)(struct buf *),
 int	 bounds_check_with_label(struct disk *, struct buf *, int);
 int	 bounds_check_with_mediasize(struct buf *, int, uint64_t);
 const char *getfstypename(int);
+int	disklabel_dev_unit(dev_t);
 #endif
 #endif /* _LOCORE */
 

>From 19708d9508604481b314758f98f8b531083bbfc4 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Sat, 15 Jan 2022 19:18:09 +0000
Subject: [PATCH 07/32] driver(9): New function dev_minor_unit.

---
 sys/kern/subr_devsw.c | 15 +++++++++++++++
 sys/sys/conf.h        |  1 +
 2 files changed, 16 insertions(+)

diff --git a/sys/kern/subr_devsw.c b/sys/kern/subr_devsw.c
index 5b1c8d042a59..8b55187b32c1 100644
--- a/sys/kern/subr_devsw.c
+++ b/sys/kern/subr_devsw.c
@@ -1301,3 +1301,18 @@ nommap(dev_t dev, off_t off, int prot)
 
 	return (paddr_t)-1;
 }
+
+/*
+ * dev_minor_unit(dev)
+ *
+ *	Returns minor(dev) as an int.  Intended for use with struct
+ *	bdevsw, cdevsw::d_devtounit for drivers whose /dev nodes are
+ *	implemented by reference to an autoconf instance with the minor
+ *	number.
+ */
+int
+dev_minor_unit(dev_t dev)
+{
+
+	return minor(dev);
+}
diff --git a/sys/sys/conf.h b/sys/sys/conf.h
index 12cfd3bf89a6..cb4c287d1982 100644
--- a/sys/sys/conf.h
+++ b/sys/sys/conf.h
@@ -280,6 +280,7 @@ devmajor_t devsw_name2blk(const char *, char *, size_t);
 devmajor_t devsw_name2chr(const char *, char *, size_t);
 dev_t devsw_chr2blk(dev_t);
 dev_t devsw_blk2chr(dev_t);
+int dev_minor_unit(dev_t);
 
 void mm_init(void);
 #endif /* _KERNEL */

>From 53f7008baf7ef6c2c6c20ba4bdf03fbc5fbc765b Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Fri, 21 Jan 2022 03:07:36 +0000
Subject: [PATCH 08/32] driver(9): Eliminate D_MCLOSE.

D_MCLOSE was introduced a few years ago by mistake for audio(4),
which should have used -- and now does use -- fd_clone to create
per-open state.  The semantics was originally to call close once
every time the device node is closed, not only for the last close.
Nothing uses it any more, and it complicates reasoning about the
system, so let's simplify it away.
---
 sys/miscfs/specfs/spec_vnops.c | 2 +-
 sys/sys/conf.h                 | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index b4bc4c34ab03..7c51f9ef854d 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -1276,7 +1276,7 @@ spec_close(void *v)
 		sd->sd_bdevvp = NULL;
 	mutex_exit(&device_lock);
 
-	if (count != 0 && (vp->v_type != VCHR || !(cdev_flags(dev) & D_MCLOSE)))
+	if (count != 0)
 		return 0;
 
 	/*
diff --git a/sys/sys/conf.h b/sys/sys/conf.h
index cb4c287d1982..16dd87e5480c 100644
--- a/sys/sys/conf.h
+++ b/sys/sys/conf.h
@@ -63,7 +63,7 @@ struct vnode;
 #define	D_TYPEMASK	0x00ff
 #define	D_MPSAFE	0x0100
 #define	D_NEGOFFSAFE	0x0200
-#define	D_MCLOSE	0x0400
+#define	D_UNUSED0	0x0400	/* was D_MCLOSE */
 
 /*
  * Block device switch table

>From fa164452f13259200914eea59196643b25b15745 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Wed, 19 Jan 2022 00:03:49 +0000
Subject: [PATCH 09/32] specfs: Call bdev_open without the vnode lock.

There is no need for it to serialize opens, because they are already
serialized by sd_opencnt which for block devices is always either 0
or 1.

There's not obviously any other reason why the vnode lock should be
held across bdev_open, other than that it might be nice to avoid
dropping it if not necessary.  For character devices we always have
to drop the vnode lock because open might hang indefinitely, when
opening a tty, which is not allowed while holding the vnode lock.
---
 sys/miscfs/specfs/spec_vnops.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index 7c51f9ef854d..a85db1c3cf0b 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -609,6 +609,7 @@ spec_open(void *v)
 		sd->sd_opencnt = 1;
 		sd->sd_bdevvp = vp;
 		mutex_exit(&device_lock);
+		VOP_UNLOCK(vp);
 		do {
 			const struct bdevsw *bdev;
 
@@ -628,13 +629,10 @@ spec_open(void *v)
 			if ((name = bdevsw_getname(major(dev))) == NULL)
 				break;
 
-			VOP_UNLOCK(vp);
-
                         /* Try to autoload device module */
 			(void) module_autoload(name, MODULE_CLASS_DRIVER);
-			
-			vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
 		} while (gen != module_gen);
+		vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
 
 		break;
 

>From 2bab207569850967027369fdf77308d7dd24dfa2 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Wed, 19 Jan 2022 00:10:19 +0000
Subject: [PATCH 10/32] specfs: Assert v_type is VBLK or VCHR in spec_open.

Nothing else makes sense.  Prune dead branches (and replace default
case by panic).
---
 sys/miscfs/specfs/spec_vnops.c | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index a85db1c3cf0b..1c0d601fe58a 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -517,7 +517,10 @@ spec_open(void *v)
 	sd = sn->sn_dev;
 	name = NULL;
 	gen = 0;
-	
+
+	KASSERTMSG(vp->v_type == VBLK || vp->v_type == VCHR, "type=%d",
+	    vp->v_type);
+
 	/*
 	 * Don't allow open if fs is mounted -nodev.
 	 */
@@ -636,15 +639,8 @@ spec_open(void *v)
 
 		break;
 
-	case VNON:
-	case VLNK:
-	case VDIR:
-	case VREG:
-	case VBAD:
-	case VFIFO:
-	case VSOCK:
 	default:
-		return 0;
+		panic("invalid specfs vnode type: %d", vp->v_type);
 	}
 
 	mutex_enter(&device_lock);

>From 719c97a65ad64a67705676ee6d5e7f6a9b695c09 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Wed, 19 Jan 2022 00:11:49 +0000
Subject: [PATCH 11/32] specfs: Factor common kauth check out of switch in
 spec_open.

No functional change.
---
 sys/miscfs/specfs/spec_vnops.c | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index 1c0d601fe58a..ff9564a5a2b4 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -538,13 +538,12 @@ spec_open(void *v)
 		req = KAUTH_REQ_DEVICE_RAWIO_SPEC_READ;
 		break;
 	}
+	error = kauth_authorize_device_spec(ap->a_cred, req, vp);
+	if (error != 0)
+		return (error);
 
 	switch (vp->v_type) {
 	case VCHR:
-		error = kauth_authorize_device_spec(ap->a_cred, req, vp);
-		if (error != 0)
-			return (error);
-
 		/*
 		 * Character devices can accept opens from multiple
 		 * vnodes.
@@ -587,10 +586,6 @@ spec_open(void *v)
 		break;
 
 	case VBLK:
-		error = kauth_authorize_device_spec(ap->a_cred, req, vp);
-		if (error != 0)
-			return (error);
-
 		/*
 		 * For block devices, permit only one open.  The buffer
 		 * cache cannot remain self-consistent with multiple

>From ac9c94d4ff717700bc377ba413cdd7284473b6cd Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Wed, 19 Jan 2022 00:34:59 +0000
Subject: [PATCH 12/32] specfs: Split spec_open switch into three sections.

The sections are now:

1. Acquire open reference.

1a (intermezzo). Set VV_ISTTY.

2. Drop the vnode lock to call .d_open and autoload modules if
   necessary.

3. Handle concurrent revoke if it happenend, or release open reference
   if .d_open failed.

No functional change.  Sprinkle comments about problems.
---
 sys/miscfs/specfs/spec_vnops.c | 132 +++++++++++++++++++++++++++------
 1 file changed, 110 insertions(+), 22 deletions(-)

diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index ff9564a5a2b4..4244859095a5 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -542,6 +542,20 @@ spec_open(void *v)
 	if (error != 0)
 		return (error);
 
+	/*
+	 * Acquire an open reference -- as long as we hold onto it, and
+	 * the vnode isn't revoked, it can't be closed.
+	 *
+	 * But first check whether it has been revoked -- if so, we
+	 * can't acquire more open references and we must fail
+	 * immediately with EBADF.
+	 *
+	 * XXX This races with revoke: once we release the vnode lock,
+	 * the vnode may be revoked, and the .d_close callback run, at
+	 * the same time as we're calling .d_open here.  Drivers
+	 * shouldn't have to contemplate this scenario; .d_open and
+	 * .d_close should be prevented from running concurrently.
+	 */
 	switch (vp->v_type) {
 	case VCHR:
 		/*
@@ -556,8 +570,73 @@ spec_open(void *v)
 		sd->sd_opencnt++;
 		sn->sn_opencnt++;
 		mutex_exit(&device_lock);
+		break;
+	case VBLK:
+		/*
+		 * For block devices, permit only one open.  The buffer
+		 * cache cannot remain self-consistent with multiple
+		 * vnodes holding a block device open.
+		 *
+		 * Treat zero opencnt with non-NULL mountpoint as open.
+		 * This may happen after forced detach of a mounted device.
+		 */
+		mutex_enter(&device_lock);
+		if (sn->sn_gone) {
+			mutex_exit(&device_lock);
+			return (EBADF);
+		}
+		if (sd->sd_opencnt != 0 || sd->sd_mountpoint != NULL) {
+			mutex_exit(&device_lock);
+			return EBUSY;
+		}
+		sn->sn_opencnt = 1;
+		sd->sd_opencnt = 1;
+		sd->sd_bdevvp = vp;
+		mutex_exit(&device_lock);
+		break;
+	default:
+		panic("invalid specfs vnode type: %d", vp->v_type);
+	}
+
+	/*
+	 * Set VV_ISTTY if this is a tty cdev.
+	 *
+	 * XXX This does the wrong thing if the module has to be
+	 * autoloaded.  We should maybe set this after autoloading
+	 * modules and calling .d_open successfully, except (a) we need
+	 * the vnode lock to touch it, and (b) once we acquire the
+	 * vnode lock again, the vnode may have been revoked, and
+	 * deadfs's dead_read needs VV_ISTTY to be already set in order
+	 * to return the right answer.  So this needs some additional
+	 * synchronization to be made to work correctly with tty driver
+	 * module autoload.  For now, let's just hope it doesn't cause
+	 * too much trouble for a tty from an autoloaded driver module
+	 * to fail with EIO instead of returning EOF.
+	 */
+	if (vp->v_type == VCHR) {
 		if (cdev_type(dev) == D_TTY)
 			vp->v_vflag |= VV_ISTTY;
+	}
+
+	/*
+	 * Open the device.  If .d_open returns ENXIO (device not
+	 * configured), the driver may not be loaded, so try
+	 * autoloading a module and then try .d_open again if anything
+	 * got loaded.
+	 *
+	 * Because the opening the device may block indefinitely,
+	 * e.g. when opening a tty, and loading a module may cross into
+	 * many other subsystems, we must not hold the vnode lock while
+	 * calling .d_open, so release it now and reacquire it when
+	 * done.
+	 *
+	 * XXX The vnode may be revoked concurrently while we have the
+	 * vnode lock released.  If this happens, the sn and sd
+	 * pointers may be invalidated, but we don't do anything to
+	 * avoid touching them after we're done here.
+	 */
+	switch (vp->v_type) {
+	case VCHR:
 		VOP_UNLOCK(vp);
 		do {
 			const struct cdevsw *cdev;
@@ -586,27 +665,6 @@ spec_open(void *v)
 		break;
 
 	case VBLK:
-		/*
-		 * For block devices, permit only one open.  The buffer
-		 * cache cannot remain self-consistent with multiple
-		 * vnodes holding a block device open.
-		 *
-		 * Treat zero opencnt with non-NULL mountpoint as open.
-		 * This may happen after forced detach of a mounted device.
-		 */
-		mutex_enter(&device_lock);
-		if (sn->sn_gone) {
-			mutex_exit(&device_lock);
-			return (EBADF);
-		}
-		if (sd->sd_opencnt != 0 || sd->sd_mountpoint != NULL) {
-			mutex_exit(&device_lock);
-			return EBUSY;
-		}
-		sn->sn_opencnt = 1;
-		sd->sd_opencnt = 1;
-		sd->sd_bdevvp = vp;
-		mutex_exit(&device_lock);
 		VOP_UNLOCK(vp);
 		do {
 			const struct bdevsw *bdev;
@@ -635,9 +693,39 @@ spec_open(void *v)
 		break;
 
 	default:
-		panic("invalid specfs vnode type: %d", vp->v_type);
+		__unreachable();
 	}
 
+	/*
+	 * If it has been revoked since we released the vnode lock and
+	 * reacquired it, then spec_node_revoke has closed it, and we
+	 * must fail with EBADF.
+	 *
+	 * Otherwise, if opening it failed, back out and release the
+	 * open reference.
+	 *
+	 * XXX This is wrong -- we might release the last open
+	 * reference here, but we don't close the device.  If only this
+	 * thread's call to open failed, that's fine, but we might
+	 * have:
+	 *
+	 *	Thread 1		Thread 2
+	 *	VOP_OPEN
+	 *	  ...
+	 *	  .d_open -> 0 (success)
+	 *	  acquire vnode lock
+	 *	  do stuff		VOP_OPEN
+	 *	  release vnode lock	...
+	 *				  .d_open -> EBUSY
+	 *	VOP_CLOSE
+	 *	  acquire vnode lock
+	 *	  --sd_opencnt != 0
+	 *	  => no .d_close
+	 *	  release vnode lock
+	 *				  acquire vnode lock
+	 *				  --sd_opencnt == 0
+	 *				  but no .d_close (***)
+	 */
 	mutex_enter(&device_lock);
 	if (sn->sn_gone) {
 		if (error == 0)

>From ab4c7bb6c36978d6942c16437c8d658a69a73f07 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Fri, 21 Jan 2022 14:44:00 +0000
Subject: [PATCH 13/32] specfs: Delete bogus comment about .d_open/.d_close at
 same time.

Annoying as it is that .d_open and .d_close can run at the same time,
it is also necessary for tty semantics, where open can block
indefinitely, and it is the responsibility of close (called via
revoke) necessary to interrupt it.
---
 sys/miscfs/specfs/spec_vnops.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index 4244859095a5..d11c28587c2b 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -549,12 +549,6 @@ spec_open(void *v)
 	 * But first check whether it has been revoked -- if so, we
 	 * can't acquire more open references and we must fail
 	 * immediately with EBADF.
-	 *
-	 * XXX This races with revoke: once we release the vnode lock,
-	 * the vnode may be revoked, and the .d_close callback run, at
-	 * the same time as we're calling .d_open here.  Drivers
-	 * shouldn't have to contemplate this scenario; .d_open and
-	 * .d_close should be prevented from running concurrently.
 	 */
 	switch (vp->v_type) {
 	case VCHR:

>From 8702ec465852937b605e09b8edcd2154d9f2eaea Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Wed, 19 Jan 2022 00:38:12 +0000
Subject: [PATCH 14/32] specfs: Factor common device_lock out of switch for
 clarity.

No functional change.
---
 sys/miscfs/specfs/spec_vnops.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index d11c28587c2b..dde032ec7fc7 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -550,20 +550,19 @@ spec_open(void *v)
 	 * can't acquire more open references and we must fail
 	 * immediately with EBADF.
 	 */
+	mutex_enter(&device_lock);
 	switch (vp->v_type) {
 	case VCHR:
 		/*
 		 * Character devices can accept opens from multiple
 		 * vnodes.
 		 */
-		mutex_enter(&device_lock);
 		if (sn->sn_gone) {
-			mutex_exit(&device_lock);
-			return (EBADF);
+			error = EBADF;
+			break;
 		}
 		sd->sd_opencnt++;
 		sn->sn_opencnt++;
-		mutex_exit(&device_lock);
 		break;
 	case VBLK:
 		/*
@@ -574,23 +573,24 @@ spec_open(void *v)
 		 * Treat zero opencnt with non-NULL mountpoint as open.
 		 * This may happen after forced detach of a mounted device.
 		 */
-		mutex_enter(&device_lock);
 		if (sn->sn_gone) {
-			mutex_exit(&device_lock);
-			return (EBADF);
+			error = EBADF;
+			break;
 		}
 		if (sd->sd_opencnt != 0 || sd->sd_mountpoint != NULL) {
-			mutex_exit(&device_lock);
-			return EBUSY;
+			error = EBUSY;
+			break;
 		}
 		sn->sn_opencnt = 1;
 		sd->sd_opencnt = 1;
 		sd->sd_bdevvp = vp;
-		mutex_exit(&device_lock);
 		break;
 	default:
 		panic("invalid specfs vnode type: %d", vp->v_type);
 	}
+	mutex_exit(&device_lock);
+	if (error)
+		return error;
 
 	/*
 	 * Set VV_ISTTY if this is a tty cdev.

>From 6b1facaaf56bfa661fd5e107054c6cc4dae5ffbe Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Wed, 19 Jan 2022 00:42:24 +0000
Subject: [PATCH 15/32] specfs: Factor VOP_UNLOCK/vn_lock out of switch for
 clarity.

No functional change.
---
 sys/miscfs/specfs/spec_vnops.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index dde032ec7fc7..2981b988ac86 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -629,9 +629,9 @@ spec_open(void *v)
 	 * pointers may be invalidated, but we don't do anything to
 	 * avoid touching them after we're done here.
 	 */
+	VOP_UNLOCK(vp);
 	switch (vp->v_type) {
 	case VCHR:
-		VOP_UNLOCK(vp);
 		do {
 			const struct cdevsw *cdev;
 
@@ -654,12 +654,9 @@ spec_open(void *v)
 			/* Try to autoload device module */
 			(void) module_autoload(name, MODULE_CLASS_DRIVER);
 		} while (gen != module_gen);
-
-		vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
 		break;
 
 	case VBLK:
-		VOP_UNLOCK(vp);
 		do {
 			const struct bdevsw *bdev;
 
@@ -682,13 +679,12 @@ spec_open(void *v)
                         /* Try to autoload device module */
 			(void) module_autoload(name, MODULE_CLASS_DRIVER);
 		} while (gen != module_gen);
-		vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
-
 		break;
 
 	default:
 		__unreachable();
 	}
+	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
 
 	/*
 	 * If it has been revoked since we released the vnode lock and

>From 184748acf4ffe145cdb3f0fddbce8cc21ba06111 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Wed, 19 Jan 2022 00:55:03 +0000
Subject: [PATCH 16/32] specfs: Reorganize D_DISK tail of spec_open and explain
 what's up.

---
 sys/miscfs/specfs/spec_vnops.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index 2981b988ac86..a6da50fb400c 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -729,15 +729,27 @@ spec_open(void *v)
 	}
 	mutex_exit(&device_lock);
 
-	if (cdev_type(dev) != D_DISK || error != 0)
+	/* If anything went wrong, we're done.  */
+	if (error)
 		return error;
 
-	
-	ioctl = vp->v_type == VCHR ? cdev_ioctl : bdev_ioctl;
-	error = (*ioctl)(vp->v_rdev, DIOCGPARTINFO, &pi, FREAD, curlwp);
-	if (error == 0)
-		uvm_vnp_setsize(vp, (voff_t)pi.pi_secsize * pi.pi_size);
+	/*
+	 * For disk devices, automagically set the vnode size to the
+	 * partition size, if we can.  This applies to block devices
+	 * and character devices alike -- every block device must have
+	 * a corresponding character device.  And if the module is
+	 * loaded it will remain loaded until we're done here (it is
+	 * forbidden to devsw_detach until closed).  So it is safe to
+	 * query cdev_type unconditionally here.
+	 */
+	if (cdev_type(dev) == D_DISK) {
+		ioctl = vp->v_type == VCHR ? cdev_ioctl : bdev_ioctl;
+		if ((*ioctl)(dev, DIOCGPARTINFO, &pi, FREAD, curlwp) == 0)
+			uvm_vnp_setsize(vp,
+			    (voff_t)pi.pi_secsize * pi.pi_size);
+	}
 
+	/* Success!  */
 	return 0;
 }
 

>From 7e77013b22e0d36b7c9bf63888c3979a7caac5da Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Wed, 19 Jan 2022 01:11:45 +0000
Subject: [PATCH 17/32] specfs: sn_gone cannot be set while we hold the vnode
 lock.

vrevoke suspends the file system, which waits for the vnode lock to
be released, before it sets sn_gone and changes v_op so nothing can
re-enter spec_open with this vnode.
---
 sys/miscfs/specfs/spec_vnops.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index a6da50fb400c..ae49dcd21a15 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -545,10 +545,6 @@ spec_open(void *v)
 	/*
 	 * Acquire an open reference -- as long as we hold onto it, and
 	 * the vnode isn't revoked, it can't be closed.
-	 *
-	 * But first check whether it has been revoked -- if so, we
-	 * can't acquire more open references and we must fail
-	 * immediately with EBADF.
 	 */
 	mutex_enter(&device_lock);
 	switch (vp->v_type) {
@@ -557,10 +553,7 @@ spec_open(void *v)
 		 * Character devices can accept opens from multiple
 		 * vnodes.
 		 */
-		if (sn->sn_gone) {
-			error = EBADF;
-			break;
-		}
+		KASSERT(!sn->sn_gone);
 		sd->sd_opencnt++;
 		sn->sn_opencnt++;
 		break;
@@ -573,10 +566,7 @@ spec_open(void *v)
 		 * Treat zero opencnt with non-NULL mountpoint as open.
 		 * This may happen after forced detach of a mounted device.
 		 */
-		if (sn->sn_gone) {
-			error = EBADF;
-			break;
-		}
+		KASSERT(!sn->sn_gone);
 		if (sd->sd_opencnt != 0 || sd->sd_mountpoint != NULL) {
 			error = EBUSY;
 			break;

>From a53eaacbce2033977609e770d7427f1c4ac2967d Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Wed, 19 Jan 2022 01:12:46 +0000
Subject: [PATCH 18/32] specfs: Factor KASSERT out of switch in spec_open.

No functional change.
---
 sys/miscfs/specfs/spec_vnops.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index ae49dcd21a15..4ed77280f723 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -547,13 +547,13 @@ spec_open(void *v)
 	 * the vnode isn't revoked, it can't be closed.
 	 */
 	mutex_enter(&device_lock);
+	KASSERT(!sn->sn_gone);
 	switch (vp->v_type) {
 	case VCHR:
 		/*
 		 * Character devices can accept opens from multiple
 		 * vnodes.
 		 */
-		KASSERT(!sn->sn_gone);
 		sd->sd_opencnt++;
 		sn->sn_opencnt++;
 		break;
@@ -566,7 +566,6 @@ spec_open(void *v)
 		 * Treat zero opencnt with non-NULL mountpoint as open.
 		 * This may happen after forced detach of a mounted device.
 		 */
-		KASSERT(!sn->sn_gone);
 		if (sd->sd_opencnt != 0 || sd->sd_mountpoint != NULL) {
 			error = EBUSY;
 			break;

>From 82020dadd82abe5a9493bbc32052a921e1ed87e8 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Wed, 19 Jan 2022 01:16:23 +0000
Subject: [PATCH 19/32] specfs: If sd_opencnt is zero, sn_opencnt had better be
 zero.

---
 sys/miscfs/specfs/spec_vnops.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index 4ed77280f723..bc6c414a3b28 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -570,6 +570,7 @@ spec_open(void *v)
 			error = EBUSY;
 			break;
 		}
+		KASSERTMSG(sn->sn_opencnt == 0, "%u", sn->sn_opencnt);
 		sn->sn_opencnt = 1;
 		sd->sd_opencnt = 1;
 		sd->sd_bdevvp = vp;

>From 85945ca9f009f902af9902aa206e99b60afc50f0 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Wed, 19 Jan 2022 02:03:31 +0000
Subject: [PATCH 20/32] specfs: Add a comment and assertion to spec_close about
 refcnts.

---
 sys/miscfs/specfs/spec_vnops.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index bc6c414a3b28..5a0f0ccdd60b 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -1338,11 +1338,20 @@ spec_close(void *v)
 		panic("spec_close: not special");
 	}
 
+	/*
+	 * Decrement the open reference count of this node and the
+	 * device.  For block devices, the open reference count must be
+	 * 1 at this point.  If the device's open reference count goes
+	 * to zero, we're the last one out so get the lights.
+	 */
 	mutex_enter(&device_lock);
 	sn->sn_opencnt--;
 	count = --sd->sd_opencnt;
-	if (vp->v_type == VBLK)
+	if (vp->v_type == VBLK) {
+		KASSERTMSG(count == 0, "block device with %u opens",
+		    count + 1);
 		sd->sd_bdevvp = NULL;
+	}
 	mutex_exit(&device_lock);
 
 	if (count != 0)

>From 4889c8b3b71c1626ba11feef57689516e5249a07 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Wed, 19 Jan 2022 01:56:05 +0000
Subject: [PATCH 21/32] specfs: Omit needless vdead_check in spec_fdiscard.

The vnode lock is held, so the vnode cannot be revoked without also
changing v_op so subsequent uses under the vnode lock will go to
deadfs's VOP_FDISCARD instead (which is genfs_eopnotsupp).
---
 sys/miscfs/specfs/spec_vnops.c | 12 +-----------
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index 5a0f0ccdd60b..42c79d10c9ff 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -939,17 +939,7 @@ spec_fdiscard(void *v)
 	dev_t dev;
 
 	vp = ap->a_vp;
-	dev = NODEV;
-
-	mutex_enter(vp->v_interlock);
-	if (vdead_check(vp, VDEAD_NOWAIT) == 0 && vp->v_specnode != NULL) {
-		dev = vp->v_rdev;
-	}
-	mutex_exit(vp->v_interlock);
-
-	if (dev == NODEV) {
-		return ENXIO;
-	}
+	dev = vp->v_rdev;
 
 	switch (vp->v_type) {
 	    case VCHR:

>From 5d92c02e9005ea1016e383a6c5f94d199c5a0108 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Wed, 19 Jan 2022 12:10:55 +0000
Subject: [PATCH 22/32] specfs: Assert specnode is open before we bdev_ioctl.

spec_node_setmountedfs should only be called while the vnode is open;
otherwise nothing can be mounted from it.
---
 sys/miscfs/specfs/spec_vnops.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index 42c79d10c9ff..2f9ff6d9d9f5 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -373,6 +373,8 @@ spec_node_setmountedfs(vnode_t *devvp, struct mount *mp)
 
 	KASSERT(devvp->v_type == VBLK);
 	KASSERT(devvp->v_specnode->sn_dev->sd_mountpoint == NULL || mp == NULL);
+	KASSERT(devvp->v_specnode->sn_opencnt);
+
 	devvp->v_specnode->sn_dev->sd_mountpoint = mp;
 	if (mp == NULL)
 		return;

>From f424ab2ceb16bbb57f830044f4b0268d89e36fb6 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Wed, 19 Jan 2022 12:52:23 +0000
Subject: [PATCH 23/32] specfs: Drain all I/O operations after last .d_close
 call.

New kind of I/O reference on specnodes, sn_iocnt.  This could be done
with psref instead; I chose a reference count instead for now because
we already have to take a per-object lock anyway, v_interlock, for
vdead_check, so another atomic is not likely to hurt much more.  We
can always change the mechanism inside spec_io_enter/exit/drain later
on.

Make sure every access to vp->v_rdev or vp->v_specnode and every call
to a devsw operation is protected either:

- by the vnode lock (with vdead_check if we unlocked/relocked),
- by positive sn_opencnt,
- by spec_io_enter/exit, or
- by sd_opencloselock and sn_opencnt management in open/close.
---
 sys/miscfs/specfs/spec_vnops.c | 338 ++++++++++++++++++++++++---------
 sys/miscfs/specfs/specdev.h    |   1 +
 2 files changed, 251 insertions(+), 88 deletions(-)

diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index 2f9ff6d9d9f5..c256183a9925 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -81,6 +81,7 @@ __KERNEL_RCSID(0, "$NetBSD: spec_vnops.c,v 1.183 2021/07/18 23:57:14 dholland Ex
 #include <sys/kauth.h>
 #include <sys/fstrans.h>
 #include <sys/module.h>
+#include <sys/atomic.h>
 
 #include <miscfs/genfs/genfs.h>
 #include <miscfs/specfs/specdev.h>
@@ -165,6 +166,7 @@ const struct vnodeopv_desc spec_vnodeop_opv_desc =
 	{ &spec_vnodeop_p, spec_vnodeop_entries };
 
 static kauth_listener_t rawio_listener;
+static struct kcondvar specfs_iocv;
 
 /* Returns true if vnode is /dev/mem or /dev/kmem. */
 bool
@@ -210,6 +212,123 @@ spec_init(void)
 
 	rawio_listener = kauth_listen_scope(KAUTH_SCOPE_DEVICE,
 	    rawio_listener_cb, NULL);
+	cv_init(&specfs_iocv, "specio");
+}
+
+/*
+ * spec_io_enter(vp, &sn, &dev)
+ *
+ *	Enter an operation that may not hold vp's vnode lock or an
+ *	fstrans on vp's mount.  Until spec_io_exit, the vnode will not
+ *	be revoked.
+ *
+ *	On success, set sn to the specnode pointer and dev to the dev_t
+ *	number and return zero.  Caller must later call spec_io_exit
+ *	when done.
+ *
+ *	On failure, return ENXIO -- the device has been revoked and no
+ *	longer exists.
+ */
+static int
+spec_io_enter(struct vnode *vp, struct specnode **snp, dev_t *devp)
+{
+	dev_t dev;
+	struct specnode *sn;
+	unsigned iocnt;
+	int error = 0;
+
+	mutex_enter(vp->v_interlock);
+
+	/*
+	 * Extract all the info we need from the vnode, unless the
+	 * vnode has already been reclaimed.  This can happen if the
+	 * underlying device has been removed and all the device nodes
+	 * for it have been revoked.  The caller may not hold a vnode
+	 * lock or fstrans to prevent this from happening before it has
+	 * had an opportunity to notice the vnode is dead.
+	 */
+	if (vdead_check(vp, VDEAD_NOWAIT) != 0 ||
+	    (sn = vp->v_specnode) == NULL ||
+	    (dev = vp->v_rdev) == NODEV) {
+		error = ENXIO;
+		goto out;
+	}
+
+	/*
+	 * Notify spec_node_revoke that we are doing an I/O operation
+	 * which may not be not bracketed by fstrans(9) and thus is not
+	 * blocked by vfs suspension.
+	 *
+	 * We could hold this reference with psref(9) instead, but we
+	 * already have to take the interlock for vdead_check, so
+	 * there's not much more cost here to another atomic operation.
+	 */
+	iocnt = atomic_inc_uint_nv(&sn->sn_iocnt);
+	CTASSERT(MAXLWP < UINT_MAX);
+	KASSERT(iocnt < UINT_MAX);
+
+	/* Success!  */
+	*snp = sn;
+	*devp = dev;
+	error = 0;
+
+out:	mutex_exit(vp->v_interlock);
+	return error;
+}
+
+/*
+ * spec_io_exit(vp, sn)
+ *
+ *	Exit an operation entered with a successful spec_io_enter --
+ *	allow concurrent spec_node_revoke to proceed.  The argument sn
+ *	must match the struct specnode pointer returned by spec_io_exit
+ *	for vp.
+ */
+static void
+spec_io_exit(struct vnode *vp, struct specnode *sn)
+{
+	unsigned iocnt;
+
+	KASSERT(vp->v_specnode == sn);
+
+	/*
+	 * We are done.  Notify spec_node_revoke if appropriate.  The
+	 * transition of 1 -> 0 must happen under device_lock so
+	 * spec_node_revoke doesn't miss a wakeup.
+	 */
+	do {
+		iocnt = atomic_load_relaxed(&sn->sn_iocnt);
+		if (iocnt == 1) {
+			mutex_enter(&device_lock);
+			if (atomic_dec_uint_nv(&sn->sn_iocnt) == 0)
+				cv_broadcast(&specfs_iocv);
+			mutex_exit(&device_lock);
+			break;
+		}
+	} while (atomic_cas_uint(&sn->sn_iocnt, iocnt, iocnt - 1) != iocnt);
+}
+
+/*
+ * spec_io_drain(vp, sn)
+ *
+ *	Wait for all existing spec_io_enter/exit sections to complete.
+ *	Caller must ensure spec_io_enter will fail at this point.
+ */
+static void
+spec_io_drain(struct vnode *vp, struct specnode *sn)
+{
+
+	/*
+	 * I/O at the same time as closing is unlikely -- it often
+	 * indicates an application bug.
+	 */
+	if (__predict_true(atomic_load_relaxed(&sn->sn_iocnt) == 0))
+		return;
+
+	mutex_enter(&device_lock);
+	while (atomic_load_relaxed(&sn->sn_iocnt) != 0)
+		cv_wait(&specfs_iocv, &device_lock);
+	mutex_exit(&device_lock);
 }
 
 /*
@@ -261,6 +380,7 @@ spec_node_init(vnode_t *vp, dev_t rdev)
 	sn->sn_opencnt = 0;
 	sn->sn_rdev = rdev;
 	sn->sn_gone = false;
+	sn->sn_iocnt = 0;
 	vp->v_specnode = sn;
 	vp->v_specnext = *vpp;
 	*vpp = vp;
@@ -437,6 +557,7 @@ spec_node_destroy(vnode_t *vp)
 	KASSERT(vp->v_type == VBLK || vp->v_type == VCHR);
 	KASSERT(vp->v_specnode != NULL);
 	KASSERT(sn->sn_opencnt == 0);
+	KASSERT(sn->sn_iocnt == 0);
 
 	mutex_enter(&device_lock);
 	/* Remove from the hash and destroy the node. */
@@ -500,29 +621,27 @@ spec_open(void *v)
 		int  a_mode;
 		kauth_cred_t a_cred;
 	} */ *ap = v;
-	struct lwp *l;
-	struct vnode *vp;
+	struct lwp *l = curlwp;
+	struct vnode *vp = ap->a_vp;
 	dev_t dev;
 	int error;
 	enum kauth_device_req req;
 	specnode_t *sn;
 	specdev_t *sd;
 	spec_ioctl_t ioctl;
-	u_int gen;
-	const char *name;
+	u_int gen = 0;
+	const char *name = NULL;
 	struct partinfo pi;
-	
-	l = curlwp;
-	vp = ap->a_vp;
-	dev = vp->v_rdev;
-	sn = vp->v_specnode;
-	sd = sn->sn_dev;
-	name = NULL;
-	gen = 0;
 
+	KASSERT(VOP_ISLOCKED(vp) == LK_EXCLUSIVE ||
+	    (vp->v_vflag & VV_LOCKSWORK) == 0);
 	KASSERTMSG(vp->v_type == VBLK || vp->v_type == VCHR, "type=%d",
 	    vp->v_type);
 
+	dev = vp->v_rdev;
+	sn = vp->v_specnode;
+	sd = sn->sn_dev;
+
 	/*
 	 * Don't allow open if fs is mounted -nodev.
 	 */
@@ -761,6 +880,8 @@ spec_read(void *v)
 	struct vnode *vp = ap->a_vp;
 	struct uio *uio = ap->a_uio;
  	struct lwp *l = curlwp;
+	struct specnode *sn;
+	dev_t dev;
 	struct buf *bp;
 	daddr_t bn;
 	int bsize, bscale;
@@ -783,9 +904,27 @@ spec_read(void *v)
 	switch (vp->v_type) {
 
 	case VCHR:
+		/*
+		 * Release the lock while we sleep -- possibly
+		 * indefinitely, if this is, e.g., a tty -- in
+		 * cdev_read, so we don't hold up everything else that
+		 * might want access to the vnode.
+		 *
+		 * But before we issue the read, take an I/O reference
+		 * to the specnode so close will know when we're done
+		 * writing.  Note that the moment we release the lock,
+		 * the vnode's identity may change; hence spec_io_enter
+		 * may fail, and the caller may have a dead vnode on
+		 * their hands, if the file system on which vp lived
+		 * has been unmounted.
+		 */
 		VOP_UNLOCK(vp);
-		error = cdev_read(vp->v_rdev, uio, ap->a_ioflag);
-		vn_lock(vp, LK_SHARED | LK_RETRY);
+		error = spec_io_enter(vp, &sn, &dev);
+		if (error)
+			goto out;
+		error = cdev_read(dev, uio, ap->a_ioflag);
+		spec_io_exit(vp, sn);
+out:		vn_lock(vp, LK_SHARED | LK_RETRY);
 		return (error);
 
 	case VBLK:
@@ -862,6 +1001,8 @@ spec_write(void *v)
 	struct vnode *vp = ap->a_vp;
 	struct uio *uio = ap->a_uio;
 	struct lwp *l = curlwp;
+	struct specnode *sn;
+	dev_t dev;
 	struct buf *bp;
 	daddr_t bn;
 	int bsize, bscale;
@@ -877,9 +1018,27 @@ spec_write(void *v)
 	switch (vp->v_type) {
 
 	case VCHR:
+		/*
+		 * Release the lock while we sleep -- possibly
+		 * indefinitely, if this is, e.g., a tty -- in
+		 * cdev_write, so we don't hold up everything else that
+		 * might want access to the vnode.
+		 *
+		 * But before we issue the write, take an I/O reference
+		 * to the specnode so close will know when we're done
+		 * writing.  Note that the moment we release the lock,
+		 * the vnode's identity may change; hence spec_io_enter
+		 * may fail, and the caller may have a dead vnode on
+		 * their hands, if the file system on which vp lived
+		 * has been unmounted.
+		 */
 		VOP_UNLOCK(vp);
-		error = cdev_write(vp->v_rdev, uio, ap->a_ioflag);
-		vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
+		error = spec_io_enter(vp, &sn, &dev);
+		if (error)
+			goto out;
+		error = cdev_write(dev, uio, ap->a_ioflag);
+		spec_io_exit(vp, sn);
+out:		vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
 		return (error);
 
 	case VBLK:
@@ -937,10 +1096,12 @@ spec_fdiscard(void *v)
 		off_t a_pos;
 		off_t a_len;
 	} */ *ap = v;
-	struct vnode *vp;
+	struct vnode *vp = ap->a_vp;
 	dev_t dev;
 
-	vp = ap->a_vp;
+	KASSERT(VOP_ISLOCKED(vp) == LK_EXCLUSIVE ||
+	    (vp->v_vflag & VV_LOCKSWORK) == 0);
+
 	dev = vp->v_rdev;
 
 	switch (vp->v_type) {
@@ -970,40 +1131,32 @@ spec_ioctl(void *v)
 		int  a_fflag;
 		kauth_cred_t a_cred;
 	} */ *ap = v;
-	struct vnode *vp;
+	struct vnode *vp = ap->a_vp;
+	struct specnode *sn;
 	dev_t dev;
+	int error;
 
-	/*
-	 * Extract all the info we need from the vnode, taking care to
-	 * avoid a race with VOP_REVOKE().
-	 */
-
-	vp = ap->a_vp;
-	dev = NODEV;
-	mutex_enter(vp->v_interlock);
-	if (vdead_check(vp, VDEAD_NOWAIT) == 0 && vp->v_specnode) {
-		dev = vp->v_rdev;
-	}
-	mutex_exit(vp->v_interlock);
-	if (dev == NODEV) {
-		return ENXIO;
-	}
+	error = spec_io_enter(vp, &sn, &dev);
+	if (error)
+		return error;
 
 	switch (vp->v_type) {
-
 	case VCHR:
-		return cdev_ioctl(dev, ap->a_command, ap->a_data,
+		error = cdev_ioctl(dev, ap->a_command, ap->a_data,
 		    ap->a_fflag, curlwp);
-
+		break;
 	case VBLK:
 		KASSERT(vp == vp->v_specnode->sn_dev->sd_bdevvp);
-		return bdev_ioctl(dev, ap->a_command, ap->a_data,
+		error = bdev_ioctl(dev, ap->a_command, ap->a_data,
 		   ap->a_fflag, curlwp);
-
+		break;
 	default:
 		panic("spec_ioctl");
 		/* NOTREACHED */
 	}
+
+	spec_io_exit(vp, sn);
+	return error;
 }
 
 /* ARGSUSED */
@@ -1014,33 +1167,25 @@ spec_poll(void *v)
 		struct vnode *a_vp;
 		int a_events;
 	} */ *ap = v;
-	struct vnode *vp;
+	struct vnode *vp = ap->a_vp;
+	struct specnode *sn;
 	dev_t dev;
+	int revents;
 
-	/*
-	 * Extract all the info we need from the vnode, taking care to
-	 * avoid a race with VOP_REVOKE().
-	 */
-
-	vp = ap->a_vp;
-	dev = NODEV;
-	mutex_enter(vp->v_interlock);
-	if (vdead_check(vp, VDEAD_NOWAIT) == 0 && vp->v_specnode) {
-		dev = vp->v_rdev;
-	}
-	mutex_exit(vp->v_interlock);
-	if (dev == NODEV) {
+	if (spec_io_enter(vp, &sn, &dev) != 0)
 		return POLLERR;
-	}
 
 	switch (vp->v_type) {
-
 	case VCHR:
-		return cdev_poll(dev, ap->a_events, curlwp);
-
+		revents = cdev_poll(dev, ap->a_events, curlwp);
+		break;
 	default:
-		return (genfs_poll(v));
+		revents = genfs_poll(v);
+		break;
 	}
+
+	spec_io_exit(vp, sn);
+	return revents;
 }
 
 /* ARGSUSED */
@@ -1051,20 +1196,30 @@ spec_kqfilter(void *v)
 		struct vnode	*a_vp;
 		struct proc	*a_kn;
 	} */ *ap = v;
+	struct vnode *vp = ap->a_vp;
+	struct specnode *sn;
 	dev_t dev;
+	int error;
 
-	switch (ap->a_vp->v_type) {
+	error = spec_io_enter(vp, &sn, &dev);
+	if (error)
+		return error;
 
+	switch (vp->v_type) {
 	case VCHR:
-		dev = ap->a_vp->v_rdev;
-		return cdev_kqfilter(dev, ap->a_kn);
+		error = cdev_kqfilter(dev, ap->a_kn);
+		break;
 	default:
 		/*
 		 * Block devices don't support kqfilter, and refuse it
 		 * for any other files (like those vflush()ed) too.
 		 */
-		return (EOPNOTSUPP);
+		error = EOPNOTSUPP;
+		break;
 	}
+
+	spec_io_exit(vp, sn);
+	return error;
 }
 
 /*
@@ -1079,11 +1234,19 @@ spec_mmap(void *v)
 		kauth_cred_t a_cred;
 	} */ *ap = v;
 	struct vnode *vp = ap->a_vp;
+	struct specnode *sn;
+	dev_t dev;
+	int error;
 
 	KASSERT(vp->v_type == VBLK);
-	if (bdev_type(vp->v_rdev) != D_DISK)
-		return EINVAL;
 
+	error = spec_io_enter(vp, &sn, &dev);
+	if (error)
+		return error;
+
+	error = bdev_type(dev) == D_DISK ? 0 : EINVAL;
+
+	spec_io_exit(vp, sn);
 	return 0;
 }
 
@@ -1128,27 +1291,14 @@ spec_strategy(void *v)
 	} */ *ap = v;
 	struct vnode *vp = ap->a_vp;
 	struct buf *bp = ap->a_bp;
+	struct specnode *sn = NULL;
 	dev_t dev;
 	int error;
 
-	dev = NODEV;
-
-	/*
-	 * Extract all the info we need from the vnode, taking care to
-	 * avoid a race with VOP_REVOKE().
-	 */
-
-	mutex_enter(vp->v_interlock);
-	if (vdead_check(vp, VDEAD_NOWAIT) == 0 && vp->v_specnode != NULL) {
-		KASSERT(vp == vp->v_specnode->sn_dev->sd_bdevvp);
-		dev = vp->v_rdev;
-	}
-	mutex_exit(vp->v_interlock);
-
-	if (dev == NODEV) {
-		error = ENXIO;
+	error = spec_io_enter(vp, &sn, &dev);
+	if (error)
 		goto out;
-	}
+
 	bp->b_dev = dev;
 
 	if (!(bp->b_flags & B_READ)) {
@@ -1168,13 +1318,15 @@ spec_strategy(void *v)
 	}
 	bdev_strategy(bp);
 
-	return 0;
-
-out:
-	bp->b_error = error;
-	bp->b_resid = bp->b_bcount;
-	biodone(bp);
+	error = 0;
 
+out:	if (sn)
+		spec_io_exit(vp, sn);
+	if (error) {
+		bp->b_error = error;
+		bp->b_resid = bp->b_bcount;
+		biodone(bp);
+	}
 	return error;
 }
 
@@ -1243,14 +1395,18 @@ spec_close(void *v)
 	} */ *ap = v;
 	struct vnode *vp = ap->a_vp;
 	struct session *sess;
-	dev_t dev = vp->v_rdev;
+	dev_t dev;
 	int flags = ap->a_fflag;
 	int mode, error, count;
 	specnode_t *sn;
 	specdev_t *sd;
 
+	KASSERT(VOP_ISLOCKED(vp) == LK_EXCLUSIVE ||
+	    (vp->v_vflag & VV_LOCKSWORK) == 0);
+
 	mutex_enter(vp->v_interlock);
 	sn = vp->v_specnode;
+	dev = vp->v_rdev;
 	sd = sn->sn_dev;
 	/*
 	 * If we're going away soon, make this non-blocking.
@@ -1362,6 +1518,12 @@ spec_close(void *v)
 	else
 		error = cdev_close(dev, flags, mode, curlwp);
 
+	/*
+	 * Wait for all other devsw operations to drain.  After this
+	 * point, no bdev/cdev_* can be active for this specnode.
+	 */
+	spec_io_drain(vp, sn);
+
 	if (!(flags & FNONBLOCK))
 		vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
 
diff --git a/sys/miscfs/specfs/specdev.h b/sys/miscfs/specfs/specdev.h
index 621c103e80b1..f8af6579f893 100644
--- a/sys/miscfs/specfs/specdev.h
+++ b/sys/miscfs/specfs/specdev.h
@@ -69,6 +69,7 @@ typedef struct specnode {
 	u_int		sn_opencnt;
 	dev_t		sn_rdev;
 	bool		sn_gone;
+	volatile u_int	sn_iocnt;
 } specnode_t;
 
 typedef struct specdev {

>From bfbc3fc77539c967aaecdbeab896c654dd45fdb6 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Mon, 24 Jan 2022 10:06:17 +0000
Subject: [PATCH 24/32] specfs: Paranoia: Assert opencnt and iocnt are zero on
 reclaim.

---
 sys/miscfs/specfs/spec_vnops.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index c256183a9925..009a502e3392 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -1352,6 +1352,9 @@ spec_reclaim(void *v)
 	} */ *ap = v;
 	struct vnode *vp = ap->a_vp;
 
+	KASSERT(vp->v_specnode->sn_opencnt == 0);
+	KASSERT(vp->v_specnode->sn_iocnt == 0);
+
 	VOP_UNLOCK(vp);
 
 	KASSERT(vp->v_mount == dead_rootmount);

>From 78bbff17031265cab18a2a816ff23f237b22543e Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Fri, 21 Jan 2022 12:09:14 +0000
Subject: [PATCH 25/32] specfs: Avoid race with revoke freeing specnode during
 spec_open.

---
 sys/miscfs/specfs/spec_vnops.c | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index 009a502e3392..2705fd81f8ef 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -734,11 +734,6 @@ spec_open(void *v)
 	 * many other subsystems, we must not hold the vnode lock while
 	 * calling .d_open, so release it now and reacquire it when
 	 * done.
-	 *
-	 * XXX The vnode may be revoked concurrently while we have the
-	 * vnode lock released.  If this happens, the sn and sd
-	 * pointers may be invalidated, but we don't do anything to
-	 * avoid touching them after we're done here.
 	 */
 	VOP_UNLOCK(vp);
 	switch (vp->v_type) {
@@ -800,7 +795,8 @@ spec_open(void *v)
 	/*
 	 * If it has been revoked since we released the vnode lock and
 	 * reacquired it, then spec_node_revoke has closed it, and we
-	 * must fail with EBADF.
+	 * must fail with EBADF without touching sn or sd which may be
+	 * freed at this point.
 	 *
 	 * Otherwise, if opening it failed, back out and release the
 	 * open reference.
@@ -828,16 +824,20 @@ spec_open(void *v)
 	 *				  but no .d_close (***)
 	 */
 	mutex_enter(&device_lock);
-	if (sn->sn_gone) {
+	mutex_enter(vp->v_interlock);
+	if (vdead_check(vp, VDEAD_NOWAIT) != 0) {
 		if (error == 0)
 			error = EBADF;
-	} else if (error != 0) {
-		sd->sd_opencnt--;
-		sn->sn_opencnt--;
-		if (vp->v_type == VBLK)
-			sd->sd_bdevvp = NULL;
-
+	} else {
+		KASSERT(!sn->sn_gone);
+		if (error != 0) {
+			sd->sd_opencnt--;
+			sn->sn_opencnt--;
+			if (vp->v_type == VBLK)
+				sd->sd_bdevvp = NULL;
+		}
 	}
+	mutex_exit(vp->v_interlock);
 	mutex_exit(&device_lock);
 
 	/* If anything went wrong, we're done.  */

>From 2b47e9711312384370d7f95144b4bf3eb206970b Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Fri, 21 Jan 2022 12:13:49 +0000
Subject: [PATCH 26/32] specfs: Note lock order for vnode lock, device_lock,
 v_interlock.

---
 sys/miscfs/specfs/spec_vnops.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index 2705fd81f8ef..f5f8d5c88f1d 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -86,6 +86,14 @@ __KERNEL_RCSID(0, "$NetBSD: spec_vnops.c,v 1.183 2021/07/18 23:57:14 dholland Ex
 #include <miscfs/genfs/genfs.h>
 #include <miscfs/specfs/specdev.h>
 
+/*
+ * Lock order:
+ *
+ *	vnode lock
+ *	-> device_lock
+ *	-> struct vnode::v_interlock
+ */
+
 /* symbolic sleep message strings for devices */
 const char	devopn[] = "devopn";
 const char	devio[] = "devio";

>From 427b976a22031462b4f6f0cb7fe58285c2d71fdb Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Fri, 21 Jan 2022 14:52:05 +0000
Subject: [PATCH 27/32] specfs: Resolve a race between close and a failing
 reopen.

---
 sys/miscfs/specfs/spec_vnops.c | 64 +++++++++++++++++++++++++++++-----
 sys/miscfs/specfs/specdev.h    |  1 +
 2 files changed, 57 insertions(+), 8 deletions(-)

diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index f5f8d5c88f1d..d6e93b9a42b7 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -377,6 +377,7 @@ spec_node_init(vnode_t *vp, dev_t rdev)
 		sd->sd_refcnt = 1;
 		sd->sd_opencnt = 0;
 		sd->sd_bdevvp = NULL;
+		sd->sd_opened = false;
 		sn->sn_dev = sd;
 		sd = NULL;
 	} else {
@@ -639,6 +640,7 @@ spec_open(void *v)
 	spec_ioctl_t ioctl;
 	u_int gen = 0;
 	const char *name = NULL;
+	bool needclose = false;
 	struct partinfo pi;
 
 	KASSERT(VOP_ISLOCKED(vp) == LK_EXCLUSIVE ||
@@ -807,12 +809,9 @@ spec_open(void *v)
 	 * freed at this point.
 	 *
 	 * Otherwise, if opening it failed, back out and release the
-	 * open reference.
-	 *
-	 * XXX This is wrong -- we might release the last open
-	 * reference here, but we don't close the device.  If only this
-	 * thread's call to open failed, that's fine, but we might
-	 * have:
+	 * open reference.  If it was ever successfully opened and we
+	 * got the last reference this way, it's now our job to close
+	 * it.  This might happen in the following scenario:
 	 *
 	 *	Thread 1		Thread 2
 	 *	VOP_OPEN
@@ -829,7 +828,10 @@ spec_open(void *v)
 	 *	  release vnode lock
 	 *				  acquire vnode lock
 	 *				  --sd_opencnt == 0
-	 *				  but no .d_close (***)
+	 *
+	 * We can't resolve this by making spec_close wait for .d_open
+	 * to complete before examining sd_opencnt, because .d_open can
+	 * hang indefinitely, e.g. for a tty.
 	 */
 	mutex_enter(&device_lock);
 	mutex_enter(vp->v_interlock);
@@ -839,15 +841,40 @@ spec_open(void *v)
 	} else {
 		KASSERT(!sn->sn_gone);
 		if (error != 0) {
-			sd->sd_opencnt--;
+			if (--sd->sd_opencnt == 0 && sd->sd_opened) {
+				needclose = true;
+				sd->sd_opened = false;
+			}
 			sn->sn_opencnt--;
 			if (vp->v_type == VBLK)
 				sd->sd_bdevvp = NULL;
+		} else {
+			sd->sd_opened = true;
 		}
 	}
 	mutex_exit(vp->v_interlock);
 	mutex_exit(&device_lock);
 
+	/*
+	 * If this open failed, but the device was previously opened,
+	 * and another thread concurrently closed the vnode while we
+	 * were in the middle of reopening it, the other thread will
+	 * see sd_opencnt > 0 and thus decide not to call .d_close --
+	 * it is now our responsibility to do so.
+	 *
+	 * XXX The flags passed to VOP_CLOSE here are wrong, but
+	 * drivers can't rely on FREAD|FWRITE anyway -- e.g., consider
+	 * a device opened by thread 0 with O_READ, then opened by
+	 * thread 1 with O_WRITE, then closed by thread 0, and finally
+	 * closed by thread 1; the last .d_close call will have FWRITE
+	 * but not FREAD.  We should just eliminate the FREAD/FWRITE
+	 * parameter to .d_close altogether.
+	 */
+	if (needclose) {
+		KASSERT(error);
+		VOP_CLOSE(vp, FNONBLOCK, NOCRED);
+	}
+
 	/* If anything went wrong, we're done.  */
 	if (error)
 		return error;
@@ -1502,6 +1529,25 @@ spec_close(void *v)
 	 * device.  For block devices, the open reference count must be
 	 * 1 at this point.  If the device's open reference count goes
 	 * to zero, we're the last one out so get the lights.
+	 *
+	 * We may find --sd->sd_opencnt gives zero, and yet
+	 * sd->sd_opened is false.  This happens if the vnode is
+	 * revoked at the same time as it is being opened, which can
+	 * happen when opening a tty blocks indefinitely.  In that
+	 * case, we still must call close -- it is the job of close to
+	 * interrupt the open.  Either way, the device will be no
+	 * longer opened, so we have to clear sd->sd_opened; subsequent
+	 * opens will have responsibility for issuing close.
+	 *
+	 * This has the side effect that the sequence of opens might
+	 * happen out of order -- we might end up doing open, open,
+	 * close, close, instead of open, close, open, close.  This is
+	 * unavoidable with the current devsw API, where open is
+	 * allowed to block and close must be able to run concurrently
+	 * to interrupt it.  It is the driver's responsibility to
+	 * ensure that close is idempotent so that this works.  Drivers
+	 * requiring per-open state and exact 1:1 correspondence
+	 * between open and close can use fd_clone.
 	 */
 	mutex_enter(&device_lock);
 	sn->sn_opencnt--;
@@ -1511,6 +1557,8 @@ spec_close(void *v)
 		    count + 1);
 		sd->sd_bdevvp = NULL;
 	}
+	if (count == 0)
+		sd->sd_opened = false;
 	mutex_exit(&device_lock);
 
 	if (count != 0)
diff --git a/sys/miscfs/specfs/specdev.h b/sys/miscfs/specfs/specdev.h
index f8af6579f893..1abb52964b50 100644
--- a/sys/miscfs/specfs/specdev.h
+++ b/sys/miscfs/specfs/specdev.h
@@ -79,6 +79,7 @@ typedef struct specdev {
 	u_int		sd_opencnt;
 	u_int		sd_refcnt;
 	dev_t		sd_rdev;
+	bool		sd_opened;
 } specdev_t;
 
 /*

>From 38b1f469efd07fbbb86c2593258b629f2511d7e0 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Sun, 23 Jan 2022 19:25:05 +0000
Subject: [PATCH 28/32] specfs: Wait for devsw operations to drain in revoke.

The VOP_CLOSE in spec_node_revoke may run concurrently with the last
spec_close so it doesn't drain anything itself.
---
 sys/miscfs/specfs/spec_vnops.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index d6e93b9a42b7..395de37156f9 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -542,6 +542,13 @@ spec_node_revoke(vnode_t *vp)
 
 		VOP_CLOSE(vp, FNONBLOCK, NOCRED);
 
+		/*
+		 * Wait for all other devsw operations to drain.  After
+		 * this point, no bdev/cdev_* can be active for this
+		 * specnode.
+		 */
+		spec_io_drain(vp, sn);
+
 		mutex_enter(&device_lock);
 		KASSERT(sn->sn_opencnt == 0);
 	}
@@ -1577,12 +1584,6 @@ spec_close(void *v)
 	else
 		error = cdev_close(dev, flags, mode, curlwp);
 
-	/*
-	 * Wait for all other devsw operations to drain.  After this
-	 * point, no bdev/cdev_* can be active for this specnode.
-	 */
-	spec_io_drain(vp, sn);
-
 	if (!(flags & FNONBLOCK))
 		vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
 

>From d25bc733fe1849d763a2df7c333dabccfedb1e16 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Sun, 23 Jan 2022 19:29:42 +0000
Subject: [PATCH 29/32] specfs: Scratch that -- drain unconditionally in
 spec_node_revoke.

There might be a concurrent spec_close happening that has already
dropped sn_opencnt to 0, but we need to _in revoke_ for the devsw
operations to complete.
---
 sys/miscfs/specfs/spec_vnops.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index 395de37156f9..0e07424426e5 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -542,17 +542,18 @@ spec_node_revoke(vnode_t *vp)
 
 		VOP_CLOSE(vp, FNONBLOCK, NOCRED);
 
-		/*
-		 * Wait for all other devsw operations to drain.  After
-		 * this point, no bdev/cdev_* can be active for this
-		 * specnode.
-		 */
-		spec_io_drain(vp, sn);
-
 		mutex_enter(&device_lock);
 		KASSERT(sn->sn_opencnt == 0);
 	}
 	mutex_exit(&device_lock);
+
+	/*
+	 * Wait for all other devsw operations to drain.  After this
+	 * point, no bdev/cdev_* can be active for this specnode.
+	 * Note: We drain even if we witnessed sn->sn_opencnt == 0,
+	 * because there may have been a concurrent VOP_CLOSE.
+	 */
+	spec_io_drain(vp, sn);
 }
 
 /*

>From 70301771a4edb9b6124eaca72747fefd0baf498e Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Sun, 23 Jan 2022 19:30:41 +0000
Subject: [PATCH 30/32] specfs: Take an I/O reference across bdev/cdev_open.

Needed because revoke must wait for all I/O operations, including
pending opens (which can hang indefinitely, e.g. ttys), to stop using
data structures before detach can free them.
---
 sys/miscfs/specfs/spec_vnops.c | 38 +++++++++++++++++++++++++++-------
 1 file changed, 31 insertions(+), 7 deletions(-)

diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index 0e07424426e5..ee07a8a91659 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -640,10 +640,10 @@ spec_open(void *v)
 	} */ *ap = v;
 	struct lwp *l = curlwp;
 	struct vnode *vp = ap->a_vp;
-	dev_t dev;
+	dev_t dev, dev1;
 	int error;
 	enum kauth_device_req req;
-	specnode_t *sn;
+	specnode_t *sn, *sn1;
 	specdev_t *sd;
 	spec_ioctl_t ioctl;
 	u_int gen = 0;
@@ -742,18 +742,34 @@ spec_open(void *v)
 	}
 
 	/*
-	 * Open the device.  If .d_open returns ENXIO (device not
-	 * configured), the driver may not be loaded, so try
-	 * autoloading a module and then try .d_open again if anything
-	 * got loaded.
-	 *
 	 * Because the opening the device may block indefinitely,
 	 * e.g. when opening a tty, and loading a module may cross into
 	 * many other subsystems, we must not hold the vnode lock while
 	 * calling .d_open, so release it now and reacquire it when
 	 * done.
+	 *
+	 * Take an I/O reference so that any concurrent
+	 * spec_node_revoke will wait for us to finish calling .d_open.
+	 * The vnode can't be dead at this point because we have it
+	 * locked.  Note that if revoked, the driver will interrupt
+	 * .d_open before spec_node_revoke starts waiting for I/O to
+	 * drain so this doesn't deadlock.
 	 */
 	VOP_UNLOCK(vp);
+	error = spec_io_enter(vp, &sn1, &dev1);
+	if (error) {
+		vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
+		return error;
+	}
+	KASSERT(sn1 == sn);
+	KASSERT(dev1 == dev);
+
+	/*
+	 * Open the device.  If .d_open returns ENXIO (device not
+	 * configured), the driver may not be loaded, so try
+	 * autoloading a module and then try .d_open again if anything
+	 * got loaded.
+	 */
 	switch (vp->v_type) {
 	case VCHR:
 		do {
@@ -808,6 +824,14 @@ spec_open(void *v)
 	default:
 		__unreachable();
 	}
+
+	/*
+	 * Release the I/O reference now that we have called .d_open,
+	 * and reacquire the vnode lock.  At this point, the device may
+	 * have been revoked, so we must tread carefully -- can't touch
+	 * sn or sd until we verify the vnode is not dead.
+	 */
+	spec_io_exit(vp, sn);
 	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
 
 	/*

>From 92a526b7f5f73b72b259a6f824c412dc2a2c787e Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Sun, 23 Jan 2022 19:41:12 +0000
Subject: [PATCH 31/32] specfs: Take an I/O reference in the blocking
 spec_close path too.

Concurrent spec_node_revoke has to wait for this to make sure the
last close has completed before detach can free data structures it
might use.

It is unfortunate that there _is_ a blocking path at all -- it can
hold up the entire file system while we detach.
---
 sys/miscfs/specfs/spec_vnops.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index ee07a8a91659..a56d0d491ae6 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -1465,10 +1465,10 @@ spec_close(void *v)
 	} */ *ap = v;
 	struct vnode *vp = ap->a_vp;
 	struct session *sess;
-	dev_t dev;
+	dev_t dev, dev1;
 	int flags = ap->a_fflag;
 	int mode, error, count;
-	specnode_t *sn;
+	specnode_t *sn, *sn1;
 	specdev_t *sd;
 
 	KASSERT(VOP_ISLOCKED(vp) == LK_EXCLUSIVE ||
@@ -1601,16 +1601,26 @@ spec_close(void *v)
 	 * might end up sleeping for someone else who wants our queues. They
 	 * won't get them if we hold the vnode locked.
 	 */
-	if (!(flags & FNONBLOCK))
+	if (!(flags & FNONBLOCK)) {
 		VOP_UNLOCK(vp);
+		error = spec_io_enter(vp, &sn1, &dev1);
+		if (error) {
+			vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
+			return error;
+		}
+		KASSERT(sn1 == sn);
+		KASSERT(dev1 == dev);
+	}
 
 	if (vp->v_type == VBLK)
 		error = bdev_close(dev, flags, mode, curlwp);
 	else
 		error = cdev_close(dev, flags, mode, curlwp);
 
-	if (!(flags & FNONBLOCK))
+	if (!(flags & FNONBLOCK)) {
+		spec_io_exit(vp, sn1);
 		vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
+	}
 
 	return (error);
 }

>From aba70cb45e823fbda47be0770558510113c617f0 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Sun, 23 Jan 2022 20:05:52 +0000
Subject: [PATCH 32/32] specfs: In spec_close, take the I/O reference _before_
 VOP_UNLOCK.

This guarantees that spec_node_revoke will wait for the last normal
close -- there's no race between VOP_UNLOCK and spec_io_enter where
spec_node_revoke thinks the device is closed (sn_opencnt=0) and
there's no I/O references.

This is safe because spec_io_enter never blocks; it's only forbidden
to take vn_lock while holding spec_io_enter.
---
 sys/miscfs/specfs/spec_vnops.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index a56d0d491ae6..987cd5969d95 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -1602,14 +1602,16 @@ spec_close(void *v)
 	 * won't get them if we hold the vnode locked.
 	 */
 	if (!(flags & FNONBLOCK)) {
-		VOP_UNLOCK(vp);
+		/*
+		 * Take an I/O reference while we hold the vnode lock.
+		 * This ensures that spec_node_revoke waits for the
+		 * last close.
+		 */
 		error = spec_io_enter(vp, &sn1, &dev1);
-		if (error) {
-			vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
-			return error;
-		}
+		KASSERTMSG(error == 0, "error=%d", error);
 		KASSERT(sn1 == sn);
 		KASSERT(dev1 == dev);
+		VOP_UNLOCK(vp);
 	}
 
 	if (vp->v_type == VBLK)
diff --git a/external/cddl/osnet/dev/dtrace/dtrace_modevent.c b/external/cddl/osnet/dev/dtrace/dtrace_modevent.c
index cc0f8103fb98..f1338840125a 100644
--- a/external/cddl/osnet/dev/dtrace/dtrace_modevent.c
+++ b/external/cddl/osnet/dev/dtrace/dtrace_modevent.c
@@ -42,9 +42,7 @@ dtrace_modcmd(modcmd_t cmd, void *data)
 		return error;
 
 	case MODULE_CMD_FINI:
-		error = devsw_detach(NULL, &dtrace_cdevsw);
-		if (error != 0)
-			return error;
+		devsw_detach(NULL, &dtrace_cdevsw);
 
 		error = dtrace_unload();
 		if (error != 0) {
diff --git a/external/cddl/osnet/dev/fbt/fbt.c b/external/cddl/osnet/dev/fbt/fbt.c
index b367c2155292..46dd7c1f7f06 100644
--- a/external/cddl/osnet/dev/fbt/fbt.c
+++ b/external/cddl/osnet/dev/fbt/fbt.c
@@ -1329,7 +1329,8 @@ dtrace_fbt_modcmd(modcmd_t cmd, void *data)
 		error = fbt_unload();
 		if (error != 0)
 			return error;
-		return devsw_detach(NULL, &fbt_cdevsw);
+		devsw_detach(NULL, &fbt_cdevsw);
+		return 0;
 	case MODULE_CMD_AUTOUNLOAD:
 		return EBUSY;
 	default:
diff --git a/external/cddl/osnet/dev/sdt/sdt.c b/external/cddl/osnet/dev/sdt/sdt.c
index c3ad129f8284..5a41270a2917 100644
--- a/external/cddl/osnet/dev/sdt/sdt.c
+++ b/external/cddl/osnet/dev/sdt/sdt.c
@@ -562,7 +562,8 @@ dtrace_sdt_modcmd(modcmd_t cmd, void *data)
 		error = sdt_unload();
 		if (error != 0)
 			return error;
-		return devsw_detach(NULL, &sdt_cdevsw);
+		devsw_detach(NULL, &sdt_cdevsw);
+		return 0;
 	case MODULE_CMD_AUTOUNLOAD:
 		return EBUSY;
 	default:
diff --git a/external/cddl/osnet/dist/uts/common/fs/zfs/zfs_ioctl.c b/external/cddl/osnet/dist/uts/common/fs/zfs/zfs_ioctl.c
index 9e19cd1dc0c3..d74d8c71e54d 100644
--- a/external/cddl/osnet/dist/uts/common/fs/zfs/zfs_ioctl.c
+++ b/external/cddl/osnet/dist/uts/common/fs/zfs/zfs_ioctl.c
@@ -7231,7 +7231,7 @@ zfs_modcmd(modcmd_t cmd, void *arg)
 		if (error)
 			return error;
 
-		(void) devsw_detach(&zfs_bdevsw, &zfs_cdevsw);
+		devsw_detach(&zfs_bdevsw, &zfs_cdevsw);
 
 attacherr:
 		zfs_sysctl_fini();
diff --git a/share/man/man9/devsw_attach.9 b/share/man/man9/devsw_attach.9
index cf862be5846a..6ffc51957a3f 100644
--- a/share/man/man9/devsw_attach.9
+++ b/share/man/man9/devsw_attach.9
@@ -27,7 +27,7 @@
 .\" ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 .\" POSSIBILITY OF SUCH DAMAGE.
 .\"
-.Dd April 30, 2017
+.Dd January 11, 2022
 .Dt DEVSW 9
 .Os
 .Sh NAME
@@ -49,7 +49,7 @@
 .Fa "const struct cdevsw *cdev"
 .Fa "devmajor_t *cmajor"
 .Fc
-.Ft int
+.Ft void
 .Fo devsw_detach
 .Fa "const struct bdevsw *bdev"
 .Fa "const struct cdevsw *cdev"
@@ -130,6 +130,11 @@ and
 structures.
 .Fn devsw_detach
 should be called before a loaded device driver is unloaded.
+The caller must ensure that there are no open instances of the device,
+and that the device's
+.Fn d_open
+function will fail, before calling.
+Fn devsw_detach .
 .Pp
 The
 .Fn bdevsw_lookup
@@ -155,10 +160,8 @@ or
 .Sh RETURN VALUES
 Upon successful completion,
 .Fn devsw_attach
-and
-.Fn devsw_detach
-return 0.
-Otherwise they return an error value.
+returns 0.
+Otherwise it returns an error value.
 .Pp
 In case of failure,
 .Fn bdevsw_lookup
diff --git a/sys/coda/coda_psdev.c b/sys/coda/coda_psdev.c
index cede16da3f53..7f531f03fe56 100644
--- a/sys/coda/coda_psdev.c
+++ b/sys/coda/coda_psdev.c
@@ -758,7 +758,7 @@ vcoda_modcmd(modcmd_t cmd, void *arg)
 				if (VC_OPEN(vcp))
 					return EBUSY;
 			}
-			return devsw_detach(NULL, &vcoda_cdevsw);
+			devsw_detach(NULL, &vcoda_cdevsw);
 		}
 #endif
 		break;
diff --git a/sys/dev/ccd.c b/sys/dev/ccd.c
index 05945f9a67ba..2283bc0346da 100644
--- a/sys/dev/ccd.c
+++ b/sys/dev/ccd.c
@@ -1710,7 +1710,7 @@ ccd_modcmd(modcmd_t cmd, void *arg)
 			error = EBUSY;
 		} else {
 			mutex_exit(&ccd_lock);
-			error = devsw_detach(&ccd_bdevsw, &ccd_cdevsw);
+			devsw_detach(&ccd_bdevsw, &ccd_cdevsw);
 			ccddetach();
 		}
 #endif
diff --git a/sys/dev/clockctl.c b/sys/dev/clockctl.c
index 0da5e7765fe8..9685c0f129f6 100644
--- a/sys/dev/clockctl.c
+++ b/sys/dev/clockctl.c
@@ -182,14 +182,12 @@ clockctl_modcmd(modcmd_t cmd, void *data)
 			return EBUSY;
 		}
 #ifdef _MODULE
-		error = devsw_detach(NULL, &clockctl_cdevsw);
+		devsw_detach(NULL, &clockctl_cdevsw);
 #endif
 		mutex_exit(&clockctl_mtx);
 
-		if (error == 0) {
-			kauth_unlisten_scope(clockctl_listener);
-			mutex_destroy(&clockctl_mtx);
-		}
+		kauth_unlisten_scope(clockctl_listener);
+		mutex_destroy(&clockctl_mtx);
 		break;
 
 	default:
diff --git a/sys/dev/hdaudio/hdaudio.c b/sys/dev/hdaudio/hdaudio.c
index d39ff2db6cde..5c7874778e22 100644
--- a/sys/dev/hdaudio/hdaudio.c
+++ b/sys/dev/hdaudio/hdaudio.c
@@ -1636,11 +1636,7 @@ hdaudio_modcmd(modcmd_t cmd, void *opaque)
 		error = config_cfdriver_detach(&hdaudio_cd);
 		if (error)
 			break;
-		error = devsw_detach(NULL, &hdaudio_cdevsw);
-		if (error) {
-			config_cfdriver_attach(&hdaudio_cd);
-			break;
-		}
+		devsw_detach(NULL, &hdaudio_cdevsw);
 #endif
 		break;
 	default:
diff --git a/sys/dev/i2c/i2c.c b/sys/dev/i2c/i2c.c
index 6f2c0c6a9698..36a3e87d5316 100644
--- a/sys/dev/i2c/i2c.c
+++ b/sys/dev/i2c/i2c.c
@@ -942,7 +942,7 @@ iic_modcmd(modcmd_t cmd, void *opaque)
 		if (error) {
 			aprint_error("%s: unable to init component\n",
 			    iic_cd.cd_name);
-			(void)devsw_detach(NULL, &iic_cdevsw);
+			devsw_detach(NULL, &iic_cdevsw);
 		}
 		mutex_exit(&iic_mtx);
 #endif
@@ -960,10 +960,7 @@ iic_modcmd(modcmd_t cmd, void *opaque)
 			mutex_exit(&iic_mtx);
 			break;
 		}
-		error = devsw_detach(NULL, &iic_cdevsw);
-		if (error != 0)
-			config_init_component(cfdriver_ioconf_iic,
-			    cfattach_ioconf_iic, cfdata_ioconf_iic);
+		devsw_detach(NULL, &iic_cdevsw);
 #endif
 		mutex_exit(&iic_mtx);
 		break;
diff --git a/sys/dev/pad/pad.c b/sys/dev/pad/pad.c
index fe0b429cf386..a779f1f71b8d 100644
--- a/sys/dev/pad/pad.c
+++ b/sys/dev/pad/pad.c
@@ -777,9 +777,7 @@ pad_modcmd(modcmd_t cmd, void *arg)
 
 	case MODULE_CMD_FINI:
 #ifdef _MODULE
-		error = devsw_detach(NULL, &pad_cdevsw);
-		if (error)
-			break;
+		devsw_detach(NULL, &pad_cdevsw);
 
 		error = config_fini_component(cfdriver_ioconf_pad,
 		    pad_cfattach, cfdata_ioconf_pad);
diff --git a/sys/dev/raidframe/rf_netbsdkintf.c b/sys/dev/raidframe/rf_netbsdkintf.c
index d1bda3553e03..87439aa70bfb 100644
--- a/sys/dev/raidframe/rf_netbsdkintf.c
+++ b/sys/dev/raidframe/rf_netbsdkintf.c
@@ -4088,16 +4088,7 @@ raid_modcmd_fini(void)
 		return error;
 	}
 #endif
-	error = devsw_detach(&raid_bdevsw, &raid_cdevsw);
-	if (error != 0) {
-		aprint_error("%s: cannot detach devsw\n",__func__);
-#ifdef _MODULE
-		config_cfdriver_attach(&raid_cd);
-#endif
-		config_cfattach_attach(raid_cd.cd_name, &raid_ca);
-		mutex_exit(&raid_lock);
-		return error;
-	}
+	devsw_detach(&raid_bdevsw, &raid_cdevsw);
 	rf_BootRaidframe(false);
 #if (RF_INCLUDE_PARITY_DECLUSTERING_DS > 0)
 	rf_destroy_mutex2(rf_sparet_wait_mutex);
diff --git a/sys/dev/sysmon/sysmon.c b/sys/dev/sysmon/sysmon.c
index 46aedaad7337..fd2993f0c180 100644
--- a/sys/dev/sysmon/sysmon.c
+++ b/sys/dev/sysmon/sysmon.c
@@ -356,7 +356,7 @@ sysmon_fini(void)
 	if (error == 0) {
 		mutex_enter(&sysmon_minor_mtx);
 		sm_is_attached = false;
-		error = devsw_detach(NULL, &sysmon_cdevsw);
+		devsw_detach(NULL, &sysmon_cdevsw);
 		mutex_exit(&sysmon_minor_mtx);
 	}
 #endif
diff --git a/sys/dev/tprof/tprof.c b/sys/dev/tprof/tprof.c
index b069a5b7df5d..136fd190ad14 100644
--- a/sys/dev/tprof/tprof.c
+++ b/sys/dev/tprof/tprof.c
@@ -768,13 +768,7 @@ tprof_modcmd(modcmd_t cmd, void *arg)
 
 	case MODULE_CMD_FINI:
 #if defined(_MODULE)
-		{
-			int error;
-			error = devsw_detach(NULL, &tprof_cdevsw);
-			if (error) {
-				return error;
-			}
-		}
+		devsw_detach(NULL, &tprof_cdevsw);
 #endif /* defined(_MODULE) */
 		tprof_driver_fini();
 		return 0;
diff --git a/sys/dist/pf/net/pf_ioctl.c b/sys/dist/pf/net/pf_ioctl.c
index 94bfb70a411d..e4c13be698f8 100644
--- a/sys/dist/pf/net/pf_ioctl.c
+++ b/sys/dist/pf/net/pf_ioctl.c
@@ -3459,7 +3459,8 @@ pf_modcmd(modcmd_t cmd, void *opaque)
 		} else {
 			pfdetach();
 			pflogdetach();
-			return devsw_detach(NULL, &pf_cdevsw);
+			devsw_detach(NULL, &pf_cdevsw);
+			return 0;
 		}
 	default:
 		return ENOTTY;
diff --git a/sys/external/bsd/ipf/netinet/ip_fil_netbsd.c b/sys/external/bsd/ipf/netinet/ip_fil_netbsd.c
index d0c4ca95097c..bb4e0706cc9b 100644
--- a/sys/external/bsd/ipf/netinet/ip_fil_netbsd.c
+++ b/sys/external/bsd/ipf/netinet/ip_fil_netbsd.c
@@ -2256,7 +2256,7 @@ ipl_fini(void *opaque)
 {
 
 #ifdef _MODULE
-	(void)devsw_detach(NULL, &ipl_cdevsw);
+	devsw_detach(NULL, &ipl_cdevsw);
 #endif
 
 	/*
diff --git a/sys/fs/autofs/autofs_vfsops.c b/sys/fs/autofs/autofs_vfsops.c
index fbd6eafe6532..1204d1f9b6d3 100644
--- a/sys/fs/autofs/autofs_vfsops.c
+++ b/sys/fs/autofs/autofs_vfsops.c
@@ -496,9 +496,7 @@ autofs_modcmd(modcmd_t cmd, void *arg)
 		}
 		mutex_exit(&autofs_softc->sc_lock);
 
-		error = devsw_detach(NULL, &autofs_cdevsw);
-		if (error)
-			break;
+		devsw_detach(NULL, &autofs_cdevsw);
 #endif
 		error = vfs_detach(&autofs_vfsops);
 		if (error)
diff --git a/sys/kern/kern_drvctl.c b/sys/kern/kern_drvctl.c
index 37f4730b2512..8a4156f8a0aa 100644
--- a/sys/kern/kern_drvctl.c
+++ b/sys/kern/kern_drvctl.c
@@ -665,15 +665,10 @@ drvctl_modcmd(modcmd_t cmd, void *arg)
 		devmon_insert_vec = saved_insert_vec;
 		saved_insert_vec = NULL;
 #ifdef _MODULE
-		error = devsw_detach(NULL, &drvctl_cdevsw);
-		if (error != 0) {
-			saved_insert_vec = devmon_insert_vec;
-			devmon_insert_vec = devmon_insert;
-		}
+		devsw_detach(NULL, &drvctl_cdevsw);
 #endif
 		mutex_exit(&drvctl_lock);
-		if (error == 0)
-			drvctl_fini();
+		drvctl_fini();
 
 		break;
 	default:
diff --git a/sys/kern/subr_autoconf.c b/sys/kern/subr_autoconf.c
index 8439ae8c6a00..8c0b66556e0f 100644
--- a/sys/kern/subr_autoconf.c
+++ b/sys/kern/subr_autoconf.c
@@ -108,6 +108,7 @@ __KERNEL_RCSID(0, "$NetBSD: subr_autoconf.c,v 1.291 2021/12/31 14:19:57 riastrad
 #include <sys/cpu.h>
 #include <sys/sysctl.h>
 #include <sys/stdarg.h>
+#include <sys/localcount.h>
 
 #include <sys/disk.h>
 
@@ -1453,6 +1454,9 @@ config_devdelete(device_t dev)
 	if (dg->dg_devs != NULL)
 		kmem_free(dg->dg_devs, sizeof(device_t) * dg->dg_ndevs);
 
+	localcount_fini(dev->dv_localcount);
+	kmem_free(dev->dv_localcount, sizeof(*dev->dv_localcount));
+
 	cv_destroy(&dvl->dvl_cv);
 	mutex_destroy(&dvl->dvl_mtx);
 
@@ -1556,6 +1560,7 @@ config_devalloc(const device_t parent, const cfdata_t cf,
 	dev->dv_activity_handlers = NULL;
 	dev->dv_private = dev_private;
 	dev->dv_flags = ca->ca_flags;	/* inherit flags from class */
+	dev->dv_attaching = curlwp;
 
 	myunit = config_unit_alloc(dev, cd, cf);
 	if (myunit == -1) {
@@ -1604,6 +1609,10 @@ config_devalloc(const device_t parent, const cfdata_t cf,
 		    "device-parent", device_xname(parent));
 	}
 
+	dev->dv_localcount = kmem_zalloc(sizeof(*dev->dv_localcount),
+	    KM_SLEEP);
+	localcount_init(dev->dv_localcount);
+
 	if (dev->dv_cfdriver->cd_attrs != NULL)
 		config_add_attrib_dict(dev);
 
@@ -1755,8 +1764,29 @@ config_attach_internal(device_t parent, cfdata_t cf, void *aux, cfprint_t print,
 	/* Let userland know */
 	devmon_report_device(dev, true);
 
+	/*
+	 * Prevent detach until the driver's attach function, and all
+	 * deferred actions, have finished.
+	 */
 	config_pending_incr(dev);
+
+	/* Call the driver's attach function.  */
 	(*dev->dv_cfattach->ca_attach)(parent, dev, aux);
+
+	/*
+	 * Allow other threads to acquire references to the device now
+	 * that the driver's attach function is done.
+	 */
+	mutex_enter(&config_misc_lock);
+	KASSERT(dev->dv_attaching == curlwp);
+	dev->dv_attaching = NULL;
+	cv_broadcast(&config_misc_cv);
+	mutex_exit(&config_misc_lock);
+
+	/*
+	 * Synchronous parts of attach are done.  Allow detach, unless
+	 * the driver's attach function scheduled deferred actions.
+	 */
 	config_pending_decr(dev);
 
 	mutex_enter(&config_misc_lock);
@@ -1822,8 +1852,29 @@ config_attach_pseudo(cfdata_t cf)
 	/* Let userland know */
 	devmon_report_device(dev, true);
 
+	/*
+	 * Prevent detach until the driver's attach function, and all
+	 * deferred actions, have finished.
+	 */
 	config_pending_incr(dev);
+
+	/* Call the driver's attach function.  */
 	(*dev->dv_cfattach->ca_attach)(ROOT, dev, NULL);
+
+	/*
+	 * Allow other threads to acquire references to the device now
+	 * that the driver's attach function is done.
+	 */
+	mutex_enter(&config_misc_lock);
+	KASSERT(dev->dv_attaching == curlwp);
+	dev->dv_attaching = NULL;
+	cv_broadcast(&config_misc_cv);
+	mutex_exit(&config_misc_lock);
+
+	/*
+	 * Synchronous parts of attach are done.  Allow detach, unless
+	 * the driver's attach function scheduled deferred actions.
+	 */
 	config_pending_decr(dev);
 
 	config_process_deferred(&deferred_config_queue, dev);
@@ -1872,24 +1923,39 @@ config_dump_garbage(struct devicelist *garbage)
 static int
 config_detach_enter(device_t dev)
 {
-	int error;
+	int error = 0;
 
 	mutex_enter(&config_misc_lock);
-	for (;;) {
-		if (dev->dv_pending == 0 && dev->dv_detaching == NULL) {
-			dev->dv_detaching = curlwp;
-			error = 0;
-			break;
-		}
+
+	/*
+	 * Wait until attach has fully completed, and until any
+	 * concurrent detach (e.g., drvctl racing with USB event
+	 * thread) has completed.
+	 *
+	 * Caller must hold alldevs_nread or alldevs_nwrite (e.g., via
+	 * deviter) to ensure the winner of the race doesn't free the
+	 * device leading the loser of the race into use-after-free.
+	 *
+	 * XXX Not all callers do this!
+	 */
+	while (dev->dv_pending || dev->dv_detaching) {
 		KASSERTMSG(dev->dv_detaching != curlwp,
 		    "recursively detaching %s", device_xname(dev));
 		error = cv_wait_sig(&config_misc_cv, &config_misc_lock);
 		if (error)
-			break;
+			goto out;
 	}
-	KASSERT(error || dev->dv_detaching == curlwp);
-	mutex_exit(&config_misc_lock);
 
+	/*
+	 * Attach has completed, and no other concurrent detach is
+	 * running.  Claim the device for detaching.  This will cause
+	 * all new attempts to acquire references to block.
+	 */
+	KASSERT(dev->dv_attaching == NULL);
+	KASSERT(dev->dv_detaching == NULL);
+	dev->dv_detaching = curlwp;
+
+out:	mutex_exit(&config_misc_lock);
 	return error;
 }
 
@@ -1980,9 +2046,10 @@ config_detach(device_t dev, int flags)
 	 */
 	if (rv == 0)
 		dev->dv_flags &= ~DVF_ACTIVE;
-	else if ((flags & DETACH_FORCE) == 0)
+	else if ((flags & DETACH_FORCE) == 0) {
+		/* Detach failed -- likely EBUSY.  */
 		goto out;
-	else {
+	} else {
 		panic("config_detach: forced detach of %s failed (%d)",
 		    device_xname(dev), rv);
 	}
@@ -1991,6 +2058,19 @@ config_detach(device_t dev, int flags)
 	 * The device has now been successfully detached.
 	 */
 
+	/*
+	 * Wait for all device_lookup_acquire references -- mostly, for
+	 * all attempts to open the device -- to drain.  It is the
+	 * responsibility of .ca_detach to ensure anything with open
+	 * references will be interrupted and release them promptly,
+	 * not block indefinitely.  All new attempts to acquire
+	 * references will block until dv_detaching clears.
+	 */
+	mutex_enter(&config_misc_lock);
+	localcount_drain(dev->dv_localcount,
+	    &config_misc_cv, &config_misc_lock);
+	mutex_exit(&config_misc_lock);
+
 	/* Let userland know */
 	devmon_report_device(dev, false);
 
@@ -2498,6 +2578,14 @@ config_alldevs_exit(struct alldevs_foray *af)
  * device_lookup:
  *
  *	Look up a device instance for a given driver.
+ *
+ *	Caller is responsible for ensuring the device's state is
+ *	stable, either by holding a reference already obtained with
+ *	device_lookup_acquire or by otherwise ensuring the device is
+ *	attached and can't be detached (e.g., holding an open device
+ *	node and ensuring *_detach calls vdevgone).
+ *
+ *	XXX Find a way to assert this.
  */
 device_t
 device_lookup(cfdriver_t cd, int unit)
@@ -2526,6 +2614,69 @@ device_lookup_private(cfdriver_t cd, int unit)
 	return device_private(device_lookup(cd, unit));
 }
 
+/*
+ * device_lookup_acquire:
+ *
+ *	Look up a device instance for a given driver, and return a
+ *	reference to it that must be released by device_release.
+ *
+ *	=> If the device is still attaching, blocks until *_attach has
+ *	   returned.
+ *
+ *	=> If the device is detaching, blocks until *_detach has
+ *	   returned.  May succeed or fail in that case, depending on
+ *	   whether *_detach has backed out (EBUSY) or committed to
+ *	   detaching.
+ */
+device_t
+device_lookup_acquire(cfdriver_t cd, int unit)
+{
+	device_t dv;
+
+	/* XXX This should have a pserialized fast path -- TBD.  */
+	mutex_enter(&config_misc_lock);
+	mutex_enter(&alldevs_lock);
+retry:	if (unit < 0 || unit >= cd->cd_ndevs ||
+	    (dv = cd->cd_devs[unit]) == NULL ||
+	    dv->dv_del_gen != 0) {
+		dv = NULL;
+	} else {
+		/*
+		 * Wait for the device to stabilize, if attaching or
+		 * detaching.  Either way we must wait for *_attach or
+		 * *_detach to complete, and either way we must retry:
+		 * even if detaching, *_detach might fail (EBUSY) so
+		 * the device may still be there.
+		 */
+		if ((dv->dv_attaching != NULL && dv->dv_attaching != curlwp) ||
+		    dv->dv_detaching != NULL) {
+			mutex_exit(&alldevs_lock);
+			cv_wait(&config_misc_cv, &config_misc_lock);
+			mutex_enter(&alldevs_lock);
+			goto retry;
+		}
+		localcount_acquire(dv->dv_localcount);
+	}
+	mutex_exit(&alldevs_lock);
+	mutex_exit(&config_misc_lock);
+
+	return dv;
+}
+
+/*
+ * device_release:
+ *
+ *	Release a reference to a device acquired with
+ *	device_lookup_acquire.
+ */
+void
+device_release(device_t dv)
+{
+
+	localcount_release(dv->dv_localcount,
+	    &config_misc_cv, &config_misc_lock);
+}
+
 /*
  * device_find_by_xname:
  *
diff --git a/sys/kern/subr_devsw.c b/sys/kern/subr_devsw.c
index 1a0f721fdd65..8b55187b32c1 100644
--- a/sys/kern/subr_devsw.c
+++ b/sys/kern/subr_devsw.c
@@ -85,6 +85,11 @@ __KERNEL_RCSID(0, "$NetBSD: subr_devsw.c,v 1.38 2017/11/07 18:35:57 christos Exp
 #include <sys/buf.h>
 #include <sys/reboot.h>
 #include <sys/sdt.h>
+#include <sys/atomic.h>
+#include <sys/localcount.h>
+#include <sys/pserialize.h>
+#include <sys/xcall.h>
+#include <sys/device.h>
 
 #ifdef DEVSW_DEBUG
 #define	DPRINTF(x)	printf x
@@ -97,12 +102,22 @@ __KERNEL_RCSID(0, "$NetBSD: subr_devsw.c,v 1.38 2017/11/07 18:35:57 christos Exp
 #define	CDEVSW_SIZE	(sizeof(struct cdevsw *))
 #define	DEVSWCONV_SIZE	(sizeof(struct devsw_conv))
 
+struct devswref {
+	struct localcount	*dr_lc;
+	bool			dr_dynamic;
+};
+
+/* XXX bdevsw, cdevsw, max_bdevsws, and max_cdevsws should be volatile */
 extern const struct bdevsw **bdevsw, *bdevsw0[];
 extern const struct cdevsw **cdevsw, *cdevsw0[];
 extern struct devsw_conv *devsw_conv, devsw_conv0[];
 extern const int sys_bdevsws, sys_cdevsws;
 extern int max_bdevsws, max_cdevsws, max_devsw_convs;
 
+static struct devswref *cdevswref;
+static struct devswref *bdevswref;
+static kcondvar_t devsw_cv;
+
 static int bdevsw_attach(const struct bdevsw *, devmajor_t *);
 static int cdevsw_attach(const struct cdevsw *, devmajor_t *);
 static void devsw_detach_locked(const struct bdevsw *, const struct cdevsw *);
@@ -118,6 +133,8 @@ devsw_init(void)
 	KASSERT(sys_bdevsws < MAXDEVSW - 1);
 	KASSERT(sys_cdevsws < MAXDEVSW - 1);
 	mutex_init(&device_lock, MUTEX_DEFAULT, IPL_NONE);
+
+	cv_init(&devsw_cv, "devsw");
 }
 
 int
@@ -158,15 +175,12 @@ devsw_attach(const char *devname,
 			error = EEXIST;
 			goto fail;
 		}
-
-		if (bdev != NULL)
-			bdevsw[*bmajor] = bdev;
-		cdevsw[*cmajor] = cdev;
-
-		mutex_exit(&device_lock);
-		return (0);
 	}
 
+	/*
+	 * XXX This should allocate what it needs up front so we never
+	 * need to flail around trying to unwind.
+	 */
 	error = bdevsw_attach(bdev, bmajor);
 	if (error != 0) 
 		goto fail;
@@ -176,6 +190,13 @@ devsw_attach(const char *devname,
 		goto fail;
 	}
 
+	/*
+	 * If we already found a conv, we're done.  Otherwise, find an
+	 * empty slot or extend the table.
+	 */
+	if (i == max_devsw_convs)
+		goto fail;
+
 	for (i = 0 ; i < max_devsw_convs ; i++) {
 		if (devsw_conv[i].d_name == NULL)
 			break;
@@ -224,7 +245,9 @@ devsw_attach(const char *devname,
 static int
 bdevsw_attach(const struct bdevsw *devsw, devmajor_t *devmajor)
 {
-	const struct bdevsw **newptr;
+	const struct bdevsw **newbdevsw = NULL;
+	struct devswref *newbdevswref = NULL;
+	struct localcount *lc;
 	devmajor_t bmajor;
 	int i;
 
@@ -253,20 +276,35 @@ bdevsw_attach(const struct bdevsw *devsw, devmajor_t *devmajor)
 		return (ENOMEM);
 	}
 
+	if (bdevswref == NULL) {
+		newbdevswref = kmem_zalloc(MAXDEVSW * sizeof(newbdevswref[0]),
+		    KM_NOSLEEP);
+		if (newbdevswref == NULL)
+			return ENOMEM;
+		atomic_store_release(&bdevswref, newbdevswref);
+	}
+
 	if (*devmajor >= max_bdevsws) {
 		KASSERT(bdevsw == bdevsw0);
-		newptr = kmem_zalloc(MAXDEVSW * BDEVSW_SIZE, KM_NOSLEEP);
-		if (newptr == NULL)
-			return (ENOMEM);
-		memcpy(newptr, bdevsw, max_bdevsws * BDEVSW_SIZE);
-		bdevsw = newptr;
-		max_bdevsws = MAXDEVSW;
+		newbdevsw = kmem_zalloc(MAXDEVSW * sizeof(newbdevsw[0]),
+		    KM_NOSLEEP);
+                if (newbdevsw == NULL)
+                        return ENOMEM;
+		memcpy(newbdevsw, bdevsw, max_bdevsws * sizeof(bdevsw[0]));
+		atomic_store_release(&bdevsw, newbdevsw);
+		atomic_store_release(&max_bdevsws, MAXDEVSW);
 	}
 
 	if (bdevsw[*devmajor] != NULL)
 		return (EEXIST);
 
-	bdevsw[*devmajor] = devsw;
+	KASSERT(bdevswref[*devmajor].dr_lc == NULL);
+	lc = kmem_zalloc(sizeof(*lc), KM_SLEEP);
+	localcount_init(lc);
+	bdevswref[*devmajor].dr_lc = lc;
+	bdevswref[*devmajor].dr_dynamic = true;
+
+	atomic_store_release(&bdevsw[*devmajor], devsw);
 
 	return (0);
 }
@@ -274,7 +312,9 @@ bdevsw_attach(const struct bdevsw *devsw, devmajor_t *devmajor)
 static int
 cdevsw_attach(const struct cdevsw *devsw, devmajor_t *devmajor)
 {
-	const struct cdevsw **newptr;
+	const struct cdevsw **newcdevsw = NULL;
+	struct devswref *newcdevswref = NULL;
+	struct localcount *lc;
 	devmajor_t cmajor;
 	int i;
 
@@ -300,20 +340,35 @@ cdevsw_attach(const struct cdevsw *devsw, devmajor_t *devmajor)
 		return (ENOMEM);
 	}
 
+	if (cdevswref == NULL) {
+		newcdevswref = kmem_zalloc(MAXDEVSW * sizeof(newcdevswref[0]),
+		    KM_NOSLEEP);
+		if (newcdevswref == NULL)
+			return ENOMEM;
+		atomic_store_release(&cdevswref, newcdevswref);
+	}
+
 	if (*devmajor >= max_cdevsws) {
 		KASSERT(cdevsw == cdevsw0);
-		newptr = kmem_zalloc(MAXDEVSW * CDEVSW_SIZE, KM_NOSLEEP);
-		if (newptr == NULL)
-			return (ENOMEM);
-		memcpy(newptr, cdevsw, max_cdevsws * CDEVSW_SIZE);
-		cdevsw = newptr;
-		max_cdevsws = MAXDEVSW;
+		newcdevsw = kmem_zalloc(MAXDEVSW * sizeof(newcdevsw[0]),
+		    KM_NOSLEEP);
+                if (newcdevsw == NULL)
+                        return ENOMEM;
+		memcpy(newcdevsw, cdevsw, max_cdevsws * sizeof(cdevsw[0]));
+		atomic_store_release(&cdevsw, newcdevsw);
+		atomic_store_release(&max_cdevsws, MAXDEVSW);
 	}
 
 	if (cdevsw[*devmajor] != NULL)
 		return (EEXIST);
 
-	cdevsw[*devmajor] = devsw;
+	KASSERT(cdevswref[*devmajor].dr_lc == NULL);
+	lc = kmem_zalloc(sizeof(*lc), KM_SLEEP);
+	localcount_init(lc);
+	cdevswref[*devmajor].dr_lc = lc;
+	cdevswref[*devmajor].dr_dynamic = true;
+
+	atomic_store_release(&cdevsw[*devmajor], devsw);
 
 	return (0);
 }
@@ -321,36 +376,75 @@ cdevsw_attach(const struct cdevsw *devsw, devmajor_t *devmajor)
 static void
 devsw_detach_locked(const struct bdevsw *bdev, const struct cdevsw *cdev)
 {
-	int i;
+	int bi, ci = -1/*XXXGCC*/;
 
 	KASSERT(mutex_owned(&device_lock));
 
+	/* Prevent new references.  */
 	if (bdev != NULL) {
-		for (i = 0 ; i < max_bdevsws ; i++) {
-			if (bdevsw[i] != bdev)
+		for (bi = 0; bi < max_bdevsws; bi++) {
+			if (bdevsw[bi] != bdev)
 				continue;
-			bdevsw[i] = NULL;
+			atomic_store_relaxed(&bdevsw[bi], NULL);
 			break;
 		}
+		KASSERT(bi < max_bdevsws);
 	}
 	if (cdev != NULL) {
-		for (i = 0 ; i < max_cdevsws ; i++) {
-			if (cdevsw[i] != cdev)
+		for (ci = 0; ci < max_cdevsws; ci++) {
+			if (cdevsw[ci] != cdev)
 				continue;
-			cdevsw[i] = NULL;
+			atomic_store_relaxed(&cdevsw[ci], NULL);
 			break;
 		}
+		KASSERT(ci < max_cdevsws);
+	}
+
+	if (bdev == NULL && cdev == NULL) /* XXX possible? */
+		return;
+
+	/*
+	 * Wait for all bdevsw_lookup_acquire, cdevsw_lookup_acquire
+	 * calls to notice that the devsw is gone.
+	 *
+	 * XXX Can't use pserialize_perform here because devsw_init is
+	 * too early for pserialize_create().
+	 */
+	xc_barrier(0);
+
+	/*
+	 * Wait for all references to drain.  It is the caller's
+	 * responsibility to ensure that at this point, there are no
+	 * extant open instances and all new d_open calls will fail.
+	 *
+	 * Note that localcount_drain may release and reacquire
+	 * device_lock.
+	 */
+	if (bdev != NULL) {
+		localcount_drain(bdevswref[bi].dr_lc,
+		    &devsw_cv, &device_lock);
+		localcount_fini(bdevswref[bi].dr_lc);
+		kmem_free(bdevswref[bi].dr_lc, sizeof(*bdevswref[bi].dr_lc));
+		bdevswref[bi].dr_lc = NULL;
+		bdevswref[bi].dr_dynamic = false;
+	}
+	if (cdev != NULL) {
+		localcount_drain(cdevswref[ci].dr_lc,
+		    &devsw_cv, &device_lock);
+		localcount_fini(cdevswref[ci].dr_lc);
+		kmem_free(cdevswref[ci].dr_lc, sizeof(*cdevswref[ci].dr_lc));
+		cdevswref[ci].dr_lc = NULL;
+		cdevswref[ci].dr_dynamic = false;
 	}
 }
 
-int
+void
 devsw_detach(const struct bdevsw *bdev, const struct cdevsw *cdev)
 {
 
 	mutex_enter(&device_lock);
 	devsw_detach_locked(bdev, cdev);
 	mutex_exit(&device_lock);
-	return 0;
 }
 
 /*
@@ -366,10 +460,60 @@ bdevsw_lookup(dev_t dev)
 	if (dev == NODEV)
 		return (NULL);
 	bmajor = major(dev);
-	if (bmajor < 0 || bmajor >= max_bdevsws)
+	if (bmajor < 0 || bmajor >= atomic_load_relaxed(&max_bdevsws))
 		return (NULL);
 
-	return (bdevsw[bmajor]);
+	return atomic_load_consume(&bdevsw)[bmajor];
+}
+
+static const struct bdevsw *
+bdevsw_lookup_acquire(dev_t dev, struct localcount **lcp)
+{
+	devmajor_t bmajor;
+	const struct bdevsw *bdev = NULL, *const *curbdevsw;
+	struct devswref *curbdevswref;
+	int s;
+
+	if (dev == NODEV)
+		return NULL;
+	bmajor = major(dev);
+	if (bmajor < 0)
+		return NULL;
+
+	s = pserialize_read_enter();
+
+	/*
+	 * max_bdevsws never goes down, so it is safe to rely on this
+	 * condition without any locking for the array access below.
+	 * Test sys_bdevsws first so we can avoid the memory barrier in
+	 * that case.
+	 */
+	if (bmajor >= sys_bdevsws &&
+	    bmajor >= atomic_load_acquire(&max_bdevsws))
+		goto out;
+	curbdevsw = atomic_load_consume(&bdevsw);
+	if ((bdev = atomic_load_consume(&curbdevsw[bmajor])) == NULL)
+		goto out;
+
+	curbdevswref = atomic_load_consume(&bdevswref);
+	if (curbdevswref == NULL || !curbdevswref[bmajor].dr_dynamic) {
+		*lcp = NULL;
+	} else {
+		*lcp = curbdevswref[bmajor].dr_lc;
+		localcount_acquire(*lcp);
+	}
+
+out:	pserialize_read_exit(s);
+	return bdev;
+}
+
+static void
+bdevsw_release(const struct bdevsw *bdev, struct localcount *lc)
+{
+
+	if (lc == NULL)
+		return;
+	localcount_release(lc, &devsw_cv, &device_lock);
 }
 
 /*
@@ -385,10 +529,60 @@ cdevsw_lookup(dev_t dev)
 	if (dev == NODEV)
 		return (NULL);
 	cmajor = major(dev);
-	if (cmajor < 0 || cmajor >= max_cdevsws)
+	if (cmajor < 0 || cmajor >= atomic_load_relaxed(&max_cdevsws))
 		return (NULL);
 
-	return (cdevsw[cmajor]);
+	return atomic_load_consume(&cdevsw)[cmajor];
+}
+
+static const struct cdevsw *
+cdevsw_lookup_acquire(dev_t dev, struct localcount **lcp)
+{
+	devmajor_t cmajor;
+	const struct cdevsw *cdev = NULL, *const *curcdevsw;
+	struct devswref *curcdevswref;
+	int s;
+
+	if (dev == NODEV)
+		return NULL;
+	cmajor = major(dev);
+	if (cmajor < 0)
+		return NULL;
+
+	s = pserialize_read_enter();
+
+	/*
+	 * max_cdevsws never goes down, so it is safe to rely on this
+	 * condition without any locking for the array access below.
+	 * Test sys_cdevsws first so we can avoid the memory barrier in
+	 * that case.
+	 */
+	if (cmajor >= sys_cdevsws &&
+	    cmajor >= atomic_load_acquire(&max_cdevsws))
+		goto out;
+	curcdevsw = atomic_load_consume(&cdevsw);
+	if ((cdev = atomic_load_consume(&curcdevsw[cmajor])) == NULL)
+		goto out;
+
+	curcdevswref = atomic_load_consume(&cdevswref);
+	if (curcdevswref == NULL || !curcdevswref[cmajor].dr_dynamic) {
+		*lcp = NULL;
+	} else {
+		*lcp = curcdevswref[cmajor].dr_lc;
+		localcount_acquire(*lcp);
+	}
+
+out:	pserialize_read_exit(s);
+	return cdev;
+}
+
+static void
+cdevsw_release(const struct cdevsw *cdev, struct localcount *lc)
+{
+
+	if (lc == NULL)
+		return;
+	localcount_release(lc, &devsw_cv, &device_lock);
 }
 
 /*
@@ -400,10 +594,13 @@ cdevsw_lookup(dev_t dev)
 devmajor_t
 bdevsw_lookup_major(const struct bdevsw *bdev)
 {
-	devmajor_t bmajor;
+	const struct bdevsw *const *curbdevsw;
+	devmajor_t bmajor, bmax;
 
-	for (bmajor = 0 ; bmajor < max_bdevsws ; bmajor++) {
-		if (bdevsw[bmajor] == bdev)
+	bmax = atomic_load_acquire(&max_bdevsws);
+	curbdevsw = atomic_load_consume(&bdevsw);
+	for (bmajor = 0; bmajor < bmax; bmajor++) {
+		if (atomic_load_relaxed(&curbdevsw[bmajor]) == bdev)
 			return (bmajor);
 	}
 
@@ -419,10 +616,13 @@ bdevsw_lookup_major(const struct bdevsw *bdev)
 devmajor_t
 cdevsw_lookup_major(const struct cdevsw *cdev)
 {
-	devmajor_t cmajor;
+	const struct cdevsw *const *curcdevsw;
+	devmajor_t cmajor, cmax;
 
-	for (cmajor = 0 ; cmajor < max_cdevsws ; cmajor++) {
-		if (cdevsw[cmajor] == cdev)
+	cmax = atomic_load_acquire(&max_cdevsws);
+	curcdevsw = atomic_load_consume(&cdevsw);
+	for (cmajor = 0; cmajor < cmax; cmajor++) {
+		if (atomic_load_relaxed(&curcdevsw[cmajor]) == cdev)
 			return (cmajor);
 	}
 
@@ -697,22 +897,41 @@ int
 bdev_open(dev_t dev, int flag, int devtype, lwp_t *l)
 {
 	const struct bdevsw *d;
-	int rv, mpflag;
+	struct localcount *lc;
+	device_t dv = NULL/*XXXGCC*/;
+	int unit, rv, mpflag;
 
-	/*
-	 * For open we need to lock, in order to synchronize
-	 * with attach/detach.
-	 */
-	mutex_enter(&device_lock);
-	d = bdevsw_lookup(dev);
-	mutex_exit(&device_lock);
+	d = bdevsw_lookup_acquire(dev, &lc);
 	if (d == NULL)
 		return ENXIO;
 
+	if (d->d_devtounit) {
+		/*
+		 * If the device node corresponds to an autoconf device
+		 * instance, acquire a reference to it so that during
+		 * d_open, device_lookup is stable.
+		 *
+		 * XXX This should also arrange to instantiate cloning
+		 * pseudo-devices if appropriate, but that requires
+		 * reviewing them all to find and verify a common
+		 * pattern.
+		 */
+		if ((unit = (*d->d_devtounit)(dev)) == -1)
+			return ENXIO;
+		if ((dv = device_lookup_acquire(d->d_cfdriver, unit)) == NULL)
+			return ENXIO;
+	}
+
 	DEV_LOCK(d);
 	rv = (*d->d_open)(dev, flag, devtype, l);
 	DEV_UNLOCK(d);
 
+	if (d->d_devtounit) {
+		device_release(dv);
+	}
+
+	bdevsw_release(d, lc);
+
 	return rv;
 }
 
@@ -855,22 +1074,41 @@ int
 cdev_open(dev_t dev, int flag, int devtype, lwp_t *l)
 {
 	const struct cdevsw *d;
-	int rv, mpflag;
+	struct localcount *lc;
+	device_t dv = NULL/*XXXGCC*/;
+	int unit, rv, mpflag;
 
-	/*
-	 * For open we need to lock, in order to synchronize
-	 * with attach/detach.
-	 */
-	mutex_enter(&device_lock);
-	d = cdevsw_lookup(dev);
-	mutex_exit(&device_lock);
+	d = cdevsw_lookup_acquire(dev, &lc);
 	if (d == NULL)
 		return ENXIO;
 
+	if (d->d_devtounit) {
+		/*
+		 * If the device node corresponds to an autoconf device
+		 * instance, acquire a reference to it so that during
+		 * d_open, device_lookup is stable.
+		 *
+		 * XXX This should also arrange to instantiate cloning
+		 * pseudo-devices if appropriate, but that requires
+		 * reviewing them all to find and verify a common
+		 * pattern.
+		 */
+		if ((unit = (*d->d_devtounit)(dev)) == -1)
+			return ENXIO;
+		if ((dv = device_lookup_acquire(d->d_cfdriver, unit)) == NULL)
+			return ENXIO;
+	}
+
 	DEV_LOCK(d);
 	rv = (*d->d_open)(dev, flag, devtype, l);
 	DEV_UNLOCK(d);
 
+	if (d->d_devtounit) {
+		device_release(dv);
+	}
+
+	cdevsw_release(d, lc);
+
 	return rv;
 }
 
@@ -1063,3 +1301,18 @@ nommap(dev_t dev, off_t off, int prot)
 
 	return (paddr_t)-1;
 }
+
+/*
+ * dev_minor_unit(dev)
+ *
+ *	Returns minor(dev) as an int.  Intended for use with struct
+ *	bdevsw, cdevsw::d_devtounit for drivers whose /dev nodes are
+ *	implemented by reference to an autoconf instance with the minor
+ *	number.
+ */
+int
+dev_minor_unit(dev_t dev)
+{
+
+	return minor(dev);
+}
diff --git a/sys/kern/subr_disk.c b/sys/kern/subr_disk.c
index da664f920382..41218421db57 100644
--- a/sys/kern/subr_disk.c
+++ b/sys/kern/subr_disk.c
@@ -728,3 +728,10 @@ disk_set_info(device_t dev, struct disk *dk, const char *type)
 	if (odisk_info)
 		prop_object_release(odisk_info);
 }
+
+int
+disklabel_dev_unit(dev_t dev)
+{
+
+	return DISKUNIT(dev);
+}
diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index b4bc4c34ab03..987cd5969d95 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -81,10 +81,19 @@ __KERNEL_RCSID(0, "$NetBSD: spec_vnops.c,v 1.183 2021/07/18 23:57:14 dholland Ex
 #include <sys/kauth.h>
 #include <sys/fstrans.h>
 #include <sys/module.h>
+#include <sys/atomic.h>
 
 #include <miscfs/genfs/genfs.h>
 #include <miscfs/specfs/specdev.h>
 
+/*
+ * Lock order:
+ *
+ *	vnode lock
+ *	-> device_lock
+ *	-> struct vnode::v_interlock
+ */
+
 /* symbolic sleep message strings for devices */
 const char	devopn[] = "devopn";
 const char	devio[] = "devio";
@@ -165,6 +174,7 @@ const struct vnodeopv_desc spec_vnodeop_opv_desc =
 	{ &spec_vnodeop_p, spec_vnodeop_entries };
 
 static kauth_listener_t rawio_listener;
+static struct kcondvar specfs_iocv;
 
 /* Returns true if vnode is /dev/mem or /dev/kmem. */
 bool
@@ -210,6 +220,123 @@ spec_init(void)
 
 	rawio_listener = kauth_listen_scope(KAUTH_SCOPE_DEVICE,
 	    rawio_listener_cb, NULL);
+	cv_init(&specfs_iocv, "specio");
+}
+
+/*
+ * spec_io_enter(vp, &sn, &dev)
+ *
+ *	Enter an operation that may not hold vp's vnode lock or an
+ *	fstrans on vp's mount.  Until spec_io_exit, the vnode will not
+ *	be revoked.
+ *
+ *	On success, set sn to the specnode pointer and dev to the dev_t
+ *	number and return zero.  Caller must later call spec_io_exit
+ *	when done.
+ *
+ *	On failure, return ENXIO -- the device has been revoked and no
+ *	longer exists.
+ */
+static int
+spec_io_enter(struct vnode *vp, struct specnode **snp, dev_t *devp)
+{
+	dev_t dev;
+	struct specnode *sn;
+	unsigned iocnt;
+	int error = 0;
+
+	mutex_enter(vp->v_interlock);
+
+	/*
+	 * Extract all the info we need from the vnode, unless the
+	 * vnode has already been reclaimed.  This can happen if the
+	 * underlying device has been removed and all the device nodes
+	 * for it have been revoked.  The caller may not hold a vnode
+	 * lock or fstrans to prevent this from happening before it has
+	 * had an opportunity to notice the vnode is dead.
+	 */
+	if (vdead_check(vp, VDEAD_NOWAIT) != 0 ||
+	    (sn = vp->v_specnode) == NULL ||
+	    (dev = vp->v_rdev) == NODEV) {
+		error = ENXIO;
+		goto out;
+	}
+
+	/*
+	 * Notify spec_node_revoke that we are doing an I/O operation
+	 * which may not be not bracketed by fstrans(9) and thus is not
+	 * blocked by vfs suspension.
+	 *
+	 * We could hold this reference with psref(9) instead, but we
+	 * already have to take the interlock for vdead_check, so
+	 * there's not much more cost here to another atomic operation.
+	 */
+	iocnt = atomic_inc_uint_nv(&sn->sn_iocnt);
+	CTASSERT(MAXLWP < UINT_MAX);
+	KASSERT(iocnt < UINT_MAX);
+
+	/* Success!  */
+	*snp = sn;
+	*devp = dev;
+	error = 0;
+
+out:	mutex_exit(vp->v_interlock);
+	return error;
+}
+
+/*
+ * spec_io_exit(vp, sn)
+ *
+ *	Exit an operation entered with a successful spec_io_enter --
+ *	allow concurrent spec_node_revoke to proceed.  The argument sn
+ *	must match the struct specnode pointer returned by spec_io_exit
+ *	for vp.
+ */
+static void
+spec_io_exit(struct vnode *vp, struct specnode *sn)
+{
+	unsigned iocnt;
+
+	KASSERT(vp->v_specnode == sn);
+
+	/*
+	 * We are done.  Notify spec_node_revoke if appropriate.  The
+	 * transition of 1 -> 0 must happen under device_lock so
+	 * spec_node_revoke doesn't miss a wakeup.
+	 */
+	do {
+		iocnt = atomic_load_relaxed(&sn->sn_iocnt);
+		if (iocnt == 1) {
+			mutex_enter(&device_lock);
+			if (atomic_dec_uint_nv(&sn->sn_iocnt) == 0)
+				cv_broadcast(&specfs_iocv);
+			mutex_exit(&device_lock);
+			break;
+		}
+	} while (atomic_cas_uint(&sn->sn_iocnt, iocnt, iocnt - 1) != iocnt);
+}
+
+/*
+ * spec_io_drain(vp, sn)
+ *
+ *	Wait for all existing spec_io_enter/exit sections to complete.
+ *	Caller must ensure spec_io_enter will fail at this point.
+ */
+static void
+spec_io_drain(struct vnode *vp, struct specnode *sn)
+{
+
+	/*
+	 * I/O at the same time as closing is unlikely -- it often
+	 * indicates an application bug.
+	 */
+	if (__predict_true(atomic_load_relaxed(&sn->sn_iocnt) == 0))
+		return;
+
+	mutex_enter(&device_lock);
+	while (atomic_load_relaxed(&sn->sn_iocnt) != 0)
+		cv_wait(&specfs_iocv, &device_lock);
+	mutex_exit(&device_lock);
 }
 
 /*
@@ -250,6 +377,7 @@ spec_node_init(vnode_t *vp, dev_t rdev)
 		sd->sd_refcnt = 1;
 		sd->sd_opencnt = 0;
 		sd->sd_bdevvp = NULL;
+		sd->sd_opened = false;
 		sn->sn_dev = sd;
 		sd = NULL;
 	} else {
@@ -261,6 +389,7 @@ spec_node_init(vnode_t *vp, dev_t rdev)
 	sn->sn_opencnt = 0;
 	sn->sn_rdev = rdev;
 	sn->sn_gone = false;
+	sn->sn_iocnt = 0;
 	vp->v_specnode = sn;
 	vp->v_specnext = *vpp;
 	*vpp = vp;
@@ -373,6 +502,8 @@ spec_node_setmountedfs(vnode_t *devvp, struct mount *mp)
 
 	KASSERT(devvp->v_type == VBLK);
 	KASSERT(devvp->v_specnode->sn_dev->sd_mountpoint == NULL || mp == NULL);
+	KASSERT(devvp->v_specnode->sn_opencnt);
+
 	devvp->v_specnode->sn_dev->sd_mountpoint = mp;
 	if (mp == NULL)
 		return;
@@ -415,6 +546,14 @@ spec_node_revoke(vnode_t *vp)
 		KASSERT(sn->sn_opencnt == 0);
 	}
 	mutex_exit(&device_lock);
+
+	/*
+	 * Wait for all other devsw operations to drain.  After this
+	 * point, no bdev/cdev_* can be active for this specnode.
+	 * Note: We drain even if we witnessed sn->sn_opencnt == 0,
+	 * because there may have been a concurrent VOP_CLOSE.
+	 */
+	spec_io_drain(vp, sn);
 }
 
 /*
@@ -435,6 +574,7 @@ spec_node_destroy(vnode_t *vp)
 	KASSERT(vp->v_type == VBLK || vp->v_type == VCHR);
 	KASSERT(vp->v_specnode != NULL);
 	KASSERT(sn->sn_opencnt == 0);
+	KASSERT(sn->sn_iocnt == 0);
 
 	mutex_enter(&device_lock);
 	/* Remove from the hash and destroy the node. */
@@ -498,26 +638,28 @@ spec_open(void *v)
 		int  a_mode;
 		kauth_cred_t a_cred;
 	} */ *ap = v;
-	struct lwp *l;
-	struct vnode *vp;
-	dev_t dev;
+	struct lwp *l = curlwp;
+	struct vnode *vp = ap->a_vp;
+	dev_t dev, dev1;
 	int error;
 	enum kauth_device_req req;
-	specnode_t *sn;
+	specnode_t *sn, *sn1;
 	specdev_t *sd;
 	spec_ioctl_t ioctl;
-	u_int gen;
-	const char *name;
+	u_int gen = 0;
+	const char *name = NULL;
+	bool needclose = false;
 	struct partinfo pi;
-	
-	l = curlwp;
-	vp = ap->a_vp;
+
+	KASSERT(VOP_ISLOCKED(vp) == LK_EXCLUSIVE ||
+	    (vp->v_vflag & VV_LOCKSWORK) == 0);
+	KASSERTMSG(vp->v_type == VBLK || vp->v_type == VCHR, "type=%d",
+	    vp->v_type);
+
 	dev = vp->v_rdev;
 	sn = vp->v_specnode;
 	sd = sn->sn_dev;
-	name = NULL;
-	gen = 0;
-	
+
 	/*
 	 * Don't allow open if fs is mounted -nodev.
 	 */
@@ -535,28 +677,101 @@ spec_open(void *v)
 		req = KAUTH_REQ_DEVICE_RAWIO_SPEC_READ;
 		break;
 	}
+	error = kauth_authorize_device_spec(ap->a_cred, req, vp);
+	if (error != 0)
+		return (error);
 
+	/*
+	 * Acquire an open reference -- as long as we hold onto it, and
+	 * the vnode isn't revoked, it can't be closed.
+	 */
+	mutex_enter(&device_lock);
+	KASSERT(!sn->sn_gone);
 	switch (vp->v_type) {
 	case VCHR:
-		error = kauth_authorize_device_spec(ap->a_cred, req, vp);
-		if (error != 0)
-			return (error);
-
 		/*
 		 * Character devices can accept opens from multiple
 		 * vnodes.
 		 */
-		mutex_enter(&device_lock);
-		if (sn->sn_gone) {
-			mutex_exit(&device_lock);
-			return (EBADF);
-		}
 		sd->sd_opencnt++;
 		sn->sn_opencnt++;
-		mutex_exit(&device_lock);
+		break;
+	case VBLK:
+		/*
+		 * For block devices, permit only one open.  The buffer
+		 * cache cannot remain self-consistent with multiple
+		 * vnodes holding a block device open.
+		 *
+		 * Treat zero opencnt with non-NULL mountpoint as open.
+		 * This may happen after forced detach of a mounted device.
+		 */
+		if (sd->sd_opencnt != 0 || sd->sd_mountpoint != NULL) {
+			error = EBUSY;
+			break;
+		}
+		KASSERTMSG(sn->sn_opencnt == 0, "%u", sn->sn_opencnt);
+		sn->sn_opencnt = 1;
+		sd->sd_opencnt = 1;
+		sd->sd_bdevvp = vp;
+		break;
+	default:
+		panic("invalid specfs vnode type: %d", vp->v_type);
+	}
+	mutex_exit(&device_lock);
+	if (error)
+		return error;
+
+	/*
+	 * Set VV_ISTTY if this is a tty cdev.
+	 *
+	 * XXX This does the wrong thing if the module has to be
+	 * autoloaded.  We should maybe set this after autoloading
+	 * modules and calling .d_open successfully, except (a) we need
+	 * the vnode lock to touch it, and (b) once we acquire the
+	 * vnode lock again, the vnode may have been revoked, and
+	 * deadfs's dead_read needs VV_ISTTY to be already set in order
+	 * to return the right answer.  So this needs some additional
+	 * synchronization to be made to work correctly with tty driver
+	 * module autoload.  For now, let's just hope it doesn't cause
+	 * too much trouble for a tty from an autoloaded driver module
+	 * to fail with EIO instead of returning EOF.
+	 */
+	if (vp->v_type == VCHR) {
 		if (cdev_type(dev) == D_TTY)
 			vp->v_vflag |= VV_ISTTY;
-		VOP_UNLOCK(vp);
+	}
+
+	/*
+	 * Because the opening the device may block indefinitely,
+	 * e.g. when opening a tty, and loading a module may cross into
+	 * many other subsystems, we must not hold the vnode lock while
+	 * calling .d_open, so release it now and reacquire it when
+	 * done.
+	 *
+	 * Take an I/O reference so that any concurrent
+	 * spec_node_revoke will wait for us to finish calling .d_open.
+	 * The vnode can't be dead at this point because we have it
+	 * locked.  Note that if revoked, the driver will interrupt
+	 * .d_open before spec_node_revoke starts waiting for I/O to
+	 * drain so this doesn't deadlock.
+	 */
+	VOP_UNLOCK(vp);
+	error = spec_io_enter(vp, &sn1, &dev1);
+	if (error) {
+		vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
+		return error;
+	}
+	KASSERT(sn1 == sn);
+	KASSERT(dev1 == dev);
+
+	/*
+	 * Open the device.  If .d_open returns ENXIO (device not
+	 * configured), the driver may not be loaded, so try
+	 * autoloading a module and then try .d_open again if anything
+	 * got loaded.
+	 */
+	switch (vp->v_type) {
+	case VCHR:
 		do {
 			const struct cdevsw *cdev;
 
@@ -579,36 +794,9 @@ spec_open(void *v)
 			/* Try to autoload device module */
 			(void) module_autoload(name, MODULE_CLASS_DRIVER);
 		} while (gen != module_gen);
-
-		vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
 		break;
 
 	case VBLK:
-		error = kauth_authorize_device_spec(ap->a_cred, req, vp);
-		if (error != 0)
-			return (error);
-
-		/*
-		 * For block devices, permit only one open.  The buffer
-		 * cache cannot remain self-consistent with multiple
-		 * vnodes holding a block device open.
-		 *
-		 * Treat zero opencnt with non-NULL mountpoint as open.
-		 * This may happen after forced detach of a mounted device.
-		 */
-		mutex_enter(&device_lock);
-		if (sn->sn_gone) {
-			mutex_exit(&device_lock);
-			return (EBADF);
-		}
-		if (sd->sd_opencnt != 0 || sd->sd_mountpoint != NULL) {
-			mutex_exit(&device_lock);
-			return EBUSY;
-		}
-		sn->sn_opencnt = 1;
-		sd->sd_opencnt = 1;
-		sd->sd_bdevvp = vp;
-		mutex_exit(&device_lock);
 		do {
 			const struct bdevsw *bdev;
 
@@ -628,49 +816,118 @@ spec_open(void *v)
 			if ((name = bdevsw_getname(major(dev))) == NULL)
 				break;
 
-			VOP_UNLOCK(vp);
-
                         /* Try to autoload device module */
 			(void) module_autoload(name, MODULE_CLASS_DRIVER);
-			
-			vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
 		} while (gen != module_gen);
-
 		break;
 
-	case VNON:
-	case VLNK:
-	case VDIR:
-	case VREG:
-	case VBAD:
-	case VFIFO:
-	case VSOCK:
 	default:
-		return 0;
+		__unreachable();
 	}
 
+	/*
+	 * Release the I/O reference now that we have called .d_open,
+	 * and reacquire the vnode lock.  At this point, the device may
+	 * have been revoked, so we must tread carefully -- can't touch
+	 * sn or sd until we verify the vnode is not dead.
+	 */
+	spec_io_exit(vp, sn);
+	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
+
+	/*
+	 * If it has been revoked since we released the vnode lock and
+	 * reacquired it, then spec_node_revoke has closed it, and we
+	 * must fail with EBADF without touching sn or sd which may be
+	 * freed at this point.
+	 *
+	 * Otherwise, if opening it failed, back out and release the
+	 * open reference.  If it was ever successfully opened and we
+	 * got the last reference this way, it's now our job to close
+	 * it.  This might happen in the following scenario:
+	 *
+	 *	Thread 1		Thread 2
+	 *	VOP_OPEN
+	 *	  ...
+	 *	  .d_open -> 0 (success)
+	 *	  acquire vnode lock
+	 *	  do stuff		VOP_OPEN
+	 *	  release vnode lock	...
+	 *				  .d_open -> EBUSY
+	 *	VOP_CLOSE
+	 *	  acquire vnode lock
+	 *	  --sd_opencnt != 0
+	 *	  => no .d_close
+	 *	  release vnode lock
+	 *				  acquire vnode lock
+	 *				  --sd_opencnt == 0
+	 *
+	 * We can't resolve this by making spec_close wait for .d_open
+	 * to complete before examining sd_opencnt, because .d_open can
+	 * hang indefinitely, e.g. for a tty.
+	 */
 	mutex_enter(&device_lock);
-	if (sn->sn_gone) {
+	mutex_enter(vp->v_interlock);
+	if (vdead_check(vp, VDEAD_NOWAIT) != 0) {
 		if (error == 0)
 			error = EBADF;
-	} else if (error != 0) {
-		sd->sd_opencnt--;
-		sn->sn_opencnt--;
-		if (vp->v_type == VBLK)
-			sd->sd_bdevvp = NULL;
-
+	} else {
+		KASSERT(!sn->sn_gone);
+		if (error != 0) {
+			if (--sd->sd_opencnt == 0 && sd->sd_opened) {
+				needclose = true;
+				sd->sd_opened = false;
+			}
+			sn->sn_opencnt--;
+			if (vp->v_type == VBLK)
+				sd->sd_bdevvp = NULL;
+		} else {
+			sd->sd_opened = true;
+		}
 	}
+	mutex_exit(vp->v_interlock);
 	mutex_exit(&device_lock);
 
-	if (cdev_type(dev) != D_DISK || error != 0)
+	/*
+	 * If this open failed, but the device was previously opened,
+	 * and another thread concurrently closed the vnode while we
+	 * were in the middle of reopening it, the other thread will
+	 * see sd_opencnt > 0 and thus decide not to call .d_close --
+	 * it is now our responsibility to do so.
+	 *
+	 * XXX The flags passed to VOP_CLOSE here are wrong, but
+	 * drivers can't rely on FREAD|FWRITE anyway -- e.g., consider
+	 * a device opened by thread 0 with O_READ, then opened by
+	 * thread 1 with O_WRITE, then closed by thread 0, and finally
+	 * closed by thread 1; the last .d_close call will have FWRITE
+	 * but not FREAD.  We should just eliminate the FREAD/FWRITE
+	 * parameter to .d_close altogether.
+	 */
+	if (needclose) {
+		KASSERT(error);
+		VOP_CLOSE(vp, FNONBLOCK, NOCRED);
+	}
+
+	/* If anything went wrong, we're done.  */
+	if (error)
 		return error;
 
-	
-	ioctl = vp->v_type == VCHR ? cdev_ioctl : bdev_ioctl;
-	error = (*ioctl)(vp->v_rdev, DIOCGPARTINFO, &pi, FREAD, curlwp);
-	if (error == 0)
-		uvm_vnp_setsize(vp, (voff_t)pi.pi_secsize * pi.pi_size);
+	/*
+	 * For disk devices, automagically set the vnode size to the
+	 * partition size, if we can.  This applies to block devices
+	 * and character devices alike -- every block device must have
+	 * a corresponding character device.  And if the module is
+	 * loaded it will remain loaded until we're done here (it is
+	 * forbidden to devsw_detach until closed).  So it is safe to
+	 * query cdev_type unconditionally here.
+	 */
+	if (cdev_type(dev) == D_DISK) {
+		ioctl = vp->v_type == VCHR ? cdev_ioctl : bdev_ioctl;
+		if ((*ioctl)(dev, DIOCGPARTINFO, &pi, FREAD, curlwp) == 0)
+			uvm_vnp_setsize(vp,
+			    (voff_t)pi.pi_secsize * pi.pi_size);
+	}
 
+	/* Success!  */
 	return 0;
 }
 
@@ -690,6 +947,8 @@ spec_read(void *v)
 	struct vnode *vp = ap->a_vp;
 	struct uio *uio = ap->a_uio;
  	struct lwp *l = curlwp;
+	struct specnode *sn;
+	dev_t dev;
 	struct buf *bp;
 	daddr_t bn;
 	int bsize, bscale;
@@ -712,9 +971,27 @@ spec_read(void *v)
 	switch (vp->v_type) {
 
 	case VCHR:
+		/*
+		 * Release the lock while we sleep -- possibly
+		 * indefinitely, if this is, e.g., a tty -- in
+		 * cdev_read, so we don't hold up everything else that
+		 * might want access to the vnode.
+		 *
+		 * But before we issue the read, take an I/O reference
+		 * to the specnode so close will know when we're done
+		 * writing.  Note that the moment we release the lock,
+		 * the vnode's identity may change; hence spec_io_enter
+		 * may fail, and the caller may have a dead vnode on
+		 * their hands, if the file system on which vp lived
+		 * has been unmounted.
+		 */
 		VOP_UNLOCK(vp);
-		error = cdev_read(vp->v_rdev, uio, ap->a_ioflag);
-		vn_lock(vp, LK_SHARED | LK_RETRY);
+		error = spec_io_enter(vp, &sn, &dev);
+		if (error)
+			goto out;
+		error = cdev_read(dev, uio, ap->a_ioflag);
+		spec_io_exit(vp, sn);
+out:		vn_lock(vp, LK_SHARED | LK_RETRY);
 		return (error);
 
 	case VBLK:
@@ -791,6 +1068,8 @@ spec_write(void *v)
 	struct vnode *vp = ap->a_vp;
 	struct uio *uio = ap->a_uio;
 	struct lwp *l = curlwp;
+	struct specnode *sn;
+	dev_t dev;
 	struct buf *bp;
 	daddr_t bn;
 	int bsize, bscale;
@@ -806,9 +1085,27 @@ spec_write(void *v)
 	switch (vp->v_type) {
 
 	case VCHR:
+		/*
+		 * Release the lock while we sleep -- possibly
+		 * indefinitely, if this is, e.g., a tty -- in
+		 * cdev_write, so we don't hold up everything else that
+		 * might want access to the vnode.
+		 *
+		 * But before we issue the write, take an I/O reference
+		 * to the specnode so close will know when we're done
+		 * writing.  Note that the moment we release the lock,
+		 * the vnode's identity may change; hence spec_io_enter
+		 * may fail, and the caller may have a dead vnode on
+		 * their hands, if the file system on which vp lived
+		 * has been unmounted.
+		 */
 		VOP_UNLOCK(vp);
-		error = cdev_write(vp->v_rdev, uio, ap->a_ioflag);
-		vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
+		error = spec_io_enter(vp, &sn, &dev);
+		if (error)
+			goto out;
+		error = cdev_write(dev, uio, ap->a_ioflag);
+		spec_io_exit(vp, sn);
+out:		vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
 		return (error);
 
 	case VBLK:
@@ -866,21 +1163,13 @@ spec_fdiscard(void *v)
 		off_t a_pos;
 		off_t a_len;
 	} */ *ap = v;
-	struct vnode *vp;
+	struct vnode *vp = ap->a_vp;
 	dev_t dev;
 
-	vp = ap->a_vp;
-	dev = NODEV;
-
-	mutex_enter(vp->v_interlock);
-	if (vdead_check(vp, VDEAD_NOWAIT) == 0 && vp->v_specnode != NULL) {
-		dev = vp->v_rdev;
-	}
-	mutex_exit(vp->v_interlock);
+	KASSERT(VOP_ISLOCKED(vp) == LK_EXCLUSIVE ||
+	    (vp->v_vflag & VV_LOCKSWORK) == 0);
 
-	if (dev == NODEV) {
-		return ENXIO;
-	}
+	dev = vp->v_rdev;
 
 	switch (vp->v_type) {
 	    case VCHR:
@@ -909,40 +1198,32 @@ spec_ioctl(void *v)
 		int  a_fflag;
 		kauth_cred_t a_cred;
 	} */ *ap = v;
-	struct vnode *vp;
+	struct vnode *vp = ap->a_vp;
+	struct specnode *sn;
 	dev_t dev;
+	int error;
 
-	/*
-	 * Extract all the info we need from the vnode, taking care to
-	 * avoid a race with VOP_REVOKE().
-	 */
-
-	vp = ap->a_vp;
-	dev = NODEV;
-	mutex_enter(vp->v_interlock);
-	if (vdead_check(vp, VDEAD_NOWAIT) == 0 && vp->v_specnode) {
-		dev = vp->v_rdev;
-	}
-	mutex_exit(vp->v_interlock);
-	if (dev == NODEV) {
-		return ENXIO;
-	}
+	error = spec_io_enter(vp, &sn, &dev);
+	if (error)
+		return error;
 
 	switch (vp->v_type) {
-
 	case VCHR:
-		return cdev_ioctl(dev, ap->a_command, ap->a_data,
+		error = cdev_ioctl(dev, ap->a_command, ap->a_data,
 		    ap->a_fflag, curlwp);
-
+		break;
 	case VBLK:
 		KASSERT(vp == vp->v_specnode->sn_dev->sd_bdevvp);
-		return bdev_ioctl(dev, ap->a_command, ap->a_data,
+		error = bdev_ioctl(dev, ap->a_command, ap->a_data,
 		   ap->a_fflag, curlwp);
-
+		break;
 	default:
 		panic("spec_ioctl");
 		/* NOTREACHED */
 	}
+
+	spec_io_exit(vp, sn);
+	return error;
 }
 
 /* ARGSUSED */
@@ -953,33 +1234,25 @@ spec_poll(void *v)
 		struct vnode *a_vp;
 		int a_events;
 	} */ *ap = v;
-	struct vnode *vp;
+	struct vnode *vp = ap->a_vp;
+	struct specnode *sn;
 	dev_t dev;
+	int revents;
 
-	/*
-	 * Extract all the info we need from the vnode, taking care to
-	 * avoid a race with VOP_REVOKE().
-	 */
-
-	vp = ap->a_vp;
-	dev = NODEV;
-	mutex_enter(vp->v_interlock);
-	if (vdead_check(vp, VDEAD_NOWAIT) == 0 && vp->v_specnode) {
-		dev = vp->v_rdev;
-	}
-	mutex_exit(vp->v_interlock);
-	if (dev == NODEV) {
+	if (spec_io_enter(vp, &sn, &dev) != 0)
 		return POLLERR;
-	}
 
 	switch (vp->v_type) {
-
 	case VCHR:
-		return cdev_poll(dev, ap->a_events, curlwp);
-
+		revents = cdev_poll(dev, ap->a_events, curlwp);
+		break;
 	default:
-		return (genfs_poll(v));
+		revents = genfs_poll(v);
+		break;
 	}
+
+	spec_io_exit(vp, sn);
+	return revents;
 }
 
 /* ARGSUSED */
@@ -990,20 +1263,30 @@ spec_kqfilter(void *v)
 		struct vnode	*a_vp;
 		struct proc	*a_kn;
 	} */ *ap = v;
+	struct vnode *vp = ap->a_vp;
+	struct specnode *sn;
 	dev_t dev;
+	int error;
 
-	switch (ap->a_vp->v_type) {
+	error = spec_io_enter(vp, &sn, &dev);
+	if (error)
+		return error;
 
+	switch (vp->v_type) {
 	case VCHR:
-		dev = ap->a_vp->v_rdev;
-		return cdev_kqfilter(dev, ap->a_kn);
+		error = cdev_kqfilter(dev, ap->a_kn);
+		break;
 	default:
 		/*
 		 * Block devices don't support kqfilter, and refuse it
 		 * for any other files (like those vflush()ed) too.
 		 */
-		return (EOPNOTSUPP);
+		error = EOPNOTSUPP;
+		break;
 	}
+
+	spec_io_exit(vp, sn);
+	return error;
 }
 
 /*
@@ -1018,11 +1301,19 @@ spec_mmap(void *v)
 		kauth_cred_t a_cred;
 	} */ *ap = v;
 	struct vnode *vp = ap->a_vp;
+	struct specnode *sn;
+	dev_t dev;
+	int error;
 
 	KASSERT(vp->v_type == VBLK);
-	if (bdev_type(vp->v_rdev) != D_DISK)
-		return EINVAL;
 
+	error = spec_io_enter(vp, &sn, &dev);
+	if (error)
+		return error;
+
+	error = bdev_type(dev) == D_DISK ? 0 : EINVAL;
+
+	spec_io_exit(vp, sn);
 	return 0;
 }
 
@@ -1067,27 +1358,14 @@ spec_strategy(void *v)
 	} */ *ap = v;
 	struct vnode *vp = ap->a_vp;
 	struct buf *bp = ap->a_bp;
+	struct specnode *sn = NULL;
 	dev_t dev;
 	int error;
 
-	dev = NODEV;
-
-	/*
-	 * Extract all the info we need from the vnode, taking care to
-	 * avoid a race with VOP_REVOKE().
-	 */
-
-	mutex_enter(vp->v_interlock);
-	if (vdead_check(vp, VDEAD_NOWAIT) == 0 && vp->v_specnode != NULL) {
-		KASSERT(vp == vp->v_specnode->sn_dev->sd_bdevvp);
-		dev = vp->v_rdev;
-	}
-	mutex_exit(vp->v_interlock);
-
-	if (dev == NODEV) {
-		error = ENXIO;
+	error = spec_io_enter(vp, &sn, &dev);
+	if (error)
 		goto out;
-	}
+
 	bp->b_dev = dev;
 
 	if (!(bp->b_flags & B_READ)) {
@@ -1107,13 +1385,15 @@ spec_strategy(void *v)
 	}
 	bdev_strategy(bp);
 
-	return 0;
-
-out:
-	bp->b_error = error;
-	bp->b_resid = bp->b_bcount;
-	biodone(bp);
+	error = 0;
 
+out:	if (sn)
+		spec_io_exit(vp, sn);
+	if (error) {
+		bp->b_error = error;
+		bp->b_resid = bp->b_bcount;
+		biodone(bp);
+	}
 	return error;
 }
 
@@ -1139,6 +1419,9 @@ spec_reclaim(void *v)
 	} */ *ap = v;
 	struct vnode *vp = ap->a_vp;
 
+	KASSERT(vp->v_specnode->sn_opencnt == 0);
+	KASSERT(vp->v_specnode->sn_iocnt == 0);
+
 	VOP_UNLOCK(vp);
 
 	KASSERT(vp->v_mount == dead_rootmount);
@@ -1182,14 +1465,18 @@ spec_close(void *v)
 	} */ *ap = v;
 	struct vnode *vp = ap->a_vp;
 	struct session *sess;
-	dev_t dev = vp->v_rdev;
+	dev_t dev, dev1;
 	int flags = ap->a_fflag;
 	int mode, error, count;
-	specnode_t *sn;
+	specnode_t *sn, *sn1;
 	specdev_t *sd;
 
+	KASSERT(VOP_ISLOCKED(vp) == LK_EXCLUSIVE ||
+	    (vp->v_vflag & VV_LOCKSWORK) == 0);
+
 	mutex_enter(vp->v_interlock);
 	sn = vp->v_specnode;
+	dev = vp->v_rdev;
 	sd = sn->sn_dev;
 	/*
 	 * If we're going away soon, make this non-blocking.
@@ -1269,14 +1556,44 @@ spec_close(void *v)
 		panic("spec_close: not special");
 	}
 
+	/*
+	 * Decrement the open reference count of this node and the
+	 * device.  For block devices, the open reference count must be
+	 * 1 at this point.  If the device's open reference count goes
+	 * to zero, we're the last one out so get the lights.
+	 *
+	 * We may find --sd->sd_opencnt gives zero, and yet
+	 * sd->sd_opened is false.  This happens if the vnode is
+	 * revoked at the same time as it is being opened, which can
+	 * happen when opening a tty blocks indefinitely.  In that
+	 * case, we still must call close -- it is the job of close to
+	 * interrupt the open.  Either way, the device will be no
+	 * longer opened, so we have to clear sd->sd_opened; subsequent
+	 * opens will have responsibility for issuing close.
+	 *
+	 * This has the side effect that the sequence of opens might
+	 * happen out of order -- we might end up doing open, open,
+	 * close, close, instead of open, close, open, close.  This is
+	 * unavoidable with the current devsw API, where open is
+	 * allowed to block and close must be able to run concurrently
+	 * to interrupt it.  It is the driver's responsibility to
+	 * ensure that close is idempotent so that this works.  Drivers
+	 * requiring per-open state and exact 1:1 correspondence
+	 * between open and close can use fd_clone.
+	 */
 	mutex_enter(&device_lock);
 	sn->sn_opencnt--;
 	count = --sd->sd_opencnt;
-	if (vp->v_type == VBLK)
+	if (vp->v_type == VBLK) {
+		KASSERTMSG(count == 0, "block device with %u opens",
+		    count + 1);
 		sd->sd_bdevvp = NULL;
+	}
+	if (count == 0)
+		sd->sd_opened = false;
 	mutex_exit(&device_lock);
 
-	if (count != 0 && (vp->v_type != VCHR || !(cdev_flags(dev) & D_MCLOSE)))
+	if (count != 0)
 		return 0;
 
 	/*
@@ -1284,16 +1601,28 @@ spec_close(void *v)
 	 * might end up sleeping for someone else who wants our queues. They
 	 * won't get them if we hold the vnode locked.
 	 */
-	if (!(flags & FNONBLOCK))
+	if (!(flags & FNONBLOCK)) {
+		/*
+		 * Take an I/O reference while we hold the vnode lock.
+		 * This ensures that spec_node_revoke waits for the
+		 * last close.
+		 */
+		error = spec_io_enter(vp, &sn1, &dev1);
+		KASSERTMSG(error == 0, "error=%d", error);
+		KASSERT(sn1 == sn);
+		KASSERT(dev1 == dev);
 		VOP_UNLOCK(vp);
+	}
 
 	if (vp->v_type == VBLK)
 		error = bdev_close(dev, flags, mode, curlwp);
 	else
 		error = cdev_close(dev, flags, mode, curlwp);
 
-	if (!(flags & FNONBLOCK))
+	if (!(flags & FNONBLOCK)) {
+		spec_io_exit(vp, sn1);
 		vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
+	}
 
 	return (error);
 }
diff --git a/sys/miscfs/specfs/specdev.h b/sys/miscfs/specfs/specdev.h
index 621c103e80b1..1abb52964b50 100644
--- a/sys/miscfs/specfs/specdev.h
+++ b/sys/miscfs/specfs/specdev.h
@@ -69,6 +69,7 @@ typedef struct specnode {
 	u_int		sn_opencnt;
 	dev_t		sn_rdev;
 	bool		sn_gone;
+	volatile u_int	sn_iocnt;
 } specnode_t;
 
 typedef struct specdev {
@@ -78,6 +79,7 @@ typedef struct specdev {
 	u_int		sd_opencnt;
 	u_int		sd_refcnt;
 	dev_t		sd_rdev;
+	bool		sd_opened;
 } specdev_t;
 
 /*
diff --git a/sys/modules/examples/pollpal/pollpal.c b/sys/modules/examples/pollpal/pollpal.c
index b76e0733699b..d8ddc73450e0 100644
--- a/sys/modules/examples/pollpal/pollpal.c
+++ b/sys/modules/examples/pollpal/pollpal.c
@@ -311,7 +311,8 @@ pollpal_modcmd(modcmd_t cmd, void *arg __unused)
 	case MODULE_CMD_FINI:
 		if (pollpal_nopen != 0)
 			return EBUSY;
-		return devsw_detach(NULL, &pollpal_cdevsw);
+		devsw_detach(NULL, &pollpal_cdevsw);
+		return 0;
 	default:
 		return ENOTTY;
 	} 
diff --git a/sys/net/if_tap.c b/sys/net/if_tap.c
index 0b57ad4a711c..314f4647707c 100644
--- a/sys/net/if_tap.c
+++ b/sys/net/if_tap.c
@@ -256,9 +256,7 @@ tapdetach(void)
 
 	if_clone_detach(&tap_cloners);
 #ifdef _MODULE
-	error = devsw_detach(NULL, &tap_cdevsw);
-	if (error != 0)
-		goto out2;
+	devsw_detach(NULL, &tap_cdevsw);
 #endif
 
 	if (tap_count != 0) {
@@ -277,7 +275,6 @@ tapdetach(void)
  out1:
 #ifdef _MODULE
 	devsw_attach("tap", NULL, &tap_bmajor, &tap_cdevsw, &tap_cmajor);
- out2:
 #endif
 	if_clone_attach(&tap_cloners);
 
diff --git a/sys/net/if_tun.c b/sys/net/if_tun.c
index 4f533a8f08d1..f4e5b6d86d43 100644
--- a/sys/net/if_tun.c
+++ b/sys/net/if_tun.c
@@ -142,17 +142,10 @@ tuninit(void)
 static int
 tundetach(void)
 {
-#ifdef _MODULE
-	int error;
-#endif
 
 	if_clone_detach(&tun_cloner);
 #ifdef _MODULE
-	error = devsw_detach(NULL, &tun_cdevsw);
-	if (error != 0) {
-		if_clone_attach(&tun_cloner);
-		return error;
-	}
+	devsw_detach(NULL, &tun_cdevsw);
 #endif
 
 	if (!LIST_EMPTY(&tun_softc_list) || !LIST_EMPTY(&tunz_softc_list)) {
diff --git a/sys/rump/dev/lib/libbpf/bpf_component.c b/sys/rump/dev/lib/libbpf/bpf_component.c
index 05807d371d40..d41d1987afe8 100644
--- a/sys/rump/dev/lib/libbpf/bpf_component.c
+++ b/sys/rump/dev/lib/libbpf/bpf_component.c
@@ -50,6 +50,5 @@ RUMP_COMPONENT(RUMP_COMPONENT_NET)
 		panic("bpf devsw attach failed: %d", error);
 	if ((error = rump_vfs_makeonedevnode(S_IFCHR, "/dev/bpf", cmaj, 0)) !=0)
 		panic("cannot create bpf device nodes: %d", error);
-	if ((error = devsw_detach(NULL, &bpf_cdevsw)) != 0)
-		panic("cannot detach bpf devsw: %d", error);
+	devsw_detach(NULL, &bpf_cdevsw);
 }
diff --git a/sys/rump/dev/lib/libdrvctl/drvctl_component.c b/sys/rump/dev/lib/libdrvctl/drvctl_component.c
index e2e79f45f9de..ac4e103fdb9c 100644
--- a/sys/rump/dev/lib/libdrvctl/drvctl_component.c
+++ b/sys/rump/dev/lib/libdrvctl/drvctl_component.c
@@ -51,7 +51,5 @@ RUMP_COMPONENT(RUMP_COMPONENT_DEV)
 	if ( error !=0)
 		panic("cannot create drvctl device node: %d", error);
 
-	error = devsw_detach(NULL, &drvctl_cdevsw);
-	if (error != 0)
-		panic("cannot detach drvctl devsw: %d", error);
+	devsw_detach(NULL, &drvctl_cdevsw);
 }
diff --git a/sys/sys/conf.h b/sys/sys/conf.h
index 081631d2111f..16dd87e5480c 100644
--- a/sys/sys/conf.h
+++ b/sys/sys/conf.h
@@ -63,7 +63,7 @@ struct vnode;
 #define	D_TYPEMASK	0x00ff
 #define	D_MPSAFE	0x0100
 #define	D_NEGOFFSAFE	0x0200
-#define	D_MCLOSE	0x0400
+#define	D_UNUSED0	0x0400	/* was D_MCLOSE */
 
 /*
  * Block device switch table
@@ -76,6 +76,8 @@ struct bdevsw {
 	int		(*d_dump)(dev_t, daddr_t, void *, size_t);
 	int		(*d_psize)(dev_t);
 	int		(*d_discard)(dev_t, off_t, off_t);
+	int		(*d_devtounit)(dev_t);
+	struct cfdriver	*d_cfdriver;
 	int		d_flag;
 };
 
@@ -94,6 +96,8 @@ struct cdevsw {
 	paddr_t		(*d_mmap)(dev_t, off_t, int);
 	int		(*d_kqfilter)(dev_t, struct knote *);
 	int		(*d_discard)(dev_t, off_t, off_t);
+	int		(*d_devtounit)(dev_t);
+	struct cfdriver	*d_cfdriver;
 	int		d_flag;
 };
 
@@ -104,7 +108,7 @@ extern kmutex_t device_lock;
 
 int devsw_attach(const char *, const struct bdevsw *, devmajor_t *,
 		 const struct cdevsw *, devmajor_t *);
-int devsw_detach(const struct bdevsw *, const struct cdevsw *);
+void devsw_detach(const struct bdevsw *, const struct cdevsw *);
 const struct bdevsw *bdevsw_lookup(dev_t);
 const struct cdevsw *cdevsw_lookup(dev_t);
 devmajor_t bdevsw_lookup_major(const struct bdevsw *);
@@ -276,6 +280,7 @@ devmajor_t devsw_name2blk(const char *, char *, size_t);
 devmajor_t devsw_name2chr(const char *, char *, size_t);
 dev_t devsw_chr2blk(dev_t);
 dev_t devsw_blk2chr(dev_t);
+int dev_minor_unit(dev_t);
 
 void mm_init(void);
 #endif /* _KERNEL */
diff --git a/sys/sys/device.h b/sys/sys/device.h
index 3bd4a6c3abf7..e685419d4925 100644
--- a/sys/sys/device.h
+++ b/sys/sys/device.h
@@ -274,10 +274,12 @@ struct device {
 	void		*dv_private;	/* this device's private storage */
 	int		*dv_locators;	/* our actual locators (optional) */
 	prop_dictionary_t dv_properties;/* properties dictionary */
+	struct localcount *dv_localcount;/* reference count */
 
 	int		dv_pending;	/* config_pending count */
 	TAILQ_ENTRY(device) dv_pending_list;
 
+	struct lwp	*dv_attaching;	/* thread not yet finished in attach */
 	struct lwp	*dv_detaching;	/* detach lock (config_misc_lock/cv) */
 
 	size_t		dv_activity_count;
@@ -651,6 +653,10 @@ void	null_childdetached(device_t, device_t);
 
 device_t	device_lookup(cfdriver_t, int);
 void		*device_lookup_private(cfdriver_t, int);
+
+device_t	device_lookup_acquire(cfdriver_t, int);
+void		device_release(device_t);
+
 void		device_register(device_t, void *);
 void		device_register_post_config(device_t, void *);
 
diff --git a/sys/sys/disklabel.h b/sys/sys/disklabel.h
index 4e94b8671332..853cdbe668a3 100644
--- a/sys/sys/disklabel.h
+++ b/sys/sys/disklabel.h
@@ -509,6 +509,7 @@ const char *convertdisklabel(struct disklabel *, void (*)(struct buf *),
 int	 bounds_check_with_label(struct disk *, struct buf *, int);
 int	 bounds_check_with_mediasize(struct buf *, int, uint64_t);
 const char *getfstypename(int);
+int	disklabel_dev_unit(dev_t);
 #endif
 #endif /* _LOCORE */
 
>From ef9b408b26eafca5b9a984ebd9ccede62db38881 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Sun, 23 Jan 2022 13:46:26 +0000
Subject: [PATCH] ucom(4): Rework open/close/attach/detach logic.

- Defer sleep after hangup until open.

  No need to make close hang; we just need to make sure some time has
  passed before we next try to open.

  This changes the wchan for the sleep.  Oh well.

- Use .d_cfdriver/devtounit/cancel to

- Use a separate .sc_closing flag instead of a UCOM_CLOSING state.
  ucomcancel/ucomclose owns this flag, and it may be set in any state
  (except UCOM_DEAD).  UCOM_OPENING remains owned by ucomopen, which
  might be interrupted by cancel/close.

- Rework error branches in ucomopen.  Much simpler this way.

- Nix unnecessary reference counting.
---
 sys/dev/usb/ucom.c | 569 ++++++++++++++++++---------------------------
 1 file changed, 229 insertions(+), 340 deletions(-)

diff --git a/sys/dev/usb/ucom.c b/sys/dev/usb/ucom.c
index 2dd1a141b19e..b1fdb93e421c 100644
--- a/sys/dev/usb/ucom.c
+++ b/sys/dev/usb/ucom.c
@@ -180,11 +180,10 @@ struct ucom_softc {
 	    UCOM_DEAD,
 	    UCOM_ATTACHED,
 	    UCOM_OPENING,
-	    UCOM_CLOSING,
 	    UCOM_OPEN
 	}			sc_state;
-	int			sc_refcnt;
-	bool			sc_dying;	/* disconnecting */
+	bool			sc_closing;	/* software is closing */
+	bool			sc_dying;	/* hardware is gone */
 
 	struct pps_state	sc_pps_state;	/* pps state */
 
@@ -192,10 +191,12 @@ struct ucom_softc {
 
 	kmutex_t		sc_lock;
 	kcondvar_t		sc_statecv;
-	kcondvar_t		sc_detachcv;
+
+	struct timeval		sc_hup_time;
 };
 
 dev_type_open(ucomopen);
+dev_type_cancel(ucomcancel);
 dev_type_close(ucomclose);
 dev_type_read(ucomread);
 dev_type_write(ucomwrite);
@@ -206,6 +207,7 @@ dev_type_poll(ucompoll);
 
 const struct cdevsw ucom_cdevsw = {
 	.d_open = ucomopen,
+	.d_cancel = ucomcancel,
 	.d_close = ucomclose,
 	.d_read = ucomread,
 	.d_write = ucomwrite,
@@ -216,16 +218,16 @@ const struct cdevsw ucom_cdevsw = {
 	.d_mmap = nommap,
 	.d_kqfilter = ttykqfilter,
 	.d_discard = nodiscard,
+	.d_cfdriver = &ucom_cd,
+	.d_devtounit = dev_minor_unit,
 	.d_flag = D_TTY | D_MPSAFE
 };
 
-static void	ucom_cleanup(struct ucom_softc *);
+static void	ucom_cleanup(struct ucom_softc *, int);
 static int	ucomparam(struct tty *, struct termios *);
 static int	ucomhwiflow(struct tty *, int);
 static void	ucomstart(struct tty *);
 static void	ucom_shutdown(struct ucom_softc *);
-static int	ucom_do_ioctl(struct ucom_softc *, u_long, void *,
-			      int, struct lwp *);
 static void	ucom_dtr(struct ucom_softc *, int);
 static void	ucom_rts(struct ucom_softc *, int);
 static void	ucom_break(struct ucom_softc *, int);
@@ -288,14 +290,13 @@ ucom_attach(device_t parent, device_t self, void *aux)
 	sc->sc_mcr = 0;
 	sc->sc_tx_stopped = 0;
 	sc->sc_swflags = 0;
-	sc->sc_refcnt = 0;
+	sc->sc_closing = false;
 	sc->sc_dying = false;
 	sc->sc_state = UCOM_DEAD;
 
 	sc->sc_si = softint_establish(SOFTINT_USB, ucom_softintr, sc);
 	mutex_init(&sc->sc_lock, MUTEX_DEFAULT, IPL_SOFTUSB);
 	cv_init(&sc->sc_statecv, "ucomstate");
-	cv_init(&sc->sc_detachcv, "ucomdtch");
 
 	SIMPLEQ_INIT(&sc->sc_ibuff_empty);
 	SIMPLEQ_INIT(&sc->sc_ibuff_full);
@@ -366,7 +367,6 @@ ucom_attach(device_t parent, device_t self, void *aux)
 		aprint_error_dev(self, "couldn't establish power handler\n");
 
 	sc->sc_state = UCOM_ATTACHED;
-
 	return;
 
 fail_2:
@@ -375,8 +375,8 @@ fail_2:
 		if (ub->ub_xfer)
 			usbd_destroy_xfer(ub->ub_xfer);
 	}
-
 	usbd_close_pipe(sc->sc_bulkout_pipe);
+	sc->sc_bulkout_pipe = NULL;
 
 fail_1:
 	for (ub = &sc->sc_ibuff[0]; ub != &sc->sc_ibuff[UCOM_IN_BUFFS];
@@ -385,11 +385,10 @@ fail_1:
 			usbd_destroy_xfer(ub->ub_xfer);
 	}
 	usbd_close_pipe(sc->sc_bulkin_pipe);
+	sc->sc_bulkin_pipe = NULL;
 
 fail_0:
 	aprint_error_dev(self, "attach failed, error=%d\n", error);
-
-	return;
 }
 
 int
@@ -406,49 +405,21 @@ ucom_detach(device_t self, int flags)
 	    (uintptr_t)tp, 0);
 	DPRINTF("... pipe=%jd,%jd", sc->sc_bulkin_no, sc->sc_bulkout_no, 0, 0);
 
+	/* Prevent new opens from hanging.  */
 	mutex_enter(&sc->sc_lock);
 	sc->sc_dying = true;
 	mutex_exit(&sc->sc_lock);
 
 	pmf_device_deregister(self);
 
-	if (sc->sc_bulkin_pipe != NULL) {
-		usbd_abort_pipe(sc->sc_bulkin_pipe);
-	}
-	if (sc->sc_bulkout_pipe != NULL) {
-		usbd_abort_pipe(sc->sc_bulkout_pipe);
-	}
-
-	mutex_enter(&sc->sc_lock);
-
-	/* wait for open/close to finish */
-	while (sc->sc_state == UCOM_OPENING || sc->sc_state == UCOM_CLOSING) {
-		int error = cv_wait_sig(&sc->sc_statecv, &sc->sc_lock);
-
-		if (error) {
-			mutex_exit(&sc->sc_lock);
-			return error;
-		}
-	}
-
-	sc->sc_refcnt--;
-	while (sc->sc_refcnt >= 0) {
-		/* Wake up anyone waiting */
-		if (tp != NULL) {
-			mutex_spin_enter(&tty_lock);
-			CLR(tp->t_state, TS_CARR_ON);
-			CLR(tp->t_cflag, CLOCAL | MDMBUF);
-			ttyflush(tp, FREAD|FWRITE);
-			mutex_spin_exit(&tty_lock);
-		}
-		/* Wait for processes to go away. */
-		if (cv_timedwait(&sc->sc_detachcv, &sc->sc_lock, hz * 60))
-			aprint_error_dev(self, ": didn't detach\n");
+	/* tty is now off.  */
+	if (tp != NULL) {
+		mutex_spin_enter(&tty_lock);
+		CLR(tp->t_state, TS_CARR_ON);
+		CLR(tp->t_cflag, CLOCAL | MDMBUF);
+		mutex_spin_exit(&tty_lock);
 	}
 
-	softint_disestablish(sc->sc_si);
-	mutex_exit(&sc->sc_lock);
-
 	/* locate the major number */
 	maj = cdevsw_lookup_major(&ucom_cdevsw);
 
@@ -459,6 +430,8 @@ ucom_detach(device_t self, int flags)
 	vdevgone(maj, mn | UCOMDIALOUT_MASK, mn | UCOMDIALOUT_MASK, VCHR);
 	vdevgone(maj, mn | UCOMCALLUNIT_MASK, mn | UCOMCALLUNIT_MASK, VCHR);
 
+	softint_disestablish(sc->sc_si);
+
 	/* Detach and free the tty. */
 	if (tp != NULL) {
 		tty_detach(tp);
@@ -491,7 +464,6 @@ ucom_detach(device_t self, int flags)
 
 	mutex_destroy(&sc->sc_lock);
 	cv_destroy(&sc->sc_statecv);
-	cv_destroy(&sc->sc_detachcv);
 
 	return 0;
 }
@@ -509,11 +481,23 @@ ucom_shutdown(struct ucom_softc *sc)
 	 */
 	if (ISSET(tp->t_cflag, HUPCL)) {
 		ucom_dtr(sc, 0);
-		/* XXX will only timeout */
-		(void) kpause(ttclos, false, hz, NULL);
+		mutex_enter(&sc->sc_lock);
+		microuptime(&sc->sc_hup_time);
+		sc->sc_hup_time.tv_sec++;
+		mutex_exit(&sc->sc_lock);
 	}
 }
 
+/*
+ * ucomopen(dev, flag, mode, l)
+ *
+ *	Called when anyone tries to open /dev/ttyU? for an existing
+ *	ucom? instance that has completed attach.  The attach may have
+ *	failed, though, or there may be concurrent detach or close in
+ *	progress, so fail if attach failed (no sc_tty) or detach has
+ *	begun (sc_dying), or wait if there's a concurrent close in
+ *	progress before reopening.
+ */
 int
 ucomopen(dev_t dev, int flag, int mode, struct lwp *l)
 {
@@ -523,14 +507,11 @@ ucomopen(dev_t dev, int flag, int mode, struct lwp *l)
 
 	UCOMHIST_FUNC(); UCOMHIST_CALLED();
 
-	if (sc == NULL)
-		return ENXIO;
-
 	mutex_enter(&sc->sc_lock);
 	if (sc->sc_dying) {
 		DPRINTF("... dying", 0, 0, 0, 0);
 		mutex_exit(&sc->sc_lock);
-		return EIO;
+		return ENXIO;
 	}
 
 	if (!device_is_active(sc->sc_dev)) {
@@ -539,6 +520,11 @@ ucomopen(dev_t dev, int flag, int mode, struct lwp *l)
 	}
 
 	struct tty *tp = sc->sc_tty;
+	if (tp == NULL) {
+		DPRINTF("... not attached", 0, 0, 0, 0);
+		mutex_exit(&sc->sc_lock);
+		return ENXIO;
+	}
 
 	DPRINTF("unit=%jd, tp=%#jx", unit, (uintptr_t)tp, 0, 0);
 
@@ -548,40 +534,69 @@ ucomopen(dev_t dev, int flag, int mode, struct lwp *l)
 	}
 
 	/*
-	 * Wait while the device is initialized by the
-	 * first opener or cleaned up by the last closer.
+	 * If the previous use had set DTR on close, wait up to one
+	 * second for the other side to notice we hung up.  After
+	 * sleeping, the tty may have been revoked, so restart the
+	 * whole operation.
+	 *
+	 * XXX The wchan is not ttclose but maybe should be.
 	 */
-	while (sc->sc_state == UCOM_OPENING || sc->sc_state == UCOM_CLOSING) {
-		error = cv_wait_sig(&sc->sc_statecv, &sc->sc_lock);
-
-		if (sc->sc_dying)
-			error = EIO;
-
-		if (error) {
+	if (timerisset(&sc->sc_hup_time)) {
+		struct timeval now, delta;
+		int ms, ticks;
+
+		microuptime(&now);
+		if (timercmp(&now, &sc->sc_hup_time, >=)) {
+			/* good to go */
+		} else if (flag & FNONBLOCK) {
 			mutex_exit(&sc->sc_lock);
-			return error;
+			return EWOULDBLOCK;
+		} else {
+			timersub(&sc->sc_hup_time, &now, &delta);
+			ms = MIN(INT_MAX - 1000, delta.tv_sec*1000);
+			ms += howmany(delta.tv_usec, 1000);
+			ticks = MAX(1, MIN(INT_MAX, mstohz(ms)));
+			error = cv_timedwait_sig(&sc->sc_statecv, &sc->sc_lock,
+			    ticks);
+			mutex_exit(&sc->sc_lock);
+			return error ? error : ERESTART;
 		}
 	}
-	enum ucom_state state = sc->sc_state;
 
+	/*
+	 * Wait while the device is initialized by the
+	 * first opener or cleaned up by the last closer.
+	 */
+	enum ucom_state state = sc->sc_state;
+	if (state == UCOM_OPENING || sc->sc_closing) {
+		if (flag & FNONBLOCK)
+			error = EWOULDBLOCK;
+		else
+			error = cv_wait_sig(&sc->sc_statecv, &sc->sc_lock);
+		mutex_exit(&sc->sc_lock);
+		return error ? error : ERESTART;
+	}
 	KASSERTMSG(state == UCOM_OPEN || state == UCOM_ATTACHED,
 	    "state is %d", state);
 
-	bool cleanup = false;
-	/* If this is the first open then perform the initialisation */
-	if (!ISSET(tp->t_state, TS_ISOPEN) && tp->t_wopen == 0) {
-		KASSERT(state == UCOM_ATTACHED);
+	/*
+	 * If this is the first open, then make sure the pipes are
+	 * running and perform any initialization needed.
+	 */
+	bool firstopen = (state == UCOM_ATTACHED);
+	if (firstopen) {
+		KASSERT(!ISSET(tp->t_state, TS_ISOPEN));
+		KASSERT(tp->t_wopen == 0);
+
 		tp->t_dev = dev;
-		cleanup = true;
 		sc->sc_state = UCOM_OPENING;
-
 		mutex_exit(&sc->sc_lock);
+
 		if (sc->sc_methods->ucom_open != NULL) {
 			error = sc->sc_methods->ucom_open(sc->sc_parent,
 			    sc->sc_portno);
-			if (error) {
-				goto fail_2;
-			}
+			if (error)
+				goto bad;
 		}
 
 		ucom_status_change(sc);
@@ -628,12 +643,6 @@ ucomopen(dev_t dev, int flag, int mode, struct lwp *l)
 		ucom_rts(sc, 1);
 
 		mutex_enter(&sc->sc_lock);
-		if (sc->sc_dying) {
-			DPRINTF("... dying", 0, 0, 0, 0);
-			error = EIO;
-			goto fail_1;
-		}
-
 		sc->sc_rx_unblock = 0;
 		sc->sc_rx_stopped = 0;
 		sc->sc_tx_stopped = 0;
@@ -643,12 +652,10 @@ ucomopen(dev_t dev, int flag, int mode, struct lwp *l)
 			error = ucomsubmitread(sc, ub);
 			if (error) {
 		    		mutex_exit(&sc->sc_lock);
-				goto fail_2;
+				goto bad;
 			}
 		}
 	}
-	sc->sc_state = UCOM_OPEN;
-	cv_signal(&sc->sc_statecv);
 	mutex_exit(&sc->sc_lock);
 
 	DPRINTF("unit=%jd, tp=%#jx dialout %jd nonblock %jd", unit,
@@ -661,99 +668,140 @@ ucomopen(dev_t dev, int flag, int mode, struct lwp *l)
 	if (error)
 		goto bad;
 
+	/*
+	 * Success!  If this was the first open, notify waiters that
+	 * the tty is open for business.
+	 */
+	if (firstopen) {
+		mutex_enter(&sc->sc_lock);
+		KASSERT(sc->sc_state == UCOM_OPENING);
+		sc->sc_state = UCOM_OPEN;
+		cv_broadcast(&sc->sc_statecv);
+		mutex_exit(&sc->sc_lock);
+	}
 	return 0;
 
 bad:
-	cleanup = !ISSET(tp->t_state, TS_ISOPEN) && tp->t_wopen == 0;
+	/*
+	 * Failure!  If this was the first open, hang up, abort pipes,
+	 * and notify waiters that we're not opening after all.
+	 */
+	if (firstopen) {
+		ucom_cleanup(sc, flag);
 
-fail_2:
-	if (cleanup) {
-		/*
-		 * We failed to open the device, and nobody else had
-		 * it opened.  Clean up the state as appropriate.
-		 */
-		ucom_cleanup(sc);
+		mutex_enter(&sc->sc_lock);
+		KASSERT(sc->sc_state == UCOM_OPENING);
+		sc->sc_state = UCOM_ATTACHED;
+		cv_broadcast(&sc->sc_statecv);
+		mutex_exit(&sc->sc_lock);
 	}
-
-	mutex_enter(&sc->sc_lock);
-
-fail_1:
-	sc->sc_state = state;
-	cv_signal(&sc->sc_statecv);
-	mutex_exit(&sc->sc_lock);
-
 	return error;
 }
 
+/*
+ * ucomcancel(dev, flag, mode, l)
+ *
+ *	Called on revoke or last close.  Must interrupt any pending I/O
+ *	operations and make them fail promptly; once they have all
+ *	finished (except possibly new opens), ucomclose will be called.
+ *	We set sc_closing to block new opens until ucomclose runs.
+ */
 int
-ucomclose(dev_t dev, int flag, int mode, struct lwp *l)
+ucomcancel(dev_t dev, int flag, int mode, struct lwp *l)
 {
 	const int unit = UCOMUNIT(dev);
 	struct ucom_softc *sc = device_lookup_private(&ucom_cd, unit);
-	int error = 0;
 
 	UCOMHIST_FUNC(); UCOMHIST_CALLED();
 
 	DPRINTF("unit=%jd", UCOMUNIT(dev), 0, 0, 0);
 
+	/*
+	 * This can run at any time before ucomclose on any device
+	 * node, even if never attached or if attach failed, so we may
+	 * not have a softc or a tty.
+	 */
 	if (sc == NULL)
 		return 0;
+	struct tty *tp = sc->sc_tty;
+	if (tp == NULL)
+		return 0;
 
+	/*
+	 * Mark the device closing so opens block until we're done
+	 * closing.  Wake them up so they start over at the top -- if
+	 * we're closing because we're detaching, they need to wake up
+	 * and notice it's time to fail.
+	 */
 	mutex_enter(&sc->sc_lock);
-	if (sc->sc_dying) {
-		DPRINTF("... dying", 0, 0, 0, 0);
-		mutex_exit(&sc->sc_lock);
-		return ENXIO;
-	}
+	sc->sc_closing = true;
+	cv_broadcast(&sc->sc_statecv);
+	mutex_exit(&sc->sc_lock);
 
 	/*
-	 * Wait until any opens/closes have finished
+	 * Close the tty, causing anyone waiting for it to wake, and
+	 * hang up the phone.
 	 */
-	while (sc->sc_state == UCOM_OPENING || sc->sc_state == UCOM_CLOSING) {
-		error = cv_wait_sig(&sc->sc_statecv, &sc->sc_lock);
+	ucom_cleanup(sc, flag);
 
-		if (sc->sc_dying)
-			error = EIO;
+	return 0;
+}
 
-		if (error) {
-			mutex_exit(&sc->sc_lock);
-			return error;
-		}
-	}
+/*
+ * ucomclose(dev, flag, mode, l)
+ *
+ *	Called after ucomcancel, when all prior operations on the /dev
+ *	node have completed.  Only new opens may be in progress at this
+ *	point, but they will block until sc_closing is set to false.
+ */
+int
+ucomclose(dev_t dev, int flag, int mode, struct lwp *l)
+{
+	const int unit = UCOMUNIT(dev);
+	struct ucom_softc *sc = device_lookup_private(&ucom_cd, unit);
+	int error = 0;
 
-	struct tty *tp = sc->sc_tty;
+	UCOMHIST_FUNC(); UCOMHIST_CALLED();
 
-	if (!ISSET(tp->t_state, TS_ISOPEN)) {
-		KASSERT(sc->sc_state == UCOM_ATTACHED);
-		mutex_exit(&sc->sc_lock);
-		return 0;
-	}
+	DPRINTF("unit=%jd", UCOMUNIT(dev), 0, 0, 0);
 
-	KASSERT(sc->sc_state == UCOM_OPEN);
-	sc->sc_state = UCOM_CLOSING;
-	mutex_exit(&sc->sc_lock);
+	/*
+	 * This can run at any time after ucomcancel on any device
+	 * node, even if never attached or if attach failed, so we may
+	 * not have a softc or a tty.
+	 */
+	if (sc == NULL)
+		return 0;
+	struct tty *tp = sc->sc_tty;
+	if (tp == NULL)
+		return 0;
 
-	(*tp->t_linesw->l_close)(tp, flag);
-	ttyclose(tp);
+	/*
+	 * ttyclose should have cleared TS_ISOPEN and interrupted all
+	 * pending opens, which should have completed by now.
+	 */
+	mutex_spin_enter(&tty_lock);
+	KASSERT(!ISSET(tp->t_state, TS_ISOPEN));
+	KASSERT(tp->t_wopen == 0);
+	mutex_spin_exit(&tty_lock);
 
-	/* state when we're done - default to open */
-	enum ucom_state state = UCOM_OPEN;
-	if (!ISSET(tp->t_state, TS_ISOPEN) && tp->t_wopen == 0) {
-		/*
-		 * Although we got a last close, the device may still be in
-		 * use; e.g. if this was the dialout node, and there are still
-		 * processes waiting for carrier on the non-dialout node.
-		 */
-		ucom_cleanup(sc);
-		if (sc->sc_methods->ucom_close != NULL)
-			sc->sc_methods->ucom_close(sc->sc_parent,
-			    sc->sc_portno);
-		state = UCOM_ATTACHED;
+	/*
+	 * Close any device-specific state.
+	 */
+	if (sc->sc_methods->ucom_close != NULL) {
+		sc->sc_methods->ucom_close(sc->sc_parent,
+		    sc->sc_portno);
 	}
 
+	/*
+	 * We're now closed.  Can reopen after this point, so resume
+	 * transfers, mark us no longer closing, and notify anyone
+	 * waiting in open.
+	 */
 	mutex_enter(&sc->sc_lock);
-	sc->sc_state = state;
-	cv_signal(&sc->sc_statecv);
+	KASSERT(sc->sc_closing);
+	sc->sc_closing = false;
+	cv_broadcast(&sc->sc_statecv);
 	mutex_exit(&sc->sc_lock);
 
 	return error;
@@ -764,34 +812,11 @@ ucomread(dev_t dev, struct uio *uio, int flag)
 {
 	const int unit = UCOMUNIT(dev);
 	struct ucom_softc * const sc = device_lookup_private(&ucom_cd, unit);
-	int error;
-
-	UCOMHIST_FUNC(); UCOMHIST_CALLED();
-
-	if (sc == NULL)
-		return EIO;
-
-	mutex_enter(&sc->sc_lock);
-	if (sc->sc_dying) {
-		DPRINTF("... dying", 0, 0, 0, 0);
-		mutex_exit(&sc->sc_lock);
-		return EIO;
-	}
-
 	struct tty *tp = sc->sc_tty;
 
-	sc->sc_refcnt++;
-	mutex_exit(&sc->sc_lock);
-
-	error = ((*tp->t_linesw->l_read)(tp, uio, flag));
-
-	mutex_enter(&sc->sc_lock);
-	if (--sc->sc_refcnt < 0)
-		cv_broadcast(&sc->sc_detachcv);
-	DPRINTF("unit=%jd refcnt %jd", UCOMUNIT(dev), sc->sc_refcnt, 0, 0);
-	mutex_exit(&sc->sc_lock);
+	UCOMHIST_FUNC(); UCOMHIST_CALLED();
 
-	return error;
+	return (*tp->t_linesw->l_read)(tp, uio, flag);
 }
 
 int
@@ -799,34 +824,11 @@ ucomwrite(dev_t dev, struct uio *uio, int flag)
 {
 	const int unit = UCOMUNIT(dev);
 	struct ucom_softc * const sc = device_lookup_private(&ucom_cd, unit);
-	int error;
-
-	UCOMHIST_FUNC(); UCOMHIST_CALLED();
-
-	if (sc == NULL)
-		return EIO;
-
-	mutex_enter(&sc->sc_lock);
-	if (sc->sc_dying) {
-		DPRINTF("... dying", 0, 0, 0, 0);
-		mutex_exit(&sc->sc_lock);
-		return EIO;
-	}
-
 	struct tty *tp = sc->sc_tty;
 
-	sc->sc_refcnt++;
-	mutex_exit(&sc->sc_lock);
-
-	error = ((*tp->t_linesw->l_write)(tp, uio, flag));
-
-	mutex_enter(&sc->sc_lock);
-	if (--sc->sc_refcnt < 0)
-		cv_broadcast(&sc->sc_detachcv);
-	DPRINTF("unit=%jd refcnt %jd", UCOMUNIT(dev), sc->sc_refcnt, 0, 0);
-	mutex_exit(&sc->sc_lock);
+	UCOMHIST_FUNC(); UCOMHIST_CALLED();
 
-	return error;
+	return (*tp->t_linesw->l_write)(tp, uio, flag);
 }
 
 int
@@ -834,32 +836,11 @@ ucompoll(dev_t dev, int events, struct lwp *l)
 {
 	const int unit = UCOMUNIT(dev);
 	struct ucom_softc * const sc = device_lookup_private(&ucom_cd, unit);
-
-	UCOMHIST_FUNC(); UCOMHIST_CALLED();
-
-	if (sc == NULL)
-		return POLLHUP;
-
-	mutex_enter(&sc->sc_lock);
-	if (sc->sc_dying) {
-		DPRINTF("... dying", 0, 0, 0, 0);
-		mutex_exit(&sc->sc_lock);
-		return POLLHUP;
-	}
 	struct tty *tp = sc->sc_tty;
 
-	sc->sc_refcnt++;
-	mutex_exit(&sc->sc_lock);
-
-	int revents = ((*tp->t_linesw->l_poll)(tp, events, l));
-
-	mutex_enter(&sc->sc_lock);
-	if (--sc->sc_refcnt < 0)
-		cv_broadcast(&sc->sc_detachcv);
-	DPRINTF("unit=%jd refcnt %jd", UCOMUNIT(dev), sc->sc_refcnt, 0, 0);
-	mutex_exit(&sc->sc_lock);
+	UCOMHIST_FUNC(); UCOMHIST_CALLED();
 
-	return revents;
+	return (*tp->t_linesw->l_poll)(tp, events, l);
 }
 
 struct tty *
@@ -868,7 +849,7 @@ ucomtty(dev_t dev)
 	const int unit = UCOMUNIT(dev);
 	struct ucom_softc * const sc = device_lookup_private(&ucom_cd, unit);
 
-	return sc != NULL ? sc->sc_tty : NULL;
+	return sc->sc_tty;
 }
 
 int
@@ -876,38 +857,6 @@ ucomioctl(dev_t dev, u_long cmd, void *data, int flag, struct lwp *l)
 {
 	const int unit = UCOMUNIT(dev);
 	struct ucom_softc * const sc = device_lookup_private(&ucom_cd, unit);
-	int error;
-
-	UCOMHIST_FUNC(); UCOMHIST_CALLED();
-
-	if (sc == NULL)
-		return EIO;
-
-	mutex_enter(&sc->sc_lock);
-	if (sc->sc_dying) {
-		DPRINTF("... dying", 0, 0, 0, 0);
-		mutex_exit(&sc->sc_lock);
-		return EIO;
-	}
-
-	sc->sc_refcnt++;
-	mutex_exit(&sc->sc_lock);
-
-	error = ucom_do_ioctl(sc, cmd, data, flag, l);
-
-	mutex_enter(&sc->sc_lock);
-	if (--sc->sc_refcnt < 0)
-		cv_broadcast(&sc->sc_detachcv);
-	DPRINTF("unit=%jd refcnt %jd", UCOMUNIT(dev), sc->sc_refcnt, 0, 0);
-	mutex_exit(&sc->sc_lock);
-
-	return error;
-}
-
-static int
-ucom_do_ioctl(struct ucom_softc *sc, u_long cmd, void *data, int flag,
-    struct lwp *l)
-{
 	struct tty *tp = sc->sc_tty;
 	int error;
 
@@ -1147,20 +1096,9 @@ ucomparam(struct tty *tp, struct termios *t)
 	struct ucom_softc * const sc = device_lookup_private(&ucom_cd, unit);
 	int error = 0;
 
-	UCOMHIST_FUNC(); UCOMHIST_CALLED();
+	/* XXX should take tty lock around touching tp */
 
-	if (sc == NULL)
-		return EIO;
-
-	mutex_enter(&sc->sc_lock);
-	if (sc->sc_dying) {
-		DPRINTF("... dying", 0, 0, 0, 0);
-		mutex_exit(&sc->sc_lock);
-		return EIO;
-	}
-
-	sc->sc_refcnt++;
-	mutex_exit(&sc->sc_lock);
+	UCOMHIST_FUNC(); UCOMHIST_CALLED();
 
 	/* Check requested parameters. */
 	if (t->c_ispeed && t->c_ispeed != t->c_ospeed) {
@@ -1221,14 +1159,6 @@ XXX what if the hardware is not open
 	}
 #endif
 out:
-	mutex_enter(&sc->sc_lock);
-	if (--sc->sc_refcnt < 0)
-		cv_broadcast(&sc->sc_detachcv);
-	DPRINTF("unit=%jd refcnt %jd", UCOMUNIT(tp->t_dev), sc->sc_refcnt,
-	    0, 0);
-
-	mutex_exit(&sc->sc_lock);
-
 	return error;
 }
 
@@ -1241,9 +1171,6 @@ ucomhwiflow(struct tty *tp, int block)
 
 	UCOMHIST_FUNC(); UCOMHIST_CALLED();
 
-	if (sc == NULL)
-		return 0;
-
 	KASSERT(&sc->sc_lock);
 	KASSERT(mutex_owned(&tty_lock));
 
@@ -1271,16 +1198,6 @@ ucomstart(struct tty *tp)
 
 	UCOMHIST_FUNC(); UCOMHIST_CALLED();
 
-	if (sc == NULL)
-		return;
-
-	KASSERT(&sc->sc_lock);
-	KASSERT(mutex_owned(&tty_lock));
-	if (sc->sc_dying) {
-		DPRINTF("... dying", 0, 0, 0, 0);
-		return;
-	}
-
 	if (ISSET(tp->t_state, TS_BUSY | TS_TIMEOUT | TS_TTSTOP))
 		goto out;
 	if (sc->sc_tx_stopped)
@@ -1387,7 +1304,7 @@ ucom_write_status(struct ucom_softc *sc, struct ucom_buffer *ub,
 		mutex_spin_exit(&tty_lock);
 
 		if (err != USBD_CANCELLED && err != USBD_IOERROR &&
-		    !sc->sc_dying) {
+		    !sc->sc_closing) {
 			if ((ub = SIMPLEQ_FIRST(&sc->sc_obuff_full)) != NULL)
 				ucom_submit_write(sc, ub);
 
@@ -1428,15 +1345,9 @@ ucom_softintr(void *arg)
 	UCOMHIST_FUNC(); UCOMHIST_CALLED();
 
 	struct ucom_softc *sc = arg;
+	struct tty *tp = sc->sc_tty;
 
 	mutex_enter(&sc->sc_lock);
-	if (sc->sc_dying) {
-		DPRINTF("... dying", 0, 0, 0, 0);
-		mutex_exit(&sc->sc_lock);
-		return;
-	}
-
-	struct tty *tp = sc->sc_tty;
 	mutex_enter(&tty_lock);
 	if (!ISSET(tp->t_state, TS_ISOPEN)) {
 		mutex_exit(&tty_lock);
@@ -1483,11 +1394,7 @@ ucom_read_complete(struct ucom_softc *sc)
 
 		if (ub->ub_index == ub->ub_len) {
 			SIMPLEQ_REMOVE_HEAD(&sc->sc_ibuff_full, ub_link);
-
-			sc->sc_refcnt--;
-			/* increments sc_refcnt */
 			ucomsubmitread(sc, ub);
-
 			ub = SIMPLEQ_FIRST(&sc->sc_ibuff_full);
 		}
 	}
@@ -1511,8 +1418,6 @@ ucomsubmitread(struct ucom_softc *sc, struct ucom_buffer *ub)
 		return EIO;
 	}
 
-	sc->sc_refcnt++;
-
 	SIMPLEQ_INSERT_TAIL(&sc->sc_ibuff_empty, ub, ub_link);
 
 	return 0;
@@ -1522,6 +1427,7 @@ static void
 ucomreadcb(struct usbd_xfer *xfer, void *p, usbd_status status)
 {
 	struct ucom_softc *sc = (struct ucom_softc *)p;
+	struct tty *tp = sc->sc_tty;
 	struct ucom_buffer *ub;
 	uint32_t cc;
 	u_char *cp;
@@ -1530,15 +1436,13 @@ ucomreadcb(struct usbd_xfer *xfer, void *p, usbd_status status)
 
 	mutex_enter(&sc->sc_lock);
 
-	struct tty *tp = sc->sc_tty;
-
 	if (status == USBD_CANCELLED || status == USBD_IOERROR ||
-	    sc->sc_dying) {
+	    sc->sc_closing) {
 
-		DPRINTF("... done (status %jd dying %jd)", status, sc->sc_dying,
-		    0, 0);
+		DPRINTF("... done (status %jd closing %jd)",
+		    status, sc->sc_closing, 0, 0);
 
-		if (status == USBD_IOERROR || sc->sc_dying) {
+		if (status == USBD_IOERROR || sc->sc_closing) {
 			/* Send something to wake upper layer */
 			(tp->t_linesw->l_rint)('\n', tp);
 			mutex_spin_enter(&tty_lock);	/* XXX */
@@ -1546,12 +1450,7 @@ ucomreadcb(struct usbd_xfer *xfer, void *p, usbd_status status)
 			mutex_spin_exit(&tty_lock);	/* XXX */
 		}
 
-		if (--sc->sc_refcnt < 0)
-			cv_broadcast(&sc->sc_detachcv);
-		DPRINTF("unit=%jd refcnt %jd", UCOMUNIT(tp->t_dev),
-		    sc->sc_refcnt, 0, 0);
 		mutex_exit(&sc->sc_lock);
-
 		return;
 	}
 
@@ -1567,8 +1466,7 @@ ucomreadcb(struct usbd_xfer *xfer, void *p, usbd_status status)
 		}
 
 		DPRINTF("... done (status %jd)", status, 0, 0, 0);
-		sc->sc_refcnt--;
-		/* re-adds ub to sc_ibuff_empty and increments sc_refcnt */
+		/* re-adds ub to sc_ibuff_empty */
 		ucomsubmitread(sc, ub);
 		mutex_exit(&sc->sc_lock);
 		return;
@@ -1588,8 +1486,7 @@ ucomreadcb(struct usbd_xfer *xfer, void *p, usbd_status status)
 
 	if (sc->sc_state != UCOM_OPEN) {
 		/* Go around again - we're not quite ready */
-		sc->sc_refcnt--;
-		/* re-adds ub to sc_ibuff_empty and increments sc_refcnt */
+		/* re-adds ub to sc_ibuff_empty */
 		ucomsubmitread(sc, ub);
 		mutex_exit(&sc->sc_lock);
 		DPRINTF("... done (not open)", 0, 0, 0, 0);
@@ -1607,16 +1504,7 @@ ucomreadcb(struct usbd_xfer *xfer, void *p, usbd_status status)
 	ub->ub_len = cc;
 
 	mutex_enter(&sc->sc_lock);
-	if (sc->sc_dying) {
-		if (--sc->sc_refcnt < 0)
-			cv_broadcast(&sc->sc_detachcv);
-		mutex_exit(&sc->sc_lock);
-		DPRINTF("... dying", 0, 0, 0, 0);
-		return;
-	}
-
 	SIMPLEQ_INSERT_TAIL(&sc->sc_ibuff_full, ub, ub_link);
-
 	ucom_read_complete(sc);
 	mutex_exit(&sc->sc_lock);
 
@@ -1624,33 +1512,34 @@ ucomreadcb(struct usbd_xfer *xfer, void *p, usbd_status status)
 }
 
 static void
-ucom_cleanup(struct ucom_softc *sc)
+ucom_cleanup(struct ucom_softc *sc, int flag)
 {
+	struct tty *tp = sc->sc_tty;
 
 	UCOMHIST_FUNC(); UCOMHIST_CALLED();
 
-	DPRINTF("aborting pipes", 0, 0, 0, 0);
+	DPRINTF("closing pipes", 0, 0, 0, 0);
 
-	mutex_enter(&sc->sc_lock);
-
-	/* If we're dying then the detach routine will abort our pipes, etc */
-	if (sc->sc_dying) {
-		DPRINTF("... dying", 0, 0, 0, 0);
-
-		mutex_exit(&sc->sc_lock);
-		return;
-	}
+	/*
+	 * Close the tty and interrupt any pending opens waiting for
+	 * carrier so they restart or give up.  This may flush data.
+	 */
+	(*tp->t_linesw->l_close)(tp, flag);
+	ttyclose(tp);
 
-	mutex_exit(&sc->sc_lock);
+	/*
+	 * Interrupt any pending xfers and cause them to fail promptly.
+	 * New xfers will only be submitted under the lock after
+	 * sc_closing is cleared.
+	 */
+	usbd_abort_pipe(sc->sc_bulkin_pipe);
+	usbd_abort_pipe(sc->sc_bulkout_pipe);
 
+	/*
+	 * Hang up the phone and start the timer before we can make a
+	 * call again, if necessary.
+	 */
 	ucom_shutdown(sc);
-
-	if (sc->sc_bulkin_pipe != NULL) {
-		usbd_abort_pipe(sc->sc_bulkin_pipe);
-	}
-	if (sc->sc_bulkout_pipe != NULL) {
-		usbd_abort_pipe(sc->sc_bulkout_pipe);
-	}
 }
 
 #endif /* NUCOM > 0 */
>From ae9b001364d05c528c1dab0e61d6d8967aa79577 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Tue, 18 Jan 2022 01:11:50 +0000
Subject: [PATCH] driver(9): New devsw d_cancel op to interrupt I/O before
 close.

If specified, when revoking a device node or closing its last open
node, specfs will:

1. Call d_cancel, which should return promptly without blocking.
2. Wait for all concurrent d_read/write/ioctl/&c. to drain.
3. Call d_close.

Otherwise, specfs will:

1. Call d_close.
2. Wait for all concurrent d_read/write/ioctl/&c. to drain.

This fallback is problematic because often parts of d_close rely on
concurrent devsw operations to have completed already, so it is up to
each driver to have its own mechanism for waiting, and the extra step
in (2) is almost redundant.  But it is still important to ensure that
devsw operations are not active by the time a module tries to invoke
devsw_detach, because only d_open is protected against that.

The signature of d_cancel matches d_close, mostly so we don't raise
questions about `why is this different?'; the lwp argument is not
useful but we should remove it from open/cancel/close all at the same
time.

The only way d_cancel should fail, if it does at all, is with ENODEV,
meaning the driver doesn't support cancelling outstanding I/O, and
will take responsibility for that in d_close.  I would make it return
void and only have bdev_cancel and cdev_cancel possibly return ENODEV
so specfs can detect whether a driver supports it, but this would
break the pattern around devsw operation types.

Drivers are allowed to omit it from struct bdevsw, struct cdevsw --
if so, it is as if they used a function that just returns ENODEV.
---
 sys/kern/subr_devsw.c          | 36 ++++++++++++++++++++++++++++++++++
 sys/miscfs/specfs/spec_vnops.c | 31 ++++++++++++++++++++++-------
 sys/sys/conf.h                 |  5 +++++
 3 files changed, 65 insertions(+), 7 deletions(-)

diff --git a/sys/kern/subr_devsw.c b/sys/kern/subr_devsw.c
index 8b55187b32c1..78d2cf6204c6 100644
--- a/sys/kern/subr_devsw.c
+++ b/sys/kern/subr_devsw.c
@@ -935,6 +935,24 @@ bdev_open(dev_t dev, int flag, int devtype, lwp_t *l)
 	return rv;
 }
 
+int
+bdev_cancel(dev_t dev, int flag, int devtype, struct lwp *l)
+{
+	const struct bdevsw *d;
+	int rv, mpflag;
+
+	if ((d = bdevsw_lookup(dev)) == NULL)
+		return ENXIO;
+	if (d->d_cancel == NULL)
+		return ENODEV;
+
+	DEV_LOCK(d);
+	rv = (*d->d_cancel)(dev, flag, devtype, l);
+	DEV_UNLOCK(d);
+
+	return rv;
+}
+
 int
 bdev_close(dev_t dev, int flag, int devtype, lwp_t *l)
 {
@@ -1112,6 +1130,24 @@ cdev_open(dev_t dev, int flag, int devtype, lwp_t *l)
 	return rv;
 }
 
+int
+cdev_cancel(dev_t dev, int flag, int devtype, struct lwp *l)
+{
+	const struct cdevsw *d;
+	int rv, mpflag;
+
+	if ((d = cdevsw_lookup(dev)) == NULL)
+		return ENXIO;
+	if (d->d_cancel == NULL)
+		return ENODEV;
+
+	DEV_LOCK(d);
+	rv = (*d->d_cancel)(dev, flag, devtype, l);
+	DEV_UNLOCK(d);
+
+	return rv;
+}
+
 int
 cdev_close(dev_t dev, int flag, int devtype, lwp_t *l)
 {
diff --git a/sys/miscfs/specfs/spec_vnops.c b/sys/miscfs/specfs/spec_vnops.c
index 987cd5969d95..797a828cfd72 100644
--- a/sys/miscfs/specfs/spec_vnops.c
+++ b/sys/miscfs/specfs/spec_vnops.c
@@ -317,24 +317,25 @@ spec_io_exit(struct vnode *vp, struct specnode *sn)
 }
 
 /*
- * spec_io_drain(vp, sn)
+ * spec_io_drain(vp, sn, n)
  *
- *	Wait for all existing spec_io_enter/exit sections to complete.
- *	Caller must ensure spec_io_enter will fail at this point.
+ *	Wait for all existing spec_io_enter/exit sections but n to
+ *	complete.  Caller must ensure spec_io_enter will fail at this
+ *	point.
  */
 static void
-spec_io_drain(struct vnode *vp, struct specnode *sn)
+spec_io_drain(struct vnode *vp, struct specnode *sn, unsigned n)
 {
 
 	/*
 	 * I/O at the same time as closing is unlikely -- it often
 	 * indicates an application bug.
 	 */
-	if (__predict_true(atomic_load_relaxed(&sn->sn_iocnt) == 0))
+	if (__predict_true(atomic_load_relaxed(&sn->sn_iocnt) <= n))
 		return;
 
 	mutex_enter(&device_lock);
-	while (atomic_load_relaxed(&sn->sn_iocnt) != 0)
+	while (atomic_load_relaxed(&sn->sn_iocnt) > n)
 		cv_wait(&specfs_iocv, &device_lock);
 	mutex_exit(&device_lock);
 }
@@ -553,7 +554,7 @@ spec_node_revoke(vnode_t *vp)
 	 * Note: We drain even if we witnessed sn->sn_opencnt == 0,
 	 * because there may have been a concurrent VOP_CLOSE.
 	 */
-	spec_io_drain(vp, sn);
+	spec_io_drain(vp, sn, 0);
 }
 
 /*
@@ -1614,6 +1615,22 @@ spec_close(void *v)
 		VOP_UNLOCK(vp);
 	}
 
+	/*
+	 * If we can cancel all outstanding I/O, then wait for it to
+	 * drain before we call .d_close.  Drivers that split up
+	 * .d_cancel and .d_close this way need not have any internal
+	 * mechanism for waiting in .d_close for I/O to drain.
+	 */
+	if (vp->v_type == VBLK)
+		error = bdev_cancel(dev, flags, mode, curlwp);
+	else
+		error = cdev_cancel(dev, flags, mode, curlwp);
+	if (error == 0)
+		spec_io_drain(vp, sn, 1);
+	else
+		KASSERTMSG(error == ENODEV, "cancel dev=0x%lx failed with %d",
+		    (unsigned long)dev, error);
+
 	if (vp->v_type == VBLK)
 		error = bdev_close(dev, flags, mode, curlwp);
 	else
diff --git a/sys/sys/conf.h b/sys/sys/conf.h
index 16dd87e5480c..469eaa8ec29d 100644
--- a/sys/sys/conf.h
+++ b/sys/sys/conf.h
@@ -70,6 +70,7 @@ struct vnode;
  */
 struct bdevsw {
 	int		(*d_open)(dev_t, int, int, struct lwp *);
+	int		(*d_cancel)(dev_t, int, int, struct lwp *);
 	int		(*d_close)(dev_t, int, int, struct lwp *);
 	void		(*d_strategy)(struct buf *);
 	int		(*d_ioctl)(dev_t, u_long, void *, int, struct lwp *);
@@ -86,6 +87,7 @@ struct bdevsw {
  */
 struct cdevsw {
 	int		(*d_open)(dev_t, int, int, struct lwp *);
+	int		(*d_cancel)(dev_t, int, int, struct lwp *);
 	int		(*d_close)(dev_t, int, int, struct lwp *);
 	int		(*d_read)(dev_t, struct uio *, int);
 	int		(*d_write)(dev_t, struct uio *, int);
@@ -115,6 +117,7 @@ devmajor_t bdevsw_lookup_major(const struct bdevsw *);
 devmajor_t cdevsw_lookup_major(const struct cdevsw *);
 
 #define	dev_type_open(n)	int n (dev_t, int, int, struct lwp *)
+#define	dev_type_cancel(n)	int n (dev_t, int, int, struct lwp *)
 #define	dev_type_close(n)	int n (dev_t, int, int, struct lwp *)
 #define	dev_type_read(n)	int n (dev_t, struct uio *, int)
 #define	dev_type_write(n)	int n (dev_t, struct uio *, int)
@@ -165,6 +168,7 @@ paddr_t	nommap(dev_t, off_t, int);
 /* device access wrappers. */
 
 dev_type_open(bdev_open);
+dev_type_cancel(bdev_cancel);
 dev_type_close(bdev_close);
 dev_type_strategy(bdev_strategy);
 dev_type_ioctl(bdev_ioctl);
@@ -173,6 +177,7 @@ dev_type_size(bdev_size);
 dev_type_discard(bdev_discard);
 
 dev_type_open(cdev_open);
+dev_type_cancel(cdev_cancel);
 dev_type_close(cdev_close);
 dev_type_read(cdev_read);
 dev_type_write(cdev_write);


Home | Main Index | Thread Index | Old Index