Source-Changes-HG archive

[src/trunk]: src/sys Changes to make EVFILT_PROC MP-safe:



details:   https://anonhg.NetBSD.org/src/rev/dd999b0bf5a3
branches:  trunk
changeset: 1024118:dd999b0bf5a3
user:      thorpej <thorpej%NetBSD.org@localhost>
date:      Sun Oct 10 18:07:51 2021 +0000

description:
Changes to make EVFILT_PROC MP-safe:

Because the locking protocol around processes is somewhat complex
compared to other events that can be posted on kqueues, introduce
new functions for posting NOTE_EXEC, NOTE_EXIT, and NOTE_FORK,
rather than just using the generic knote() function.  These functions
KASSERT() their locking expectations and deal with the other
complexities of each situation.
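
As an illustrative sketch only (not the commit's actual code, and
assuming it lives in sys/kern/kern_event.c): such a posting function
asserts the locking protocol up front, then walks the process's klist
directly instead of calling the generic knote().  The function name
and the use of f_event/knote_activate for delivery are assumptions
based on the surrounding kqueue code.

    static void
    knote_proc_exit_sketch(struct proc *p)
    {
        struct knote *kn;

        /* Asserted locking expectation: caller holds p_lock. */
        KASSERT(mutex_owned(p->p_lock));

        SLIST_FOREACH(kn, &p->p_klist, kn_selnext) {
            /* Each knote's filter decides whether it activates. */
            if (kn->kn_fop->f_event(kn, NOTE_EXIT))
                knote_activate(kn);
        }
    }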

knote_proc_fork(), in particular, needs to handle NOTE_TRACK, which
requires allocation of a new knote to attach to the child process.  We
don't want to be allocating memory while holding the parent's p_lock.
Furthermore, we also have to attach the tracking note to the child
process, which means we have to acquire the child's p_lock.

So, to handle all this, we introduce some additional synchronization
infrastructure around the 'knote' structure (a condensed usage sketch
follows the list):

- Add the ability to mark a knote as being in a state of flux.  Knotes
  in this state are guaranteed not to be detached/deleted, thus allowing
  a code path to drop other locks after putting a knote in this state.

- Code paths that wish to detach/delete a knote must first check if the
  knote is in-flux.  If so, they must wait for it to quiesce.  Because
  multiple threads of execution may attempt this concurrently, a mechanism
  exists for a single LWP to claim the detach responsibility; all other
  threads simply wait for the knote to disappear before they can make
  further progress.

- When kqueue_scan() encounters an in-flux knote, it treats the situation
  just like encountering another thread's queue marker -- wait for the
  flux to settle, then continue on.
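
The following is a condensed, hypothetical sketch of the pattern the
points above describe, written against the helpers this change
actually adds (kn_enter_flux(), kn_leave_flux(), and KQ_FLUX_WAKEUP()
all appear in the diff below); the work done while in-flux is elided.

    static void
    knote_flux_pattern_sketch(struct kqueue *kq, struct knote *kn)
    {
        mutex_spin_enter(&kq->kq_lock);
        if (!kn_enter_flux(kn)) {
            /* KN_WILLDETACH is set; the knote is on its way out. */
            mutex_spin_exit(&kq->kq_lock);
            return;
        }
        mutex_spin_exit(&kq->kq_lock);

        /*
         * The knote cannot be detached while in-flux, so it is now
         * safe to drop locks, acquire other locks, and allocate
         * memory -- with KM_NOSLEEP only (elided).
         */

        mutex_spin_enter(&kq->kq_lock);
        if (kn_leave_flux(kn)) {
            /* Flux count reached zero; wake any waiting threads. */
            KQ_FLUX_WAKEUP(kq);
        }
        mutex_spin_exit(&kq->kq_lock);
    }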

(The "in-flux knote" idea was inspired by FreeBSD, but this works differently
from their implementation, as the two kqueue implementations have diverged
quite a bit.)

knote_proc_fork() uses this infrastructure to implement NOTE_TRACK like so
(the allocation step is sketched after the list):

- Attempt to put the original tracking knote into a state of flux; if this
  fails (because the note has a detach pending), we skip all processing
  (the original process has lost interest, and we simply won the race).

- Once the note is in-flux, drop the kq and forking process's locks, and
  allocate 2 knotes: one to post the NOTE_CHILD event, and one to attach
  a new NOTE_TRACK to the child process.  Notably, we do NOT go through
  kqueue_register() to do this, but rather do all of the work directly
  and KASSERT() our assumptions; this allows us to directly control our
  interaction with locks.  All memory allocations here are performed with
  KM_NOSLEEP, in order to prevent holding the original knote in-flux
  indefinitely.

- Because the NOTE_TRACK use case adds knotes to kqueues through a
  sort of back-door mechanism, we must serialize with the closing of
  the destination kqueue's file descriptor.  To do this, we steal
  another bit from the kq_count field to notify other threads that a
  kqueue is on its way out, preventing new knotes from being enqueued
  while the close path detaches them (a sketch of this check follows).
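
A hypothetical sketch of the allocation step described above, assuming
plain kmem_zalloc()/kmem_free() for knote storage (the commit may use
a dedicated allocator); the point is that KM_NOSLEEP bounds how long
the original knote stays in-flux.

    static void
    note_track_fork_sketch(void)
    {
        struct knote *kn_child, *kn_track;

        /*
         * The parent's tracking knote is in-flux and all locks have
         * been dropped.  Allocate with KM_NOSLEEP so the original
         * knote is never held in-flux for an unbounded time.
         */
        kn_child = kmem_zalloc(sizeof(*kn_child), KM_NOSLEEP);
        kn_track = kmem_zalloc(sizeof(*kn_track), KM_NOSLEEP);
        if (kn_child == NULL || kn_track == NULL) {
            /* Sketch: give up rather than sleep. */
            if (kn_child != NULL)
                kmem_free(kn_child, sizeof(*kn_child));
            if (kn_track != NULL)
                kmem_free(kn_track, sizeof(*kn_track));
            return;
        }

        /*
         * Fill in both events and attach them to the child under the
         * child's p_lock, KASSERT()ing assumptions directly instead
         * of re-entering kqueue_register() (elided).
         */
    }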
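And a sketch of the serialization check on the back-door attach path;
the bit name and value here are made up for illustration (the real
definition lives in sys/sys/eventvar.h).

    /* Illustrative name and value only; see sys/sys/eventvar.h. */
    #define KQ_CLOSING_SKETCH    0x80000000U

    static bool
    kqueue_accepting_knotes_sketch(struct kqueue *kq)
    {
        KASSERT(mutex_owned(&kq->kq_lock));

        /*
         * A back-door attach (NOTE_TRACK) must not enqueue onto a
         * kqueue whose close path is already detaching knotes.
         */
        return (kq->kq_count & KQ_CLOSING_SKETCH) == 0;
    }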

In addition to fixing EVFILT_PROC's reliance on KERNEL_LOCK, this also
fixes a long-standing bug whereby a NOTE_CHILD event could be dropped
if the child process exited before the interested process received the
NOTE_CHILD event (the same knote would be used to deliver the NOTE_EXIT
event, and would clobber the NOTE_CHILD's 'data' field).
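
A sketch of why the fix works (the helper name is hypothetical; field
usage follows kevent(2), where NOTE_CHILD carries the parent pid in
'data'): NOTE_CHILD is now delivered on its own knote, so NOTE_EXIT
can no longer clobber it.

    static void
    note_child_post_sketch(struct knote *kn_child, pid_t ppid)
    {
        /*
         * NOTE_CHILD now rides a dedicated knote, so a later
         * NOTE_EXIT -- delivered via a different knote -- can no
         * longer overwrite this 'data' field before the watcher
         * dequeues the event.
         */
        kn_child->kn_fflags = NOTE_CHILD;
        kn_child->kn_data = ppid;    /* parent pid, per kevent(2) */
    }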

Add a bunch of comments to explain what's going on in various critical
sections, and sprinkle additional KASSERT()s to validate assumptions
in several more locations.

diffstat:

 sys/kern/kern_event.c |  794 +++++++++++++++++++++++++++++++++++++++++++------
 sys/kern/kern_exec.c  |   17 +-
 sys/kern/kern_exit.c  |   36 +-
 sys/kern/kern_fork.c  |    6 +-
 sys/sys/event.h       |    7 +-
 sys/sys/eventvar.h    |   20 +-
 sys/sys/proc.h        |   11 +-
 7 files changed, 762 insertions(+), 129 deletions(-)

diffs (truncated from 1309 to 300 lines):

diff -r 8544fafa9249 -r dd999b0bf5a3 sys/kern/kern_event.c
--- a/sys/kern/kern_event.c     Sun Oct 10 17:47:38 2021 +0000
+++ b/sys/kern/kern_event.c     Sun Oct 10 18:07:51 2021 +0000
@@ -1,7 +1,7 @@
-/*     $NetBSD: kern_event.c,v 1.128 2021/09/30 01:20:53 thorpej Exp $ */
+/*     $NetBSD: kern_event.c,v 1.129 2021/10/10 18:07:51 thorpej Exp $ */
 
 /*-
- * Copyright (c) 2008, 2009 The NetBSD Foundation, Inc.
+ * Copyright (c) 2008, 2009, 2021 The NetBSD Foundation, Inc.
  * All rights reserved.
  *
  * This code is derived from software contributed to The NetBSD Foundation
@@ -58,8 +58,10 @@
  * FreeBSD: src/sys/kern/kern_event.c,v 1.27 2001/07/05 17:10:44 rwatson Exp
  */
 
+#include "opt_ddb.h"
+
 #include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: kern_event.c,v 1.128 2021/09/30 01:20:53 thorpej Exp $");
+__KERNEL_RCSID(0, "$NetBSD: kern_event.c,v 1.129 2021/10/10 18:07:51 thorpej Exp $");
 
 #include <sys/param.h>
 #include <sys/systm.h>
@@ -134,7 +136,7 @@
 };
 
 static const struct filterops proc_filtops = {
-       .f_flags = 0,
+       .f_flags = FILTEROP_MPSAFE,
        .f_attach = filt_procattach,
        .f_detach = filt_procdetach,
        .f_event = filt_proc,
@@ -177,8 +179,6 @@
 extern const struct filterops fs_filtops;      /* vfs_syscalls.c */
 extern const struct filterops sig_filtops;     /* kern_sig.c */
 
-#define KQ_FLUX_WAKEUP(kq)     cv_broadcast(&kq->kq_cv)
-
 /*
  * Table for all system-defined filters.
  * These should be listed in the numeric order of the EVFILT_* defines.
@@ -234,10 +234,189 @@
  *             Typically,      f_event(NOTE_SUBMIT) via knote: object lock
  *                             f_event(!NOTE_SUBMIT) via knote: nothing,
  *                                     acquires/releases object lock inside.
+ *
+ * Locking rules when detaching knotes:
+ *
+ * There are some situations where knote submission may require dropping
+ * locks (see knote_proc_fork()).  In order to support this, it's possible
+ * to mark a knote as being 'in-flux'.  Such a knote is guaranteed not to
+ * be detached while it remains in-flux.  Because it will not be detached,
+ * locks can be dropped so e.g. memory can be allocated, locks on other
+ * data structures can be acquired, etc.  During this time, any attempt to
+ * detach an in-flux knote must wait until the knote is no longer in-flux.
+ * When this happens, the knote is marked for death (KN_WILLDETACH) and the
+ * LWP who gets to finish the detach operation is recorded in the knote's
+ * 'udata' field (which is no longer required for its original purpose once
+ * a knote is so marked).  Code paths that lead to knote_detach() must ensure
+ * that their LWP is the one tasked with its final demise after waiting for
+ * the in-flux status of the knote to clear.  Note that once a knote is
+ * marked KN_WILLDETACH, no code paths may put it into an in-flux state.
+ *
+ * Once the special circumstances have been handled, the locks are re-
+ * acquired in the proper order (object lock -> kq_lock), the knote taken
+ * out of flux, and any waiters are notified.  Because waiters must have
+ * also dropped *their* locks in order to safely block, they must re-
+ * validate all of their assumptions; see knote_detach_quiesce().  See also
+ * the kqueue_register() (EV_ADD, EV_DELETE) and kqueue_scan() (EV_ONESHOT)
+ * cases.
+ *
+ * When kqueue_scan() encounters an in-flux knote, the situation is
+ * treated like another LWP's list marker.
+ *
+ * LISTEN WELL: It is important to not hold knotes in flux for an
+ * extended period of time! In-flux knotes effectively block any
+ * progress of the kqueue_scan() operation.  Any code paths that place
+ * knotes in-flux should be careful to not block for indefinite periods
+ * of time, such as for memory allocation (i.e. KM_NOSLEEP is OK, but
+ * KM_SLEEP is not).
  */
 static krwlock_t       kqueue_filter_lock;     /* lock on filter lists */
 static kmutex_t                kqueue_timer_lock;      /* for EVFILT_TIMER */
 
+#define        KQ_FLUX_WAIT(kq)        (void)cv_wait(&kq->kq_cv, &kq->kq_lock)
+#define        KQ_FLUX_WAKEUP(kq)      cv_broadcast(&kq->kq_cv)
+
+static inline bool
+kn_in_flux(struct knote *kn)
+{
+       KASSERT(mutex_owned(&kn->kn_kq->kq_lock));
+       return kn->kn_influx != 0;
+}
+
+static inline bool
+kn_enter_flux(struct knote *kn)
+{
+       KASSERT(mutex_owned(&kn->kn_kq->kq_lock));
+
+       if (kn->kn_status & KN_WILLDETACH) {
+               return false;
+       }
+
+       KASSERT(kn->kn_influx < UINT_MAX);
+       kn->kn_influx++;
+
+       return true;
+}
+
+static inline bool
+kn_leave_flux(struct knote *kn)
+{
+       KASSERT(mutex_owned(&kn->kn_kq->kq_lock));
+       KASSERT(kn->kn_influx > 0);
+       kn->kn_influx--;
+       return kn->kn_influx == 0;
+}
+
+static void
+kn_wait_flux(struct knote *kn, bool can_loop)
+{
+       bool loop;
+
+       KASSERT(mutex_owned(&kn->kn_kq->kq_lock));
+
+       /*
+        * It may not be safe for us to touch the knote again after
+        * dropping the kq_lock.  The caller has let us know in
+        * 'can_loop'.
+        */
+       for (loop = true; loop && kn->kn_influx != 0; loop = can_loop) {
+               KQ_FLUX_WAIT(kn->kn_kq);
+       }
+}
+
+#define        KNOTE_WILLDETACH(kn)                                            \
+do {                                                                   \
+       (kn)->kn_status |= KN_WILLDETACH;                               \
+       (kn)->kn_kevent.udata = curlwp;                                 \
+} while (/*CONSTCOND*/0)
+
+/*
+ * Wait until the specified knote is in a quiescent state and
+ * safe to detach.  Returns true if we potentially blocked (and
+ * thus dropped our locks).
+ */
+static bool
+knote_detach_quiesce(struct knote *kn)
+{
+       struct kqueue *kq = kn->kn_kq;
+       filedesc_t *fdp = kq->kq_fdp;
+
+       KASSERT(mutex_owned(&fdp->fd_lock));
+
+       mutex_spin_enter(&kq->kq_lock);
+       /*
+        * There are two cases where we might see KN_WILLDETACH here:
+        *
+        * 1. Someone else has already started detaching the knote but
+        *    had to wait for it to settle first.
+        *
+        * 2. We had to wait for it to settle, and had to come back
+        *    around after re-acquiring the locks.
+        *
+        * When KN_WILLDETACH is set, we also set the LWP that claimed
+        * the prize of finishing the detach in the 'udata' field of the
+        * knote (which will never be used again for its usual purpose
+        * once the note is in this state).  If it doesn't point to us,
+        * we must drop the locks and let them in to finish the job.
+        *
+        * Otherwise, once we have claimed the knote for ourselves, we
+        * can finish waiting for it to settle.  This is the only scenario
+        * where touching a detaching knote is safe after dropping the
+        * locks.
+        */
+       if ((kn->kn_status & KN_WILLDETACH) != 0 &&
+           kn->kn_kevent.udata != curlwp) {
+               /*
+                * N.B. it is NOT safe for us to touch the knote again
+                * after dropping the locks here.  The caller must go
+                * back around and re-validate everything.  However, if
+                * the knote is in-flux, we want to block to minimize
+                * busy-looping.
+                */
+               mutex_exit(&fdp->fd_lock);
+               if (kn_in_flux(kn)) {
+                       kn_wait_flux(kn, false);
+                       mutex_spin_exit(&kq->kq_lock);
+                       return true;
+               }
+               mutex_spin_exit(&kq->kq_lock);
+               preempt_point();
+               return true;
+       }
+       /*
+        * If we get here, we know that we will be claiming the
+        * detach responsibilities, or that we already have and
+        * this is the second attempt after re-validation.
+        */
+       KASSERT((kn->kn_status & KN_WILLDETACH) == 0 ||
+               kn->kn_kevent.udata == curlwp);
+       /*
+        * Similarly, if we get here, either we are just claiming it
+        * and may have to wait for it to settle, or this is the
+        * second attempt after re-validation and no other code paths
+        * have put it in-flux.
+        */
+       KASSERT((kn->kn_status & KN_WILLDETACH) == 0 ||
+               kn_in_flux(kn) == false);
+       KNOTE_WILLDETACH(kn);
+       if (kn_in_flux(kn)) {
+               mutex_exit(&fdp->fd_lock);
+               kn_wait_flux(kn, true);
+               /*
+                * It is safe for us to touch the knote again after
+                * dropping the locks, but the caller must still
+                * re-validate everything because other aspects of
+                * the environment may have changed while we blocked.
+                */
+               KASSERT(kn_in_flux(kn) == false);
+               mutex_spin_exit(&kq->kq_lock);
+               return true;
+       }
+       mutex_spin_exit(&kq->kq_lock);
+
+       return false;
+}
+
 static int
 filter_attach(struct knote *kn)
 {
@@ -577,24 +756,9 @@
 filt_procattach(struct knote *kn)
 {
        struct proc *p;
-       struct lwp *curl;
-
-       curl = curlwp;
 
        mutex_enter(&proc_lock);
-       if (kn->kn_flags & EV_FLAG1) {
-               /*
-                * NOTE_TRACK attaches to the child process too early
-                * for proc_find, so do a raw look up and check the state
-                * explicitly.
-                */
-               p = proc_find_raw(kn->kn_id);
-               if (p != NULL && p->p_stat != SIDL)
-                       p = NULL;
-       } else {
-               p = proc_find(kn->kn_id);
-       }
-
+       p = proc_find(kn->kn_id);
        if (p == NULL) {
                mutex_exit(&proc_lock);
                return ESRCH;
@@ -606,7 +770,7 @@
         */
        mutex_enter(p->p_lock);
        mutex_exit(&proc_lock);
-       if (kauth_authorize_process(curl->l_cred,
+       if (kauth_authorize_process(curlwp->l_cred,
            KAUTH_PROCESS_KEVENT_FILTER, p, NULL, NULL, NULL) != 0) {
                mutex_exit(p->p_lock);
                return EACCES;
@@ -616,13 +780,11 @@
        kn->kn_flags |= EV_CLEAR;       /* automatically set */
 
        /*
-        * internal flag indicating registration done by kernel
+        * NOTE_CHILD is only ever generated internally; don't let it
+        * leak in from user-space.  See knote_proc_fork_track().
         */
-       if (kn->kn_flags & EV_FLAG1) {
-               kn->kn_data = kn->kn_sdata;     /* ppid */
-               kn->kn_fflags = NOTE_CHILD;
-               kn->kn_flags &= ~EV_FLAG1;
-       }
+       kn->kn_sfflags &= ~NOTE_CHILD;
+
        SLIST_INSERT_HEAD(&p->p_klist, kn, kn_selnext);
        mutex_exit(p->p_lock);
 
@@ -642,91 +804,350 @@
 static void
 filt_procdetach(struct knote *kn)
 {
+       struct kqueue *kq = kn->kn_kq;
        struct proc *p;
 
-       if (kn->kn_status & KN_DETACHED)
-               return;
-
-       p = kn->kn_obj;
-
-       mutex_enter(p->p_lock);
-       SLIST_REMOVE(&p->p_klist, kn, knote, kn_selnext);


