tech-net archive


Re: TCP and NET_MPSAFE



On Sat, May 2, 2026 at 8:47 AM Jason Thorpe <thorpej%me.com@localhost> wrote:
>
> Hey Kevin - sorry, I was sitting on this email thinking about it, and finally had some spare brain cycles to really pay it some proper attention.
>
> > On Apr 19, 2026, at 5:17 PM, Kevin Bowling <kevin.bowling%kev009.com@localhost> wrote:
> >
> >> 2) TCP becomes fast enough that I can starve out SOFTINT_CLOCK on a
> >> single core iperf3 -R on the ER4 system (requires driver improvements
> >> I will publish later), and it is likely possible on other hardware
> >> sizes.  The effects of this can be somewhat dire, for instance TCP
> >> timers stop working, or if you are running a watchdog it won't get
> >> poked in time.  This needs either a rethink of softint priority, or
> >> moving some of the work out of softint to something the scheduler can
> >> rotate.

Hi Jason,

I think the root problem was of my own making.  My first plunge in was
to push a mutex into the socket, i.e. each socket has its own mutex.
What would happen is the network softints would be dispatching on one
core, and the application filling on another, classic starvation.

I then studied the FreeBSD locking model carefully, and iterated on and
borrowed many of its concepts.  What I've done will look very familiar
to FreeBSD people, but there are some key differences because we don't
have the netisr concept.  And with smr(9) for the protocols and
pserialize(9) for the lladdr table we can keep this svelte softint
design, which may be an advantage across the broad range of hardware
NetBSD runs on, versus punting everything to (sleepable) user threads
as FreeBSD does with its dual smr(9)/sleepable epoch(9) and sleepable
sx locks.

Lock order: SOCK_IO > SOCK_LOCK > INP_WLOCK > SOLISTEN > SOCKBUF_LOCK / SONOTIFY

SONOTIFY_LOCK sits below INP_WLOCK in the hierarchy, so protocol
input paths (tcp_input) can acquire it directly, without the solock dance.

SOLISTEN_LOCK protects accept-queue state (so_q, so_q0, so_qlen,
so_q0len, so_qlimit, so_head, so_onq).  It sits below INP_WLOCK so
tcp_input can acquire it without dropping INP_WLOCK.

SOCKBUF_LOCK and SONOTIFY do not nest.  SOCKBUF_LOCK is the "object
lock" for all sb_sel operations: selrecord (sopoll, fifo_poll),
selnotify (sowakeup, sowakeup_locked), and knote attach/detach
(soo_kqfilter, filt_so*detach).
SONOTIFY_LOCK is used only for cv_broadcast/cv_timedwait on so_cv
(connect/accept/close waits).

It might seem like a lot of locks, but several of them are co-equal
and the intent in breaking apart so much is that they are never
contested in the hot path.  I might be able to combine one level once
Unix Domain and some other work I will mention in a moment is figured
out.

/*
 * Kernel structure per socket.
 *
 * Field locking key:
 *   (a) atomic operations (atomic_load_acquire etc.)
 *   (k) SOCK_LOCK (so->so_lock mutex)
 *   (n) SONOTIFY_LOCK (so->so_notify_lock mutex)
 *   (b) SOCKBUF_LOCK (per-buffer sb_lock mutex)
 *   (l) SOLISTEN_LOCK (so->so_accept_lock mutex)
 *   (i) INP_WLOCK / INP_RLOCK (protocol inpcb rwlock)
 *   (c) constant after creation
 *   (r) only modified during attach/accept, then stable
 */
struct socket {
       kmutex_t * volatile so_lock;    /* the socket-level mutex */
       kcondvar_t      so_cv;          /* (n) notifier (interlock: so_notify_lock) */
       kmutex_t        so_notify_lock; /* cv interlock (connect/accept/close) */
       kmutex_t        so_accept_lock; /* accept-queue interlock */
       short           so_type;        /* (c) generic type, see socket.h */
       short           so_options;     /* (k) from socket call */
       u_short         so_linger;      /* (k) time to linger while closing */
       unsigned int    so_state;       /* (a) internal state flags SS_* */
       void            *so_pcb;        /* (r) protocol control block */
       const struct protosw *so_proto; /* protocol handle */

       struct socket   *so_head;       /* (l) back pointer to accept socket */
       struct soqhead  *so_onq;        /* (l) queue (q or q0) that we're on */
       struct soqhead  so_q0;          /* (l) queue of partial connections */
       struct soqhead  so_q;           /* (l) queue of incoming connections */
       TAILQ_ENTRY(socket) so_qe;      /* (l) our queue entry (q or q0) */
       short           so_q0len;       /* (l) partials on so_q0 */
       short           so_qlen;        /* (l) number of connections on so_q */
       short           so_qlimit;      /* (l) max number queued connections */
<abbreviated>

There are two major tarpits left.  Unix domain sockets have existing,
intricate centralized locking.  The Bluetooth stack uses a
protocol-level lock.  Unix sockets are definitely worth fixing, and I
haven't really looked at Bluetooth yet, but if both can become granular
it will simplify the main socket code, where I currently have to do a
trylock dance and mutex refcounting to handle these remainders.  I'm
not sure that pattern is correct, nor worth the work to try and prove,
so let's take it all the way.  While socket and netinet were
structurally similar to FreeBSD from birth (and more so with this
locking), FreeBSD has basically rewritten Unix sockets, and its
Bluetooth is handled by Netgraph, so there is no direct model.

I ran out of energy and have been working on some other projects.  When
I have time again I will try to work through Unix sockets and
Bluetooth, then publish my changes for people to try.  It's a sizable
diff, but a lot
of it is repeatable patterns to replace old ones as seen by the
insertions vs deletions:  116 files changed, 6716 insertions(+), 3658
deletions(-)

I would say the L4 protocols are done, socket locking stuff about 90%,
listen sockets 80%.  It's been running on a laptop with a window
manager (relevant to unix sockets) and Firefox for over a week with
lockdebug.  I also test on a "server" Octeon for weak memory ordering,
listen sockets, and scalability on smaller hardware; it has been
similarly stable.  It would be quite interesting to see it run through
something like the test cluster.  I need to spend some time on the
test suite to see what might need more coverage.

A lengthy response, but a lot has happened since my initial mail.

> I definitely agree that a re-think of the software interrupt priorities is in order.  It seems to me that the logical ordering should be more like:
>
>     clock > net == serial > bio
>
> And that there should be another “general” level (because “clock” has served as “general” historically) at the bottom, below bio (would hate for some random thing to get in the way of a page-in completion), so:
>
>     clock > net == serial > bio > gen
>
> My logic for equating “net” and “serial” goes a little something like this: today’s high-throughput network interfaces have tight performance constraints that might bump them above the servicing of UARTs, but there are cases where UARTs (or serial line devices generally) are part of the networking stack, and putting them below networking in the priority order seems like an icky inversion.
>
> If a platform really wanted to use hardware support as the scheduling trigger and had 2 hardware levels to throw at it, I would say:
>
>     clock > net == serial >= bio >= gen
>
> I.e. clock always on top, net and serial always equal, and an opportunistic (or software-only) prioritization below that.  But I’m willing to be convinced otherwise if there’s a good argument for a different ordering.
>
> -- thorpej
>

