tech-net archive
Re: TCP and NET_MPSAFE
On Fri, Apr 17, 2026 at 11:30 PM Kevin Bowling <kevin.bowling%kev009.com@localhost> wrote:
>
> On Wed, Apr 15, 2026 at 4:06 PM Kevin Bowling <kevin.bowling%kev009.com@localhost> wrote:
> >
> > Looking at https://nxr.netbsd.org/xref/src/doc/TODO.smpnet there is
> > mention of TCP being protected, and I think this is referring to the
> > KERNEL_LOCK around tcp_output() and the like.
> >
> > But a cursory read makes these seem unnecessary, i.e. they could be
> > dropped or switched to KERNEL_LOCK_UNLESS_NET_MPSAFE, because all the
> > callers already lock accordingly. A quick test doing so doesn't lead
> > to immediate issues with a NET_MPSAFE kernel. But it's also not a big
> > win because softnet_lock still dominates.
> >
> > Is there any recent history looking into any of this, and is anyone
> > working on it still that may want to compare notes?
>
> Here is an early attempt at an MPSAFE netinet:
> https://people.freebsd.org/~kbowling/netbsd/mpsafe-netinet-solock-v1.patch
This approach is an interesting stepping stone and good to get
familiar with the NetBSD stack but is insufficient as-is. The primary
issue is that serializing with a per-socket mutex (adaptive spin) has
a degenerate case on a single TCP stream (iperf3 -R), where one core is
running the user state (copy data to socket buffer) and another is
running an expensive path like ip_input that also needs the socket
buffer exclusively. We end up spending an excessive amount of time in
softint, which is the starvation issue I pointed out in my broader
concerns. I am now working to break the locks up further, using the
FreeBSD model as a guideline, so these two operations do not spin on
each other.
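
A minimal userland sketch of the kind of lock split described above,
assuming the contended buffer is the receive buffer and assuming "the
FreeBSD model" here means keeping protocol state and socket buffer
under separate locks. All names (toy_socket, toy_input, toy_receive)
are hypothetical and pthread mutexes stand in for kernel mutexes; this
is not code from the patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define TOY_SB_MAX 4096

/* Receive buffer with its own lock (hypothetical layout). */
struct toy_sockbuf {
	pthread_mutex_t sb_lock;
	size_t sb_cc;			/* bytes currently queued */
	char sb_data[TOY_SB_MAX];
};

/* Protocol state with a separate lock. */
struct toy_pcb {
	pthread_mutex_t pcb_lock;
	unsigned long rcv_nxt;		/* bytes accepted so far */
};

struct toy_socket {
	struct toy_pcb pcb;
	struct toy_sockbuf rcv;
};

void
toy_socket_init(struct toy_socket *so)
{
	pthread_mutex_init(&so->pcb.pcb_lock, NULL);
	pthread_mutex_init(&so->rcv.sb_lock, NULL);
	so->pcb.rcv_nxt = 0;
	so->rcv.sb_cc = 0;
}

/*
 * Input path: the expensive protocol work runs under pcb_lock only.
 * sb_lock is taken just for the append. Lock order: pcb, then sockbuf.
 */
size_t
toy_input(struct toy_socket *so, const char *seg, size_t len)
{
	size_t n;

	assert(seg != NULL);
	pthread_mutex_lock(&so->pcb.pcb_lock);
	/* ... header validation, reassembly, timers would go here ... */
	pthread_mutex_lock(&so->rcv.sb_lock);
	n = TOY_SB_MAX - so->rcv.sb_cc;
	if (n > len)
		n = len;
	memcpy(so->rcv.sb_data + so->rcv.sb_cc, seg, n);
	so->rcv.sb_cc += n;
	pthread_mutex_unlock(&so->rcv.sb_lock);
	so->pcb.rcv_nxt += n;
	pthread_mutex_unlock(&so->pcb.pcb_lock);
	return n;
}

/*
 * User path: never takes pcb_lock, so it cannot spin behind the
 * protocol processing above. For brevity the copy is done with
 * sb_lock held; a real kernel would try not to hold it across the
 * copy to user space.
 */
size_t
toy_receive(struct toy_socket *so, char *dst, size_t len)
{
	size_t n;

	pthread_mutex_lock(&so->rcv.sb_lock);
	n = so->rcv.sb_cc < len ? so->rcv.sb_cc : len;
	memcpy(dst, so->rcv.sb_data, n);
	memmove(so->rcv.sb_data, so->rcv.sb_data + n, so->rcv.sb_cc - n);
	so->rcv.sb_cc -= n;
	pthread_mutex_unlock(&so->rcv.sb_lock);
	return n;
}
```

With a single per-socket lock, toy_receive would have to wait for the
whole of toy_input; with the split it only contends for the short
append window.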
Part of what makes this an interesting stepping stone is that, since it
is done with pserialize, which cannot sleep, we have a clear path to
leapfrog epoch(9) and instead mostly mechanically replace it with Jeff
Roberson's smr(9). SMR complements pserialize in that it has a much
better write path with no broadcast IPI wait, and can approach the
read performance on TSO architectures or when using an SMR_LAZY flag.
More to come, but this is going to get pretty complicated.
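
To illustrate what a write path with no broadcast wait can look like,
here is a toy sequence-based reclamation sketch in userland C11. It is
not FreeBSD's smr(9) implementation and not part of the patch; the
toy_smr_* names, the fixed reader table, and the use of sequentially
consistent atomics throughout are all simplifying assumptions, and
sequence wraparound is ignored.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_NREADERS 4

struct toy_smr {
	atomic_uint wr_seq;			/* starts at 1 */
	atomic_uint rd_seq[TOY_NREADERS];	/* 0 means "not reading" */
};

void
toy_smr_init(struct toy_smr *s)
{
	atomic_init(&s->wr_seq, 1);
	for (int i = 0; i < TOY_NREADERS; i++)
		atomic_init(&s->rd_seq[i], 0);
}

/* Reader entry: publish the write sequence observed on entry. */
void
toy_smr_enter(struct toy_smr *s, int r)
{
	assert(r >= 0 && r < TOY_NREADERS);
	atomic_store(&s->rd_seq[r], atomic_load(&s->wr_seq));
}

/* Reader exit: mark this slot idle again. */
void
toy_smr_exit(struct toy_smr *s, int r)
{
	assert(r >= 0 && r < TOY_NREADERS);
	atomic_store(&s->rd_seq[r], 0);
}

/*
 * Writer: call after unlinking an object. Returns the goal sequence
 * the object must wait for. This does not block or interrupt anyone.
 */
unsigned
toy_smr_advance(struct toy_smr *s)
{
	return atomic_fetch_add(&s->wr_seq, 1) + 1;
}

/*
 * Writer: true once every active reader entered at or after `goal`,
 * meaning no reader can still hold a reference to the unlinked object.
 */
bool
toy_smr_poll(struct toy_smr *s, unsigned goal)
{
	for (int i = 0; i < TOY_NREADERS; i++) {
		unsigned seen = atomic_load(&s->rd_seq[i]);

		if (seen != 0 && seen < goal)
			return false;
	}
	return true;
}
```

The writer can stash the goal alongside the retired object and poll it
later (for example on the next allocation) instead of waiting for every
CPU to pass through a quiescent point before returning.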
> The major changes:
> 1) Per-socket locks. This is based on Robert Watson and Gleb Smirnoff
> and others' work in FreeBSD and is fairly critical for breaking up
> stack processing in a meaningful way to allow flow-level parallelism.
> 2) Safe memory reclamation using pserialize(9). This is structurally
> inspired by work Matt Macy did in FreeBSD for me at Limelight
> Networks. But NetBSD has some different tradeoffs which I will get to
> in a bit.
>
> There are some immediate concerns:
> 0) This requires NET_MPSAFE config. I have not even compiled a
> !NET_MPSAFE config with it and I have not thought through any problems
> associated with that yet. Consider that path broken with this patch
> until further effort proves otherwise.
> 1) I have only tested this in a LAN environment, on an EdgeRouter-4
> (MIPS64 Octeon). MIPS has a weak memory model so this is a good thing
> for the lockless programming, but it could break in subtle ways on
> different archs that I have not yet thought about.
> 2) TCP is a forgiving protocol, so there is certainly a chance I have
> not detected my own obvious breakage because things work well enough.
> 3) TCP has a lot of corner cases, some of which are difficult to
> exercise and some of which are rarely if ever seen.
> 4) I have done no testing outside of TCP/UDP. There are some
> protocols which should probably be dropped (DCCP) and SCTP would need
> a lot more scrutiny than I have given it (maybe start with a refresh vs
> FreeBSD). But GENERIC builds and testing can guide further work.
>
> There are some broader concerns:
> 1) pserialize pays a larger, per-use penalty versus FreeBSD's
> epoch(9) and its use in netinet there. This is both algorithmic and
> structural. We imported ConcurrencyKit (https://concurrencykit.org/)
> into FreeBSD, which offers Epoch Based Reclamation (EBR). Epoch is
> similar from a calling standpoint to pserialize, but has lower latency
> in implementation. Structurally, FreeBSD enters an epoch for the
> entire pass through the network stack, so reads are basically a couple
> of extra instructions. That could matter a lot, or it could just be a
> minor performance item. I have not spent any time looking to see if
> pserialize can be used in a similar way (structural) by adding
> scheduler awareness or improved internally (algorithmically).
> 2) TCP becomes fast enough that I can starve out SOFTINT_CLOCK on a
> single-core iperf3 -R on the ER4 system (requires driver improvements
> I will publish later), and it is likely possible on other hardware
> sizes. The effects of this can be somewhat dire, for instance TCP
> timers stop working, or if you are running a watchdog it won't get
> poked in time. This needs either a rethink of softint priority, or
> moving some of the work out of softint to something the scheduler can
> rotate.
>
> Regards,
> Kevin Bowling