tech-kern archive


Re: smr(9) and pool_cache_set_smr(9)



> Date: Tue, 21 Apr 2026 05:24:39 -0700
> From: Kevin Bowling <kevin.bowling%kev009.com@localhost>
> 
> I am working on making netinet MPSAFE and am using FreeBSD's locking
> model as a guide because there is a lot of nuance and it has seen
> decades of wear.  The stacks have not diverged fundamentally, and the
> approach has proven sound so far where I can apply a similar locking
> template to this stack.

Hi Kevin!  It's great to see work toward making netinet MP-safe.
Sorry to take so long to provide feedback.  This is my first pass of
review; I'm sure I'll have more detailed thoughts later once I've
digested the SMR algorithm internals.

> In doing so, it is desirable to port Jeff Roberson's smr(9) which is
> used in FreeBSD [1].  This is a concurrency primitive that generally
> complements pserialize(9) with different tradeoffs and each are
> appropriate for different areas in the kernel.  The biggest difference
> is on the write side, where a shared sequence number (default) or
> ticks (SMR_LAZY) are used to clear lifecycle hazards that will be
> familiar to pserialize(9) users and are comparatively cheap against an
> IPI.

Let me see if I understand this correctly, since there are a lot of
acronyms and algorithms and research papers in this vicinity.  Just
correct whatever misunderstandings I have in the following:

1. smr(9) (Safe Memory Reclamation) is an API that can be backed by
   any of a variety of memory reclamation algorithms, where the basic
   idea is to guarantee that when multiple threads are handling some
   common resource, they coordinate so the resource isn't freed until
   they're all done, with different tradeoffs in coordination costs:

T1 ---+------------------------------------------------------->
       \ publish

T2 ----------+---------------+------+------------------------->
              \ lookup        \ use  \ release

T3 -------+-------------+-------------------+----------------->
           \ lookup      \ delete            \ safe to free

T4 -------------------------+--------------------------------->
                             \ lookup fails

   Algorithms or concepts in this family include passive
   serialization, RCU (read/copy/update), QSBR (quiescent-state-based
   reclamation), EBR (epoch-based reclamation), hazard pointers, and
   no doubt various others, with varying levels of specificity.

2. We already have a few kernel APIs for algorithms of this type:

   (a) pserialize(9) -- really just RCU since patents expired; can't
       be held across sleep
   (b) psref(9) -- can be held across sleep, costs O(ncpu) +
       O(nresources) + O(nreferences) space, used for (e.g.) routes
   (c) localcount(9) -- can be held across sleep, costs O(ncpu *
       nresources) space, used for (e.g.) device drivers

   But what they all have in common is that waiting for a safe time to
   free is a synchronous blocking operation: pserialize_perform,
   psref_target_destroy, localcount_drain.  That's adequate for some
   purposes, but for others, it would be nice to gather moribund
   resources in batches to free asynchronously.

3. smr(9), together with the new pool_cache_set_smr(9), provides just
   that: a way to gather moribund resources in batches to free
   asynchronously.

   Now we can break the introduction of smr(9) down into three parts:

   (a) The underlying SMR algorithm.
   (b) The integration with pool_cache(9).
   (c) Rules for usage.

   They are all interesting but the most important, to me, is (c):
   rules for usage:

   - When should we vary the parameters?
   - How do we prove whether code is using it correctly?
   - How do we detect mistakes at run-time?

   So while I would be curious to hear a little about why GUS is a
   better choice than QSBR or EBR or whatever, and why the patch to
   subr_pool.c is so large, I'd like to see most focus first on the
   rules for usage, with examples and counterexamples.
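
To make the contrast between points 2 and 3 concrete, here is a
minimal single-threaded userland sketch of deferred, batched
reclamation.  It is a toy epoch-style model of my own, not GUS and not
the proposed patch: all names (wr_seq, rd_seq, limbo, retire, reclaim)
are hypothetical, and there are no atomics or barriers because nothing
runs concurrently here.  The point is only that retire() never blocks,
and that reclaim() frees whatever no active reader could still have
found.

```c
#include <assert.h>
#include <stddef.h>

#define NREADERS 4
#define NLIMBO   8

static unsigned wr_seq = 1;        /* global write sequence */
static unsigned rd_seq[NREADERS];  /* sequence seen at entry; 0 = idle */
static unsigned limbo[NLIMBO];     /* goals of retired, unfreed objects */
static size_t nlimbo;

void reader_enter(int r) { rd_seq[r] = wr_seq; }
void reader_exit(int r)  { rd_seq[r] = 0; }

/*
 * Called after the object has been unlinked.  Tags it with a fresh
 * sequence number and queues it; never waits for readers.
 */
void retire(void)
{
	assert(nlimbo < NLIMBO);
	limbo[nlimbo++] = ++wr_seq;
}

/*
 * Free every queued object whose goal has been reached by all active
 * readers.  Returns the number of objects freed in this batch.
 */
size_t reclaim(void)
{
	unsigned min = wr_seq;
	size_t i, kept = 0, freed = 0;

	for (i = 0; i < NREADERS; i++) {
		if (rd_seq[i] != 0 && rd_seq[i] < min)
			min = rd_seq[i];
	}
	for (i = 0; i < nlimbo; i++) {
		if (limbo[i] <= min)
			freed++;	/* a real allocator would free here */
		else
			limbo[kept++] = limbo[i];
	}
	nlimbo = kept;
	return freed;
}
```

A reader that entered before retire() holds back only the objects
retired after it entered; a reader that entered afterward cannot have
found them and holds back nothing.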

Here are some little issues I see, before I've understood all the
details:

>      Threads enter a read section by calling smr_enter() or smr_lazy_enter().
>      Read sections should be short, and many operations are not permitted
>      while in a read section.  Specifically, kernel preemption is disabled,
>      and thus readers may not acquire blocking mutexes such as mutex(9) with
>      the MUTEX_DEFAULT type.

This is not quite right: MUTEX_DEFAULT can be adaptive/blocking or
spin (it is really a vestige of an earlier API design).  What makes
the difference is whether the IPL the mutex synchronizes with is a
hard interrupt IPL or not (IPL_SOFT* or IPL_NONE).

Notably, mutexes for synchronizing between softints _may sleep_, even
in softint context.  This has caused some trouble when using
pserialize(9), because it doesn't work if a reader sleeps, and it's
why we (a) added diagnostics to prevent sleeping and (b) added
psref(9) to hold onto things across sleeps, as a stop-gap measure
until we can eliminate the sleeps.

>      smr_enter() is used for non-lazy SMR contexts and issues a full memory
>      barrier (membar_sync()) on entry.  smr_lazy_enter() is used for lazy SMR
>      contexts (created with SMR_LAZY) and does not issue a memory barrier on
>      entry, relying instead on clock interrupts to flush store buffers.  On
>      exit, smr_exit() issues a release barrier while smr_lazy_exit() issues an
>      exit barrier.

What motivates the term `lazy'?

We have it in a few places like (legacy) lazy fpu switching, where I
understand what lazy means (storing fpu register content to memory is
deferred until it is actually necessary to do so), and in fstrans(9)
or FSYNC_LAZY, where I don't understand what it means.  And from the
man page, it sounds like the lazy version is new in your proposal for
NetBSD, so it doesn't come from FreeBSD.

Why would you choose lazy vs non-lazy SMR?

>      o   User-context callers (e.g., system calls such as bind(2) or
>          connect(2)) must raise the IPL before entering the read section.  For
>          network SMR contexts, wrap with splsoftnet() and splx():
> 
>                int s;
> 
>                s = splsoftnet();
>                smr_lazy_enter(smr);
>                /* ... read section ... */
>                smr_lazy_exit(smr);
>                splx(s);

Can we just include the allowed IPL in the smr descriptor itself, and
have it do splsoftnet inside?

	smr = smr_create("conn", 0, SMR_LAZY, IPL_SOFTNET);
	...
	s = smr_lazy_enter(smr);
	...
	smr_lazy_exit(smr, s);

Ideally, smr_enter and smr_lazy_enter would also be able to detect
whether you have tried using them at _too high_ an IPL.
Unfortunately, this is tricky if they can be configured to be used at
hard interrupt IPLs, because for spin mutexes, only the _last_
mutex_exit lowers the IPL:

	// ipl = IPL_NONE
	mutex_enter(&ipl_vm_lock);
	// ipl = IPL_VM
	mutex_enter(&ipl_sched_lock);
	// ipl = IPL_SCHED
	mutex_exit(&ipl_sched_lock);
	// ipl = IPL_SCHED, still
	mutex_exit(&ipl_vm_lock);
	// ipl = IPL_NONE

So that might not work.
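
Here is a small userland model of that behaviour, as a sketch only.
The struct and function names are hypothetical, the IPL values are
illustrative, and the real kernel bookkeeping differs in detail; the
model just records the IPL saved by the outermost spin mutex and
restores it on the last exit.

```c
#include <assert.h>

/* Illustrative values only; real IPL numbering is machine-dependent. */
enum { IPL_NONE = 0, IPL_VM = 5, IPL_SCHED = 7 };

struct cpu_model {
	int ipl;	/* current IPL */
	int mtx_count;	/* number of spin mutexes held */
	int mtx_oldspl;	/* IPL saved by the outermost spin mutex */
};

void spin_enter(struct cpu_model *ci, int mutex_ipl)
{
	int s = ci->ipl;

	if (mutex_ipl > ci->ipl)	/* raising never lowers the IPL */
		ci->ipl = mutex_ipl;
	if (ci->mtx_count++ == 0)	/* only the outermost saves it */
		ci->mtx_oldspl = s;
}

void spin_exit(struct cpu_model *ci)
{
	assert(ci->mtx_count > 0);
	if (--ci->mtx_count == 0)	/* only the last exit lowers it */
		ci->ipl = ci->mtx_oldspl;
}
```

In this model, after releasing the IPL_SCHED lock while still holding
the IPL_VM lock, the CPU is still at IPL_SCHED, so a check of the form
`current IPL <= IPL_VM' at read-section entry would fire even though
the caller holds nothing above IPL_VM.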

It would also be good, if blocking is forbidden, for smr_enter and
smr_lazy_enter to set some thread state that trips ASSERT_SLEEPABLE,
the way pserialize_read_enter/exit arranges to increment
curcpu()->ci_psz_read_depth while in the read section.  (Really I
think kpreempt_disable() ought to forbid sleeping too!)
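
Roughly what I have in mind, as a userland model rather than kernel
code: a nesting counter that the enter/exit routines maintain and that
a sleepability check consults.  The names below are made up for the
sketch, and a single global stands in for what would be per-CPU or
per-LWP state.

```c
#include <assert.h>
#include <stdbool.h>

static unsigned read_depth;	/* stand-in for per-CPU read depth */

void model_read_enter(void)
{
	read_depth++;
}

void model_read_exit(void)
{
	assert(read_depth > 0);	/* catches unbalanced exit */
	read_depth--;
}

/* What an ASSERT_SLEEPABLE-style check would test. */
bool model_sleepable(void)
{
	return read_depth == 0;
}
```

With this, any path that might sleep inside a (possibly nested) read
section is caught at the point of the sleep, not later as a
use-after-free.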

>    Memory Ordering
>      The smr_enter() function has acquire semantics via membar_sync(), and the
>      smr_exit() function has release semantics via atomic_store_release().
> 
>      The smr_lazy_enter() function has relaxed store semantics only; it relies
>      on periodic clock interrupts to serialize with other CPUs.  The
>      smr_lazy_exit() function has release semantics via membar_exit().

Note that membar_exit and membar_enter are deprecated since netbsd-10:

https://man.NetBSD.org/NetBSD-10.x-BRANCH/membar_ops.3#DEPRECATED%20MEMORY%20BARRIERS

So new code should never use them.

Instead of membar_exit, you should just write membar_release (they are
aliases).  Instead of membar_enter, you need to figure out what you
really mean, since the history of documentation and implementation got
muddled, unfortunately!

What does it mean for (say) smr_enter to have acquire semantics?

Generally, it's not enough to say how a barrier-type operation is
related to other memory operations in the same thread.  There are
always _two_ barrier-type operations relating _four_ memory operations
in _two_ different threads:

        thread A                        thread B
        --------                        --------
     1. x[i].initialized = true;
        membar_release();
     2. *ptr = i;
     3.                                 if ((j = *ptr) == -1) goto fail;
                                        membar_acquire();
     4.                                 assert(x[j].initialized);

Or, instead of barriers, there may be synchronized memory operations:

     1. x[i].initialized = true;
     2. atomic_store_release(ptr, i);
     3.                                 if ((j = atomic_load_acquire(ptr))
                                            == -1)
                                                goto fail;
     4.                                 assert(x[j].initialized);

The relation is that _if_ operation (2) synchronizes with operation
(3) -- that is, if load (3) observes the effect of store (2) -- _then_
operation (1) happens-before operation (4) (and in this case, the
assertion passes).
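
For anyone who wants to experiment outside the kernel, the same
pattern can be written with C11 <stdatomic.h>.  This is a userland
analogue, not the kernel atomic_store_release/atomic_load_acquire
macros; publish() and consume() are names I made up, and they are
meant to be run in two different threads.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NSLOTS 4

struct slot {
	bool initialized;
};

static struct slot x[NSLOTS];
static _Atomic int ptr = -1;	/* -1 means nothing published yet */

/* Thread A: initialize the slot, then publish its index. */
void publish(int i)
{
	x[i].initialized = true;				/* (1) */
	atomic_store_explicit(&ptr, i, memory_order_release);	/* (2) */
}

/* Thread B: returns the published index, or -1 if there is none. */
int consume(void)
{
	int j = atomic_load_explicit(&ptr, memory_order_acquire); /* (3) */

	if (j == -1)
		return -1;
	assert(x[j].initialized);				/* (4) */
	return j;
}
```

If the load at (3) returns the value stored at (2), the assertion at
(4) cannot fail; if both were memory_order_relaxed, nothing would
forbid (4) from seeing the slot uninitialized on a weakly ordered
machine.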

So what are the four operations that get synchronized in each of the
smr (lazy) enter/exit routines?

Two of them are obvious to me: smr_exit and smr_lazy_exit must have
release semantics with respect to a concurrent smr_synchronize, so the
following happens-before relation can be proven:

        thread A                        thread B
        --------                        --------
     1. dostuff(inpcb->inp_route);
     2. smr_exit(smr);
     3.                                 smr_synchronize(smr);
     4.                                 free(inpcb) [inside pool_cache guts]

We obviously need to ensure that if (2) is observed by (3), then (1)
happens-before (4) -- that's the point of the whole exercise!

But it's not obvious to me from the man page what the other
relations you have in mind are when you say:

   i. The smr_enter() function has acquire semantics
  ii. smr_advance() issues a release barrier before advancing.
 iii. smr_poll() issues an acquire barrier before returning.

or the somewhat cryptic sentence which seems to contradict those:

     The smr_advance(), smr_poll(), smr_wait(), and smr_synchronize()
     functions should not be assumed to have any guarantees with
     respect to memory ordering beyond what is documented in the
     source.

>               SMR_LAZY      Enable lazy (tick-based) write sequence
>                             advancement.  The write sequence advances at the
>                             rate of the system clock (typically 100-1000 Hz)
>                             rather than on every call to smr_advance().  This
>                             reduces write-side overhead at the cost of
>                             increased reclamation latency (bounded by 2 clock
>                             ticks).  The read-side entry (smr_lazy_enter())
>                             does not issue a full memory barrier, relying on
>                             clock interrupts to serialize store buffers.

Why does this reduce write-side overhead?  I assume this is about the
_reclamation_ part of the write side, not the _publishing_ part of the
write side, right?

Can you phrase this without reference to micro-architectural hardware
implementation details like `serializing store buffers'?

> EXAMPLES

The example here is helpful and it's nice that it's essentially a
drop-in replacement for

/* reader */
pserialize_read_enter
...
pserialize_read_exit

/* writer (reclaimer) */
pserialize_perform
pool_cache_put

where the blocking pserialize_perform operation is replaced instead by
asynchronous logic inside pool_cache_put once it has been configured
with an SMR.

That said, it would also be nice to understand how smr_advance,
smr_poll, smr_wait, and smr_synchronize are meant to be used.  Can you
write some examples illustrating that too?

I would also be curious to know how the batches are chosen, and
whether there are some knobs to control the growth of memory vs the
latency of new allocations.  But that can come another day.

