Subject: FAST_IPsec policy refcnt: "refcount" or "TTL", but not both
To: None <tech-net@netbsd.org>
From: Jonathan Stone <jonathan@dsg.stanford.edu>
List: tech-net
Date: 05/17/2004 14:58:40
I recently ran into a bug, in my own private tree, which looks like a
bug in the SPD refcount handling in FAST_IPSEC: specifically, an
incompatibility between the per-PCB SPD cache (which needs a real
refcnt) and the derived-from-older-KAME key.c:key_timehandler(), which
treats the refcnt field as a TTL.

The symptoms in my own (mutant) tree are, broadly, that if one quickly
deletes and adds policy rules, then one quickly triggers panics where
SPD entry objects are double-freed, or are modified on the freelist.

After looking at the per-PCB SPD cache code, I believe the same
underlying problem exists for both the in-tree sys/netipsec and
sys/netkey implementations.  Briefly, the problem is that the per-PCB
cache is updated lazily; PCBs cache-slots can still hold entries to
SPD objects after the object is marked `dead' (i.e., sp->state ==
IPSEC_SPSTATE_DEAD).  Yet key_timehandler() treats the refcnt field of
`dead' policy entries as a TTL (calling  .  Doing both of those
at the same time cannot work correctly.

Suppose a policy is administratively deleted and thus marked as
`dead'. When the PCB attempts to use its cached state and compares it
against the global generation number, the PCB-cache code will notice
the generation number in the PCB cache is out-of-date, and will
rebind. However, in the meantime, key_timehandler() decrements the
refcount of all `dead' SPD objects each time it runs.  Eventually
those refcounts hit zero, and the object is freed.  If the per-PCB
cache touches the SPD entry after that, all bets are off.  (If the
object is on the free list, the freelist entry gets the refcount field
decremented; if the object has been reallocated, the PCB code smashes
the refcount field of the new object).

I'm not submitting a PR for this unless and until I can reproduce it
in a stock unmodified tree, which may take some time -- I'm trying to
fix the bug in my own tree.  But I beleive this bug is also present in
the 2.0 branch.  Meanwhile: does anyone see any flaws in my analysis?

The obvious approach to fixing this is to add a new state, halfway
between DEAD and ALIVE, in which the kernel lazily collects the
per-PCB-cache refcounts; possibly with a timeout to walk the PCB list
occasionally, flushing out references from long-inactive PCBs.

KAME's sys/netkey/key.c appears to implement this new state by
splitting key_freesp() into separate delete/unlink/free operations.
key_timehandler() then uses key_sp_unlink to move `dead' SPD entries
from the list of known SPs; further pcb-cache references are removed
via key_freesp(), which eventually frees the object once the refcnt
hits zero.

Any objections to backporting that logic to sys/netipsec, verifying it
fixes my problem, then requesting a pullup for 2.0? we can add an
explicit list of delete-pending objects later, if need be.

(Reflection on the wisdom of (mis)using a refcnt field as a ttl in the
first place left as an exercise for the reader.)