tech-net: Problems with PF_KEY SADB

Subject: Problems with PF_KEY SADB_DUMP
To: None <tech-net@netbsd.org>
From: Jonathan Stone <jonathan@DSG.Stanford.EDU>
List: tech-net
Date: 09/19/2003 15:46:17
Here's a summary of the current status on PF_KEY problems with
SADB_DUMP of a modest-to-large SA database (at least as I see it):


* There is a consensus that NetBSD needs a correct, reliable, robust
  interface to PF_KEY; and that a kernfs-based approach (as kernfs
  is strictly optional in NetBSD) is by definition not a suitable API.
  (Bill Studenmund disagrees; Bill would like to make kernfs more standard.
  Bill has been heard, but for now that's a different issue).

* The PF_KEY API defines SADB_DUMP so that the app sends one
  SADB_DUMP message, to which the kernel responds with multiple SADB_DUMP
  responses. Each response has one SA. Thus, SADB_DUMP cannot be reworked
  to use Matt Thomas's suggestion (do the uiomove() directly) without
  changing the userspace API.

* There is a genuine bug in the KAME PF_KEY, which has also been
  faithfully copied in fast-ipsec (NetBSD and FreeBSD): if a process
  requests an SADB_DUMP and the kernel fills the requester's so_rcv queue,
  KAME fails to place an error indication in the last-delivered packet.
  (That's why racoon hangs in sbwait(): it is waiting to read another
  SADB_DUMP message.)

  KAME setkey has a kludge to avoid the bug: it does a setsockopt()
  with SO_RCVTIMEO, and in the loop to read subsequent SADB_DUMP responses,
  setkey interprets a subsequent EAGAIN as a sign to abort the loop.
  IMNSO, that's not up to the standards to which NetBSD code aspires.

  A more correct fix is to have the sendup code check whether additional
  SADB_DUMP messages are required; if more are required, and there
  isn't space for at least one more (in addition to the current
  message), then set sadb_msg_errno to (e.g.) ENOBUFS, to indicate that
  the SADB_DUMP responses are truncated at that message.
  
* A major reason we run into this is the very small size of the
  SADB_DUMP responses.  They leave about 70% of each mbuf empty. The
  nett result is that the requesting PF_KEY socket is hitting its
  sb_mbmax limit while sb_cc is still only at 70k or thereabouts (with
  the sb_hiwat limit at 256k).

  Thus, increasing the receive queue via setsockopt(..., SO_RCVBUF, ...)
  *on its own* doesn't help one iota (exactly as I reported to Itojun):
  SO_RCVBUF does an sbreserve(), and sbreserve() clips the socket queue's
  sb_mbmax at sb_max (NetBSD sysctl kern.sbmax).

  To increase the number of SAs that can be returned, you have to bump
  sb_max: and bump it to values way beyond what I consider reasonable
  for general-purpose use. (Setting sb_max to 1024*1024 is still on the 
  low side for the applications I want.)

* I have verified that bumping both sb_max *and* the per-socket receive
  queue does indeed increase the number of SAs the kernel can return,
  on both a week-old NetBSD fast-ipsec and on FreeBSD 4.x fast-ipsec.
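
The truncation fix proposed in the third item above can be sketched as a
small userspace model. This is not kernel code: struct model_sockbuf,
MODEL_MSG_COST, model_space_for() and model_dump() are hypothetical names
invented for illustration, and the fixed per-response storage charge is an
assumption. It only shows the rule "if more responses are needed and there
is no room for one beyond the current message, mark the current message
with ENOBUFS and stop".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of a receive queue; not the kernel's struct sockbuf. */
struct model_sockbuf {
    size_t mbcnt;   /* storage currently charged to the queue (cf. sb_mbcnt) */
    size_t mbmax;   /* limit on charged storage (cf. sb_mbmax) */
};

/* Assumed storage charged per queued response; illustrative value only. */
#define MODEL_MSG_COST 256

/* Is there room in the queue for n more responses? */
static int
model_space_for(const struct model_sockbuf *sb, size_t n)
{
    assert(sb->mbcnt <= sb->mbmax);
    return sb->mbmax - sb->mbcnt >= n * MODEL_MSG_COST;
}

/*
 * Queue one response per SA for nsa SAs.  Returns the number of responses
 * queued; *last_errno receives the error field (cf. sadb_msg_errno) of the
 * last response queued: 0 if the dump completed, ENOBUFS if it was
 * truncated at that response.
 */
static int
model_dump(struct model_sockbuf *sb, int nsa, int *last_errno)
{
    int sent = 0;

    *last_errno = 0;
    for (int i = 0; i < nsa; i++) {
        int more = (i + 1 < nsa);
        int err = 0;

        if (!model_space_for(sb, 1))
            break;              /* no room even for this response */
        /* More SAs follow, but only this response fits: flag it. */
        if (more && !model_space_for(sb, 2))
            err = ENOBUFS;

        sb->mbcnt += MODEL_MSG_COST;
        sent++;
        *last_errno = err;
        if (err != 0)
            break;              /* reader sees ENOBUFS and stops waiting */
    }
    return sent;
}
```

In this model, a queue with mbmax of 1024 asked to dump 10 SAs gets four
responses, the fourth carrying ENOBUFS, so a reader has an explicit
end-of-dump signal instead of blocking for a fifth message.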

To paraphrase another developer's private email: we may have to do
some papering-over here, but I'm not yet sure whether we paper over
the implementation, or get a ladder big enough to start papering over the spec.

Packing the SAs more densely into the socket queue would have the most
immediate pay-back (if we can do that without breaking the api?).  I'm
wondering if the long-term fix is to add an ioctl()-style API, where
we can return an atomic snapshot of the SADB, up to whatever size the
userland process has address-space for.
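
To put rough numbers on the packing argument, here is a back-of-envelope
model of the two socket-queue limits. It is a sketch under stated
assumptions, not the kernel's accounting code: struct model_fill and
model_fill_queue() are hypothetical names, and the per-response storage
charge and data length passed in are illustrative guesses chosen to match
the "about 70% empty" observation above.

```c
#include <stddef.h>

/* Result of filling a modelled receive queue until a limit is reached. */
struct model_fill {
    size_t msgs;    /* responses queued before a limit is hit */
    size_t cc;      /* data bytes queued at that point (cf. sb_cc) */
};

/*
 * hiwat:   limit on queued data bytes (cf. sb_hiwat)
 * mbmax:   limit on charged storage (cf. sb_mbmax)
 * msg_len: data bytes per response (assumed)
 * charge:  storage charged per response (assumed)
 */
static struct model_fill
model_fill_queue(size_t hiwat, size_t mbmax, size_t msg_len, size_t charge)
{
    struct model_fill f = { 0, 0 };
    size_t by_data, by_storage;

    if (msg_len == 0 || charge == 0)
        return f;

    by_data = hiwat / msg_len;      /* responses allowed by the data limit */
    by_storage = mbmax / charge;    /* responses allowed by the storage limit */

    /* Whichever limit is reached first stops the dump. */
    f.msgs = by_data < by_storage ? by_data : by_storage;
    f.cc = f.msgs * msg_len;
    return f;
}
```

Assuming both limits are 256*1024, a charge of 256 bytes per response and
77 data bytes in each (about 30% of the charge), the model stops at 1024
responses with cc at 78848, the same region as the figure reported above.
If responses could be packed so that the charge equalled the data length,
the same limits would admit 3404 responses.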

That's where it's at. Where do we go from here?