Subject: cloned route handling
To: None <tech-net@netbsd.org>
From: Jun-ichiro itojun Hagino <itojun@iijlab.net>
List: tech-net
Date: 01/27/2001 03:29:32
	There are a couple of open PRs regarding cloned routes.
	I'm trying to solve them, but there are way too many design
	choices, so I'll try to summarize the current situation first.
	Sorry for the low readability.

	In this email, I'll use the term "clone parent" for the source of cloned
	routes (interface routes like 10.0.0.0/24), and "clone child" for
	the cloned routes (10.0.0.1/32 generated by ARP or other things).

	See the summary at the end for differences among *BSD.

	Here are the proposed changes, numbered as in the summary: (1) and
	(2) should be perfectly adequate changes.  (3) needs some debate,
	I'm not sure at this moment.  Related to (4), we may want to add
	DoS prevention code for redirect floods.

	Reasons behind the proposal:
	- Identification of clone children, as stated in (1).  It seems to me
	  we don't need an additional flag for marking clone children.  We
	  just need to check rtentry.rt_parent.
	- I believe the current NetBSD behavior for (2) is rather confusing.  I
	  believe ARP entries had better go away when the interface address goes
	  away (if really necessary, we can issue ARP again to re-resolve).
	- If we introduce behavior (3), PR11916 will be solved.  Not sure if
	  it is okay to do this.  It seems to me that (3) has to be decided by
	  each AF (for example, it should be safe to allow overwriting an
	  incomplete ARP entry for netinet).  Not sure how it should be done.
	- Inactivity timer implementation like (4).  I believe the current
	  NetBSD situation is fine.  We have code to phase out redirected
	  routes, and also have a high/low watermark on the # of redirected
	  routes, to avoid remote (onlink) DoS.
	- I don't like behavior (5) of FreeBSD and BSD/OS; local DoS is not nice.
	  It should be okay to clone when ICMP "fragmentation needed" messages
	  come.  We also have good validation code against ICMP "fragmentation
	  needed" packets, as well as a high/low watermark for PMTUD routes.
	- (6): not really necessary, it seems to me.
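
	To make (1) and (2) concrete, here is a minimal userland sketch in C.
	It is only a model under stated assumptions: struct rt_model, the flat
	rt_next list, rt_is_clone_child() and rt_purge_children() are all
	hypothetical names for illustration, and the real kernel keeps routes
	in a radix tree, not a list.

```c
#include <stddef.h>

/* Simplified stand-in for struct rtentry; NOT the kernel definition. */
struct rt_model {
	struct rt_model *rt_parent;	/* clone parent, NULL if not a clone child */
	struct rt_model *rt_next;	/* hypothetical flat list of all routes */
	int rt_deleted;			/* set once the entry has been purged */
};

/* (1): a route is treated as a clone child iff it records a parent. */
int
rt_is_clone_child(const struct rt_model *rt)
{
	return rt->rt_parent != NULL;
}

/*
 * (2): when a clone parent goes away, mark every child of it deleted.
 * Returns the number of children purged by this call.
 */
int
rt_purge_children(struct rt_model *head, const struct rt_model *parent)
{
	int n = 0;
	struct rt_model *rt;

	for (rt = head; rt != NULL; rt = rt->rt_next) {
		if (rt->rt_parent == parent && !rt->rt_deleted) {
			rt->rt_deleted = 1;
			n++;
		}
	}
	return n;
}
```

	The point of the sketch is that no extra flag bit is consulted: the
	parent pointer alone answers both "is this a clone child?" and "whose
	child is it?".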

	Also, on interface address removal, we will want to make sure rt_ifa
	does not point to an obsolete interface address.

itojun



Among the 4 BSDs, there are certain differences in cloned route handling:
- The NetBSD and OpenBSD routing code is very close to the 4.4BSD code.
  (1) It only has RTF_CLONING for the clone parent, and does not mark clone
  children.
  (2) Even when the clone parent gets removed (by "ifconfig -alias"), clone
  children stay there (removing them is hard since we do not mark
  clone children).
  (3) Suppose we have an incomplete ARP entry for 10.0.0.1.  If we try to
  add a host route manually for 10.0.0.1, it will fail with EEXIST.
  (4) Inactivity timer for cloned children: supplied by rt_timer_add(),
  but it only provides hooks to code outside of route.c.
  sys/netinet or sys/netinet6 can exercise detailed control over
  inactivity timer behavior (flexible, but needs customized code).
  (5) in_pcbconnect() does not ask for a cloned route.
  (6) clone API: rtalloc() clones a route if RTF_CLONING is set on the parent.
  We have no "force cloning" API.
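
The difference between the current behavior in (3) and the per-AF decision
suggested in the proposal can be sketched roughly as follows. This is a
userland model only: struct host_rt_model, the af_overwrite_ok_t hook,
inet_overwrite_ok() and rt_manual_add() are hypothetical names invented for
illustration, not existing kernel interfaces.

```c
#include <errno.h>
#include <stddef.h>

/* Simplified model of an existing host route; NOT the kernel rtentry. */
struct host_rt_model {
	const void *rt_parent;	/* non-NULL for a clone child */
	int rt_resolved;	/* e.g. ARP resolution has completed */
};

/* Hypothetical per-AF policy hook: may a manual add replace this entry? */
typedef int (*af_overwrite_ok_t)(const struct host_rt_model *);

/*
 * Assumed netinet-like policy: only an incomplete (unresolved) clone
 * child may be replaced.
 */
int
inet_overwrite_ok(const struct host_rt_model *rt)
{
	return rt->rt_parent != NULL && !rt->rt_resolved;
}

/*
 * Model of a manual host route add.  Returns 0 on success, or EEXIST when
 * an entry is present and the policy (or its absence) forbids replacing it.
 */
int
rt_manual_add(const struct host_rt_model *existing, af_overwrite_ok_t ok)
{
	if (existing == NULL)
		return 0;		/* nothing in the way */
	if (ok != NULL && ok(existing))
		return 0;		/* the AF allows replacing this entry */
	return EEXIST;			/* 4.4BSD-style behavior */
}
```

Passing a NULL hook reproduces the behavior described in (3); passing
inet_overwrite_ok lets the add succeed only over an incomplete clone child.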

- BSD/OS: has some tricks, implemented in sys/net/route.c.
  (1) It has RTF_CLONING for the clone parent, and RTF_CLONED for children.
  rtentry.rt_parent refers to the parent from the child.
  (2) If clone parent route gets nuked, clone children will be removed
  automatically by sys/net/route.c.
  (3) If we try to add a route manually, it can overwrite clone child
  routes or redirected routes (RTF_CLONED | RTF_DYNAMIC).
  (4) sys/net has an inactivity timer for any cloned children (those generated
  by ARP, PMTUD and rtredirect), by default.  Each AF reuses the
  sys/net behavior.  All AFs use the same inactivity timer duration.
  (5) in_pcbconnect() asks for a cloned route.  It has a very bad side effect:
  local denial-of-service (a non-root user can fill up the kernel routing table
  and make it impossible for the node to use the network, just by running tons
  of sendto(2)).  The good points are that we can collect PMTUD results more
  easily, inpcb.inp_route has a host route so lookup is simpler, and we can
  verify inbound ICMP "fragmentation needed" messages by looking at the routing
  table (if we have a clone child, we are sure that we have contacted the
  destination in the past).
  (6) clone API: rtcalloc() always clones a route, regardless of the
  RTF_CLONING bit on the parent.  rtalloc() clones a route if the parent has
  RTF_CLONING set.  The last arg of rtalloc1() is overloaded to mean multiple
  things (I find it hard to read).
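
As a rough illustration of how a high/low watermark (as mentioned earlier
for redirected and PMTUD routes) can bound the kind of table growth
described in (5), here is a minimal C sketch. It is an assumption-laden toy:
struct dyn_table, DYN_MAX, dyn_evict() and dyn_add() are hypothetical, and
evicting strictly oldest-first is just one possible policy, not necessarily
what any of the kernels do.

```c
#include <stddef.h>

#define DYN_MAX 8	/* capacity of this toy table, chosen arbitrarily */

/* Toy table of dynamically created routes, kept oldest first. */
struct dyn_table {
	unsigned long created[DYN_MAX];	/* creation stamps, ascending */
	size_t count;			/* entries currently in use */
	size_t hiwat;			/* evict once count reaches this */
	size_t lowat;			/* evict down to this many entries */
};

/* Drop the oldest entries until only lowat remain; returns number dropped. */
size_t
dyn_evict(struct dyn_table *t)
{
	size_t drop, i;

	if (t->count <= t->lowat)
		return 0;
	drop = t->count - t->lowat;
	for (i = 0; i + drop < t->count; i++)
		t->created[i] = t->created[i + drop];
	t->count -= drop;
	return drop;
}

/* Add one dynamic route; evicts first if the high watermark is reached. */
int
dyn_add(struct dyn_table *t, unsigned long stamp)
{
	if (t->count >= t->hiwat)
		(void)dyn_evict(t);
	if (t->count >= DYN_MAX)
		return -1;	/* table still full after eviction */
	t->created[t->count++] = stamp;
	return 0;
}
```

With such a cap, a flood of new entries costs older dynamic entries their
place, but cannot grow the table without bound.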

- FreeBSD: has some tricks, implemented mostly in sys/netinet/in_rmx.c.
  (1) It has RTF_CLONING and RTF_PRCLONING for the clone parent,
  and RTF_WASCLONED for children.  RTF_PRCLONING means "the protocol asks for
  cloning", but why is it separate from RTF_CLONING?  Not sure...
  rtentry.rt_parent refers to the parent from the child.
  (2) If clone parent route gets nuked, clone children will be removed
  automatically by sys/net/route.c.
  (3) If we try to add a route manually, it can overwrite clone child
  routes.
  (4) sys/netinet/in_rmx.c has inactivity timer for cloned children
  generated by ARP or PMTUD (no timer for redirected routes).
  (5) in_pcbconnect() asks for a cloned route.  It has the same bad side
  effect and some good points, just like BSD/OS.
  (6) clone API: rtalloc1() has an additional argument, "ignflags".
  rtalloc() and rtalloc_ign() are supplied.  With rtalloc_ign() the caller
  can ask it to ignore certain cloning flag bits (RTF_CLONING or RTF_PRCLONING).
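
The "ignflags" idea in (6) boils down to masking out cloning bits before
deciding whether to clone. A minimal sketch of that decision follows,
assuming placeholder flag values: M_RTF_CLONING, M_RTF_PRCLONING and
rt_should_clone() are hypothetical names, and the values are not taken
from any header.

```c
/* Placeholder flag values for this sketch, NOT copied from any header. */
#define M_RTF_CLONING	0x1UL	/* interface-style clone parent */
#define M_RTF_PRCLONING	0x2UL	/* protocol-requested cloning */

/*
 * Decide whether a lookup that matched a route with parent_flags should
 * clone a child, given the cloning bits the caller wants ignored.
 */
int
rt_should_clone(unsigned long parent_flags, unsigned long ignflags)
{
	unsigned long cloning = M_RTF_CLONING | M_RTF_PRCLONING;

	return (parent_flags & cloning & ~ignflags) != 0;
}
```

In this model an ignflags of 0 behaves like a plain lookup, while passing
M_RTF_PRCLONING suppresses only protocol-requested cloning and leaves
interface-style cloning alone.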