Source-Changes-HG archive


[src/trunk]: src Reduces the resources demanded by TCP sessions in TIME_WAIT-...



details:   https://anonhg.NetBSD.org/src/rev/c231e1897ffd
branches:  trunk
changeset: 764776:c231e1897ffd
user:      dyoung <dyoung%NetBSD.org@localhost>
date:      Tue May 03 18:28:44 2011 +0000

description:
Reduces the resources demanded by TCP sessions in the TIME_WAIT state
using two methods: Vestigial Time-Wait (VTW) and Maximum Segment
Lifetime Truncation (MSLT).

MSLT and VTW were contributed by Coyote Point Systems, Inc.

Even after a TCP session enters the TIME_WAIT state, its corresponding
socket and protocol control blocks (PCBs) remain allocated until the TCP
Maximum Segment Lifetime (MSL) expires.  On a host whose workload
necessarily creates and closes many TCP sockets, the sockets and PCBs
for TCP sessions in TIME_WAIT state amount to many megabytes of dead
weight in RAM.

Maximum Segment Lifetime Truncation (MSLT) assigns each TCP session to
a class based on the nearness of the peer.  Corresponding to each class
is an MSL, and a session uses the MSL of its class.  The classes are
loopback (local host equals remote host), local (local host and remote
host are on the same link/subnet), and remote (local host and remote
host communicate via one or more gateways).  Classes corresponding to
nearer peers have lower MSLs by default: 2 seconds for loopback, 10
seconds for local, 60 seconds for remote.  Loopback and local sessions
expire more quickly when MSLT is used.
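
The class selection described above can be sketched roughly as follows.
This is only an illustration, not the code from this change: the names
mslt_class, mslt_classify and mslt_msl are hypothetical, the sketch
handles IPv4 addresses in host byte order only, and it assumes the
caller already knows the netmask of the local interface.  The default
MSL values are the ones given in the description.

```c
#include <stdint.h>

/* Hypothetical names for illustration; not taken from tcp_vtw.c. */
enum mslt_class { MSLT_LOOPBACK, MSLT_LOCAL, MSLT_REMOTE };

/* Default MSL per class in seconds, as stated in the description. */
static const unsigned mslt_default_msl[] = {
	[MSLT_LOOPBACK] = 2,
	[MSLT_LOCAL]    = 10,
	[MSLT_REMOTE]   = 60,
};

/*
 * Classify a session by the nearness of its peer.
 * Assumption: "same link/subnet" is approximated by comparing the
 * network parts of both addresses under the caller-supplied netmask.
 */
enum mslt_class
mslt_classify(uint32_t laddr, uint32_t faddr, uint32_t netmask)
{
	if (laddr == faddr)
		return MSLT_LOOPBACK;	/* local host equals remote host */
	if ((laddr & netmask) == (faddr & netmask))
		return MSLT_LOCAL;	/* same subnet */
	return MSLT_REMOTE;		/* reached via one or more gateways */
}

/* A session uses the MSL of its class. */
unsigned
mslt_msl(uint32_t laddr, uint32_t faddr, uint32_t netmask)
{
	return mslt_default_msl[mslt_classify(laddr, faddr, netmask)];
}
```

For example, with a /24 netmask, a session between 192.168.1.5 and
192.168.1.9 would fall into the local class and use the 10-second MSL.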

Vestigial Time-Wait (VTW) replaces a TIME_WAIT session's PCB/socket
dead weight with a compact representation of the session, called a
"vestigial PCB".  VTW data structures are designed to be very fast and
memory-efficient: for fast insertion and lookup of vestigial PCBs,
the PCBs are stored in a hash table that is designed to minimize the
number of cacheline visits per lookup/insertion.  The memory for both
vestigial PCBs and elements of the PCB hash table comes from
fixed-size pools, and linked data structures exploit this to conserve
memory by representing references with a narrow index/offset from the
start of a pool instead of a pointer.  When space for new vestigial PCBs
runs out, VTW makes room by discarding old vestigial PCBs, oldest first.
VTW cooperates with MSLT.
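
A minimal sketch of the pool-and-index idea, under stated assumptions:
every name below (vp_entry, vp_table, vp_insert, and so on) is
hypothetical, the pool is deliberately tiny, and the layout is not the
one used in tcp_vtw.c.  It shows hash chains linked by 16-bit pool
indices instead of pointers, and oldest-first recycling when the
fixed-size pool is full.  It does not model MSL-based expiry.

```c
#include <stdint.h>
#include <string.h>

#define VP_POOL_SIZE 4		/* tiny, for illustration only */
#define VP_NBUCKETS  8
#define VP_NIL       UINT16_MAX	/* "no entry" marker for index links */

struct vp_entry {
	uint32_t faddr, laddr;
	uint16_t fport, lport;
	uint16_t next;		/* pool index of next chain entry */
	uint8_t  in_use;
};

struct vp_table {
	struct vp_entry pool[VP_POOL_SIZE];
	uint16_t bucket[VP_NBUCKETS];	/* chain heads, as pool indices */
	uint16_t oldest;		/* next slot to fill or recycle */
};

void
vp_init(struct vp_table *t)
{
	memset(t, 0, sizeof(*t));
	for (unsigned i = 0; i < VP_NBUCKETS; i++)
		t->bucket[i] = VP_NIL;
}

static unsigned
vp_hash(uint32_t faddr, uint16_t fport, uint32_t laddr, uint16_t lport)
{
	return (faddr ^ laddr ^ fport ^ ((uint32_t)lport << 16)) % VP_NBUCKETS;
}

/* Remove the entry at pool index idx from its hash chain. */
static void
vp_unlink(struct vp_table *t, uint16_t idx)
{
	struct vp_entry *e = &t->pool[idx];
	uint16_t *link =
	    &t->bucket[vp_hash(e->faddr, e->fport, e->laddr, e->lport)];

	while (*link != VP_NIL && *link != idx)
		link = &t->pool[*link].next;
	if (*link == idx)
		*link = e->next;
	e->in_use = 0;
}

/* Insert a session; slots are reused in ring order, so the oldest goes first. */
void
vp_insert(struct vp_table *t, uint32_t faddr, uint16_t fport,
    uint32_t laddr, uint16_t lport)
{
	uint16_t idx = t->oldest;
	struct vp_entry *e = &t->pool[idx];
	unsigned h = vp_hash(faddr, fport, laddr, lport);

	if (e->in_use)
		vp_unlink(t, idx);	/* pool full: discard the oldest */
	e->faddr = faddr; e->fport = fport;
	e->laddr = laddr; e->lport = lport;
	e->next = t->bucket[h];
	t->bucket[h] = idx;
	e->in_use = 1;
	t->oldest = (uint16_t)((idx + 1) % VP_POOL_SIZE);
}

/* Return 1 if the 4-tuple is present, 0 otherwise. */
int
vp_lookup(const struct vp_table *t, uint32_t faddr, uint16_t fport,
    uint32_t laddr, uint16_t lport)
{
	uint16_t idx = t->bucket[vp_hash(faddr, fport, laddr, lport)];

	for (; idx != VP_NIL; idx = t->pool[idx].next) {
		const struct vp_entry *e = &t->pool[idx];
		if (e->faddr == faddr && e->fport == fport &&
		    e->laddr == laddr && e->lport == lport)
			return 1;
	}
	return 0;
}
```

With 16-bit links, each reference costs two bytes rather than the four
or eight a pointer would need, which is the saving the paragraph above
refers to.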

It may help to think of VTW as a "FIN cache" by analogy to the SYN
cache.

A 2.8-GHz Pentium 4 running a test workload that creates TIME_WAIT
sessions as fast as it can is approximately 17% idle when VTW is active
versus 0% idle when VTW is inactive.  It has 103 megabytes more free RAM
when VTW is active (approximately 64k vestigial PCBs are created) than
when it is inactive.

diffstat:

 distrib/sets/lists/comp/mi               |     3 +-
 sys/dist/pf/net/pf.c                     |     8 +-
 sys/netinet/Makefile                     |     5 +-
 sys/netinet/files.netinet                |     3 +-
 sys/netinet/in_pcb.c                     |   108 +-
 sys/netinet/in_pcb.h                     |     7 +-
 sys/netinet/in_pcb_hdr.h                 |    25 +-
 sys/netinet/tcp_input.c                  |   333 +++-
 sys/netinet/tcp_subr.c                   |    72 +-
 sys/netinet/tcp_usrreq.c                 |    81 +-
 sys/netinet/tcp_var.h                    |    14 +-
 sys/netinet/tcp_vtw.c                    |  2425 ++++++++++++++++++++++++++++++
 sys/netinet/tcp_vtw.h                    |   420 +++++
 sys/netinet/udp_usrreq.c                 |     9 +-
 sys/netinet6/in6_pcb.c                   |    94 +-
 sys/netinet6/in6_pcb.h                   |     9 +-
 sys/netinet6/in6_src.c                   |    14 +-
 sys/netinet6/ip6_input.c                 |     6 +-
 sys/netinet6/raw_ip6.c                   |    10 +-
 sys/netinet6/udp6_usrreq.c               |     6 +-
 sys/rump/net/lib/libnetinet/Makefile.inc |     6 +-
 usr.bin/netstat/Makefile                 |     4 +-
 usr.bin/netstat/inet.c                   |    85 +-
 usr.bin/netstat/inet6.c                  |    98 +-
 usr.bin/netstat/main.c                   |    55 +-
 usr.bin/netstat/netstat.h                |     6 +-
 usr.bin/netstat/vtw.c                    |   431 +++++
 usr.bin/netstat/vtw.h                    |     8 +
 28 files changed, 4200 insertions(+), 145 deletions(-)

diffs (truncated from 5565 to 300 lines):

diff -r ddd2e9439de6 -r c231e1897ffd distrib/sets/lists/comp/mi
--- a/distrib/sets/lists/comp/mi        Tue May 03 17:44:30 2011 +0000
+++ b/distrib/sets/lists/comp/mi        Tue May 03 18:28:44 2011 +0000
@@ -1,4 +1,4 @@
-#      $NetBSD: mi,v 1.1619 2011/04/20 18:55:53 haad Exp $
+#      $NetBSD: mi,v 1.1620 2011/05/03 18:28:44 dyoung Exp $
 #
 # Note: don't delete entries from here - mark them as "obsolete" instead.
 #
@@ -1614,6 +1614,7 @@
 ./usr/include/netinet/tcp_seq.h                        comp-c-include
 ./usr/include/netinet/tcp_timer.h              comp-c-include
 ./usr/include/netinet/tcp_var.h                        comp-c-include
+./usr/include/netinet/tcp_vtw.h                        comp-c-include
 ./usr/include/netinet/tcpip.h                  comp-c-include
 ./usr/include/netinet/udp.h                    comp-c-include
 ./usr/include/netinet/udp_var.h                        comp-c-include
diff -r ddd2e9439de6 -r c231e1897ffd sys/dist/pf/net/pf.c
--- a/sys/dist/pf/net/pf.c      Tue May 03 17:44:30 2011 +0000
+++ b/sys/dist/pf/net/pf.c      Tue May 03 18:28:44 2011 +0000
@@ -1,4 +1,4 @@
-/*     $NetBSD: pf.c,v 1.64 2010/05/07 17:41:57 degroote Exp $ */
+/*     $NetBSD: pf.c,v 1.65 2011/05/03 18:28:45 dyoung Exp $   */
 /*     $OpenBSD: pf.c,v 1.552.2.1 2007/11/27 16:37:57 henning Exp $ */
 
 /*
@@ -37,7 +37,7 @@
  */
 
 #include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: pf.c,v 1.64 2010/05/07 17:41:57 degroote Exp $");
+__KERNEL_RCSID(0, "$NetBSD: pf.c,v 1.65 2011/05/03 18:28:45 dyoung Exp $");
 
 #include "pflog.h"
 
@@ -2798,9 +2798,9 @@
 
 #ifdef __NetBSD__
 #define in_pcbhashlookup(tbl, saddr, sport, daddr, dport) \
-    in_pcblookup_connect(tbl, saddr, sport, daddr, dport)
+    in_pcblookup_connect(tbl, saddr, sport, daddr, dport, NULL)
 #define in6_pcbhashlookup(tbl, saddr, sport, daddr, dport) \
-    in6_pcblookup_connect(tbl, saddr, sport, daddr, dport, 0)
+    in6_pcblookup_connect(tbl, saddr, sport, daddr, dport, 0, NULL)
 #define in_pcblookup_listen(tbl, addr, port, zero) \
     in_pcblookup_bind(tbl, addr, port)
 #define in6_pcblookup_listen(tbl, addr, port, zero) \
diff -r ddd2e9439de6 -r c231e1897ffd sys/netinet/Makefile
--- a/sys/netinet/Makefile      Tue May 03 17:44:30 2011 +0000
+++ b/sys/netinet/Makefile      Tue May 03 18:28:44 2011 +0000
@@ -1,4 +1,4 @@
-#      $NetBSD: Makefile,v 1.19 2007/10/05 03:28:13 dyoung Exp $
+#      $NetBSD: Makefile,v 1.20 2011/05/03 18:28:45 dyoung Exp $
 
 INCSDIR= /usr/include/netinet
 
@@ -8,7 +8,8 @@
        in_var.h ip.h ip_carp.h ip6.h ip_ecn.h ip_encap.h \
        ip_icmp.h ip_mroute.h ip_var.h pim.h pim_var.h \
        tcp.h tcp_debug.h tcp_fsm.h tcp_seq.h tcp_timer.h tcp_var.h \
-       tcpip.h udp.h udp_var.h
+       tcpip.h udp.h udp_var.h \
+       tcp_vtw.h
 
 # ipfilter headers
 # XXX shouldn't be here
diff -r ddd2e9439de6 -r c231e1897ffd sys/netinet/files.netinet
--- a/sys/netinet/files.netinet Tue May 03 17:44:30 2011 +0000
+++ b/sys/netinet/files.netinet Tue May 03 18:28:44 2011 +0000
@@ -1,4 +1,4 @@
-#      $NetBSD: files.netinet,v 1.21 2010/07/13 22:16:10 rmind Exp $
+#      $NetBSD: files.netinet,v 1.22 2011/05/03 18:28:45 dyoung Exp $
 
 defflag opt_tcp_debug.h                TCP_DEBUG
 defparam opt_tcp_debug.h       TCP_NDEBUG
@@ -40,5 +40,6 @@
 file   netinet/tcp_timer.c     inet | inet6
 file   netinet/tcp_usrreq.c    inet | inet6
 file   netinet/tcp_congctl.c   inet | inet6
+file   netinet/tcp_vtw.c       inet | inet6
 
 file   netinet/udp_usrreq.c    inet | inet6
diff -r ddd2e9439de6 -r c231e1897ffd sys/netinet/in_pcb.c
--- a/sys/netinet/in_pcb.c      Tue May 03 17:44:30 2011 +0000
+++ b/sys/netinet/in_pcb.c      Tue May 03 18:28:44 2011 +0000
@@ -1,4 +1,4 @@
-/*     $NetBSD: in_pcb.c,v 1.137 2009/05/12 22:22:46 elad Exp $        */
+/*     $NetBSD: in_pcb.c,v 1.138 2011/05/03 18:28:45 dyoung Exp $      */
 
 /*
  * Copyright (C) 1995, 1996, 1997, and 1998 WIDE Project.
@@ -30,10 +30,12 @@
  */
 
 /*-
- * Copyright (c) 1998 The NetBSD Foundation, Inc.
+ * Copyright (c) 1998, 2011 The NetBSD Foundation, Inc.
  * All rights reserved.
  *
  * This code is derived from software contributed to The NetBSD Foundation
+ * by Coyote Point Systems, Inc.
+ * This code is derived from software contributed to The NetBSD Foundation
  * by Public Access Networks Corporation ("Panix").  It was developed under
  * contract to Panix by Eric Haszlakiewicz and Thor Lancelot Simon.
  *
@@ -91,7 +93,7 @@
  */
 
 #include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: in_pcb.c,v 1.137 2009/05/12 22:22:46 elad Exp $");
+__KERNEL_RCSID(0, "$NetBSD: in_pcb.c,v 1.138 2011/05/03 18:28:45 dyoung Exp $");
 
 #include "opt_inet.h"
 #include "opt_ipsec.h"
@@ -137,6 +139,8 @@
 #include <netipsec/key.h>
 #endif /* IPSEC */
 
+#include <netinet/tcp_vtw.h>
+
 struct in_addr zeroin_addr;
 
 #define        INPCBHASH_PORT(table, lport) \
@@ -269,9 +273,12 @@
 
        lport = *lastport - 1;
        for (cnt = mymax - mymin + 1; cnt; cnt--, lport--) {
+               vestigial_inpcb_t vestigial;
+
                if (lport < mymin || lport > mymax)
                        lport = mymax;
-               if (!in_pcblookup_port(table, sin->sin_addr, htons(lport), 1)) {
+               if (!in_pcblookup_port(table, sin->sin_addr, htons(lport), 1,
+                                      &vestigial) && !vestigial.valid) {
                        /* We have a free port, check with the secmodel(s). */
                        sin->sin_port = lport;
                        error = kauth_authorize_network(cred,
@@ -347,6 +354,7 @@
                        return (error);
        } else {
                struct inpcb *t;
+               vestigial_inpcb_t vestige;
 #ifdef INET6
                struct in6pcb *t6;
                struct in6_addr mapped;
@@ -373,14 +381,19 @@
                mapped.s6_addr16[5] = 0xffff;
                memcpy(&mapped.s6_addr32[3], &sin->sin_addr,
                    sizeof(mapped.s6_addr32[3]));
-               t6 = in6_pcblookup_port(table, &mapped, sin->sin_port, wild);
+               t6 = in6_pcblookup_port(table, &mapped, sin->sin_port, wild, &vestige);
                if (t6 && (reuseport & t6->in6p_socket->so_options) == 0)
                        return (EADDRINUSE);
+               if (!t6 && vestige.valid) {
+                   if (!!reuseport != !!vestige.reuse_port) {
+                       return EADDRINUSE;
+                   }
+               }
 #endif
 
                /* XXX-kauth */
                if (so->so_uidinfo->ui_uid && !IN_MULTICAST(sin->sin_addr.s_addr)) {
-                       t = in_pcblookup_port(table, sin->sin_addr, sin->sin_port, 1);
+                       t = in_pcblookup_port(table, sin->sin_addr, sin->sin_port, 1, &vestige);
                        /*
                         * XXX: investigate ramifications of loosening this
                         *      restriction so that as long as both ports have
@@ -393,10 +406,22 @@
                            && (so->so_uidinfo->ui_uid != t->inp_socket->so_uidinfo->ui_uid)) {
                                return (EADDRINUSE);
                        }
+                       if (!t && vestige.valid) {
+                               if ((!in_nullhost(sin->sin_addr)
+                                    || !in_nullhost(vestige.laddr.v4)
+                                    || !vestige.reuse_port)
+                                   && so->so_uidinfo->ui_uid != vestige.uid) {
+                                       return EADDRINUSE;
+                               }
+                       }
                }
-               t = in_pcblookup_port(table, sin->sin_addr, sin->sin_port, wild);
+               t = in_pcblookup_port(table, sin->sin_addr, sin->sin_port, wild, &vestige);
                if (t && (reuseport & t->inp_socket->so_options) == 0)
                        return (EADDRINUSE);
+               if (!t
+                   && vestige.valid
+                   && !(reuseport && vestige.reuse_port))
+                       return EADDRINUSE;
 
                inp->inp_lport = sin->sin_port;
                in_pcbstate(inp, INP_BOUND);
@@ -464,6 +489,7 @@
        struct in_ifaddr *ia = NULL;
        struct sockaddr_in *ifaddr = NULL;
        struct sockaddr_in *sin = mtod(nam, struct sockaddr_in *);
+       vestigial_inpcb_t vestige;
        int error;
 
        if (inp->inp_af != AF_INET)
@@ -524,7 +550,8 @@
        }
        if (in_pcblookup_connect(inp->inp_table, sin->sin_addr, sin->sin_port,
            !in_nullhost(inp->inp_laddr) ? inp->inp_laddr : ifaddr->sin_addr,
-           inp->inp_lport) != 0)
+                                inp->inp_lport, &vestige) != 0
+           || vestige.valid)
                return (EADDRINUSE);
        if (in_nullhost(inp->inp_laddr)) {
                if (inp->inp_lport == 0) {
@@ -794,7 +821,7 @@
 
 struct inpcb *
 in_pcblookup_port(struct inpcbtable *table, struct in_addr laddr,
-    u_int lport_arg, int lookup_wildcard)
+                 u_int lport_arg, int lookup_wildcard, vestigial_inpcb_t *vp)
 {
        struct inpcbhead *head;
        struct inpcb_hdr *inph;
@@ -802,6 +829,9 @@
        int matchwild = 3, wildcard;
        u_int16_t lport = lport_arg;
 
+       if (vp)
+               vp->valid = 0;
+
        head = INPCBHASH_PORT(table, lport);
        LIST_FOREACH(inph, head, inph_lhash) {
                inp = (struct inpcb *)inph;
@@ -833,6 +863,54 @@
                                break;
                }
        }
+       if (match && matchwild == 0)
+               return match;
+
+       if (vp && table->vestige) {
+               void    *state = (*table->vestige->init_ports4)(laddr, lport_arg, lookup_wildcard);
+               vestigial_inpcb_t better;
+
+               while (table->vestige
+                      && (*table->vestige->next_port4)(state, vp)) {
+
+                       if (vp->lport != lport)
+                               continue;
+                       wildcard = 0;
+                       if (!in_nullhost(vp->faddr.v4))
+                               wildcard++;
+                       if (in_nullhost(vp->laddr.v4)) {
+                               if (!in_nullhost(laddr))
+                                       wildcard++;
+                       } else {
+                               if (in_nullhost(laddr))
+                                       wildcard++;
+                               else {
+                                       if (!in_hosteq(vp->laddr.v4, laddr))
+                                               continue;
+                               }
+                       }
+                       if (wildcard && !lookup_wildcard)
+                               continue;
+                       if (wildcard < matchwild) {
+                               better = *vp;
+                               match  = (void*)&better;
+
+                               matchwild = wildcard;
+                               if (matchwild == 0)
+                                       break;
+                       }
+               }
+
+               if (match) {
+                       if (match != (void*)&better)
+                               return match;
+                       else {
+                               *vp = better;
+                               return 0;
+                       }
+               }
+       }
+
        return (match);
 }
 
@@ -843,13 +921,17 @@
 struct inpcb *
 in_pcblookup_connect(struct inpcbtable *table,
     struct in_addr faddr, u_int fport_arg,
-    struct in_addr laddr, u_int lport_arg)
+    struct in_addr laddr, u_int lport_arg,
+    vestigial_inpcb_t *vp)
 {
        struct inpcbhead *head;
        struct inpcb_hdr *inph;
        struct inpcb *inp;
        u_int16_t fport = fport_arg, lport = lport_arg;
 
+       if (vp)
+               vp->valid = 0;
+
        head = INPCBHASH_CONNECT(table, faddr, fport, laddr, lport);


