Subject: Performance issue in tcp_reass()
To: None <tech-net@netbsd.org>
From: None <ragge@ludd.luth.se>
List: tech-net
Date: 04/19/2004 16:45:45
In tcp_reass() the routine m_cat() is frequently called to append 
received mbufs to the end of a received chain.  If the chain of
received packets become long, the machine may become unusable for 
a long time if packets are lost. For example;

If running at around 3Gbit with a window size of 100Mbyte (RTT 200ms)
and one packet is lost, the machine will lock up for a loong time or
until the sending process is terminated.  This is because of that 
m_cat() is called at softintr level, and as long as new packets are
received acks are sent and new packets are clocked out.

The attached diff solves this problem by making use of the mlast
element in the ipqent struct, thus keeping a pointer to the last
mbuf in the queue.  All type of communication with packet losses
will benefit from this, but it's most visible on "long fat pipes".

Comments?

-- Ragge


Index: tcp_input.c
===================================================================
RCS file: /cvsroot/src/sys/netinet/tcp_input.c,v
retrieving revision 1.193
diff -c -r1.193 tcp_input.c
*** tcp_input.c	17 Apr 2004 23:35:37 -0000	1.193
--- tcp_input.c	19 Apr 2004 14:33:40 -0000
***************
*** 340,345 ****
--- 340,347 ----
      __P((const struct ip6_hdr *, const struct tcphdr *));
  #endif
  
+ #define	TRAVERSE(x) while ((x)->m_next) (x) = (x)->m_next
+ 
  int
  tcp_reass(tp, th, m, tlen)
  	struct tcpcb *tp;
***************
*** 393,399 ****
  		if (pkt_seq == p->ipqe_seq + p->ipqe_len) {
  			p->ipqe_len += pkt_len;
  			p->ipqe_flags |= pkt_flags;
! 			m_cat(p->ipqe_m, m);
  			m = NULL;
  			tiqe = p;
  			TAILQ_REMOVE(&tp->timeq, p, ipqe_timeq);
--- 395,402 ----
  		if (pkt_seq == p->ipqe_seq + p->ipqe_len) {
  			p->ipqe_len += pkt_len;
  			p->ipqe_flags |= pkt_flags;
! 			m_cat(p->ipre_mlast, m);
! 			TRAVERSE(p->ipre_mlast);
  			m = NULL;
  			tiqe = p;
  			TAILQ_REMOVE(&tp->timeq, p, ipqe_timeq);
***************
*** 424,429 ****
--- 427,434 ----
  			q->ipqe_flags |= pkt_flags;
  			m_cat(m, q->ipqe_m);
  			q->ipqe_m = m;
+ 			q->ipre_mlast = m; /* last mbuf may have changed */
+ 			TRAVERSE(q->ipre_mlast);
  			tiqe = q;
  			TAILQ_REMOVE(&tp->timeq, q, ipqe_timeq);
  			TCP_REASS_COUNTER_INCR(&tcp_reass_prependfirst);
***************
*** 455,461 ****
  			pkt_len += q->ipqe_len;
  			pkt_flags |= q->ipqe_flags;
  			pkt_seq = q->ipqe_seq;
! 			m_cat(q->ipqe_m, m);
  			m = q->ipqe_m;
  			TCP_REASS_COUNTER_INCR(&tcp_reass_append);
  			goto free_ipqe;
--- 460,467 ----
  			pkt_len += q->ipqe_len;
  			pkt_flags |= q->ipqe_flags;
  			pkt_seq = q->ipqe_seq;
! 			m_cat(q->ipre_mlast, m);
! 			TRAVERSE(q->ipre_mlast);
  			m = q->ipqe_m;
  			TCP_REASS_COUNTER_INCR(&tcp_reass_append);
  			goto free_ipqe;
***************
*** 519,525 ****
  #endif
  			m_adj(m, overlap);
  			rcvpartdupbyte += overlap;
! 			m_cat(q->ipqe_m, m);
  			m = q->ipqe_m;
  			pkt_seq = q->ipqe_seq;
  			pkt_len += q->ipqe_len - overlap;
--- 525,532 ----
  #endif
  			m_adj(m, overlap);
  			rcvpartdupbyte += overlap;
! 			m_cat(q->ipre_mlast, m);
! 			TRAVERSE(q->ipre_mlast);
  			m = q->ipqe_m;
  			pkt_seq = q->ipqe_seq;
  			pkt_len += q->ipqe_len - overlap;
***************
*** 633,638 ****
--- 640,646 ----
  	 * Insert the new fragment queue entry into both queues.
  	 */
  	tiqe->ipqe_m = m;
+ 	tiqe->ipre_mlast = m;
  	tiqe->ipqe_seq = pkt_seq;
  	tiqe->ipqe_len = pkt_len;
  	tiqe->ipqe_flags = pkt_flags;