Subject: Patch to make upper-layer network protocols align as needed
To: None <tech-net@netbsd.org>
From: Jason R Thorpe <thorpej@wasabisystems.com>
List: tech-net
Date: 05/11/2002 11:04:09
--i7KxW38SoMauyveo
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

One of the assumptions our network stack makes is that the IP headers
will be aligned to a 32-bit boundary when ip_input() is called.  This
places a constraint on drivers -- they must arrange for the packet to
be plopped into memory this way.  For Ethernet, this means scooting
the packet forward 2 bytes from the start of the receive buffer, so that
the IP header is suitably aligned after the 14 byte Ethernet header.

Unforunately, not all chips can do this.  Notable on the list are the
DEC Tulip, the Adaptec AIC-6915, National Semiconductor DP83815, and
even a few Gig-E chips.

To handle these devices, we currently allocate a new buffer and copy
the entire packet into the new buffer, with the packet start adjusted
for alignment.  Obviously, this has some real problems -- it consumes
memory bandwidth, and is just plain killer on performance at Gig-E
speeds.

What you really want to do is have the upper layers of the network
stack align the data as needed.  This saves you work if you are using
some other forwarding path which doesn't consult the L3 headers (e.g.
bridging), and also allows you to copy only those parts of the packet
that you need to examine (the payload after protocol headers typically
does not need to be aligned, since the data just goes into the socket
buffer, which makes no guarantees about alignment anyway).

I have written a patch that does this for IPv4 and IPv6.  The patch
includes the follwing:

	* New mbuf utility function, m_slurp().  This is like m_pullup()
	  except that it always performs the operation, and allows an
	  offset into the new buffer to be specified (this allows you to
	  leave space for link headers, in case you end up forwarding the
	  packet).

	* New macros that test for alignment of various IPv4 and IPv6
	  protocol headers.  For platforms which have no strict alignment
	  constraints (e.g. VAX, i386), these predicates always evaluate
	  to true, so that no alignment work will be done.

	* Sprinkle some assertions where m_pullup/m_pulldown is performed,
	  to make sure that the results of these operations are suitably
	  aligned.

So far, I've only had to place "make this aligned, please" in the ip_input()
and ip6_input() routines.  In TCP/UDP, etc., subsequent pullup/pulldown
operations end up being aligned as a side-effect of the IP header being
aligned.

I've tested this on an Alpha system by removing the code in the Tulip
driver which forces alignment of the packet.  Both IPv4 and IPv6 are
working fine.

I still need to change the IP fast-forward path to handle mis-aligned
IP headers -- that will be done differently than the normal IP input
path.

Yes, I know there are other ways to do this ... in particular, another
way would be to copy the headers into a buffer on the stack, and eliminate
all the places that assume the IP/TCP/UDP headers can be modified in-place.
That, however, is a pretty major undertaking, and I'd prefer to make
incremental progress.

Patch is attached.  Please comment.

-- 
        -- Jason R. Thorpe <thorpej@wasabisystems.com>

--i7KxW38SoMauyveo
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=ip-align-patch

Index: kern/uipc_mbuf.c
===================================================================
RCS file: /cvsroot/syssrc/sys/kern/uipc_mbuf.c,v
retrieving revision 1.59
diff -u -r1.59 uipc_mbuf.c
--- kern/uipc_mbuf.c	2002/03/09 01:46:33	1.59
+++ kern/uipc_mbuf.c	2002/05/11 17:38:00
@@ -729,6 +729,56 @@
 }
 
 /*
+ * Like m_pullup(), except a new mbuf is always allocated, and we allow
+ * the amount of empty space before the data in the new mbuf to be specified
+ * (in the event that the caller expects to prepend later).
+ */
+int MSFail;
+
+struct mbuf *
+m_slurp(struct mbuf *n, int len, int dstoff)
+{
+	struct mbuf *m;
+	int count, space;
+
+	if (len > (MHLEN - dstoff))
+		goto bad;
+	MGET(m, M_DONTWAIT, n->m_type);
+	if (m == NULL)
+		goto bad;
+	m->m_len = 0;
+	if (n->m_flags & M_PKTHDR) {
+		M_COPY_PKTHDR(m, n);
+		n->m_flags &= ~M_PKTHDR;
+	}
+	m->m_data += dstoff;
+	space = &m->m_dat[MLEN] - (m->m_data + m->m_len);
+	do {
+		count = min(min(max(len, max_protohdr), space), n->m_len);
+		memcpy(mtod(m, caddr_t) + m->m_len, mtod(n, caddr_t),
+		    (unsigned)count);
+		len -= count;
+		m->m_len += count;
+		n->m_len -= count;
+		space -= count;
+		if (n->m_len)
+			n->m_data += count;
+		else
+			n = m_free(n);
+	} while (len > 0 && n);
+	if (len > 0) {
+		(void) m_free(m);
+		goto bad;
+	}
+	m->m_next = n;
+	return (m);
+ bad:
+	m_freem(n);
+	MSFail++;
+	return (NULL);
+}
+
+/*
  * Partition an mbuf chain in two pieces, returning the tail --
  * all but the first len0 bytes.  In case of failure, it returns NULL and
  * attempts to restore the chain to its original state.
Index: sys/mbuf.h
===================================================================
RCS file: /cvsroot/syssrc/sys/sys/mbuf.h,v
retrieving revision 1.65
diff -u -r1.65 mbuf.h
--- sys/mbuf.h	2002/05/02 16:22:45	1.65
+++ sys/mbuf.h	2002/05/11 17:38:02
@@ -651,6 +651,7 @@
 struct	mbuf *m_prepend __P((struct mbuf *,int,int));
 struct	mbuf *m_pulldown __P((struct mbuf *, int, int, int *));
 struct	mbuf *m_pullup __P((struct mbuf *, int));
+struct	mbuf *m_slurp __P((struct mbuf *, int, int));
 struct	mbuf *m_split __P((struct mbuf *,int,int));
 void	m_adj __P((struct mbuf *, int));
 void	m_cat __P((struct mbuf *,struct mbuf *));
Index: netinet/icmp_var.h
===================================================================
RCS file: /cvsroot/syssrc/sys/netinet/icmp_var.h,v
retrieving revision 1.19
diff -u -r1.19 icmp_var.h
--- netinet/icmp_var.h	2001/10/30 06:41:09	1.19
+++ netinet/icmp_var.h	2002/05/11 17:38:03
@@ -83,6 +83,12 @@
 
 #ifdef _KERNEL
 struct	icmpstat icmpstat;
+
+#ifdef __NO_STRICT_ALIGNMENT
+#define	ICMP_HDR_ALIGNED_P(ic)	1
+#else
+#define	ICMP_HDR_ALIGNED_P(ic)	((((vaddr_t) (ic)) & 3) == 0)
 #endif
+#endif /* _KERNEL_ */
 
 #endif /* _NETINET_ICMP_VAR_H_ */
Index: netinet/igmp.c
===================================================================
RCS file: /cvsroot/syssrc/sys/netinet/igmp.c,v
retrieving revision 1.27
diff -u -r1.27 igmp.c
--- netinet/igmp.c	2001/11/13 00:32:35	1.27
+++ netinet/igmp.c	2002/05/11 17:38:04
@@ -179,6 +179,7 @@
 	m->m_data += iphlen;
 	m->m_len -= iphlen;
 	igmp = mtod(m, struct igmp *);
+	/* No need to assert alignment here. */
 	if (in_cksum(m, ip->ip_len - iphlen)) {
 		++igmpstat.igps_rcv_badsum;
 		m_freem(m);
Index: netinet/igmp_var.h
===================================================================
RCS file: /cvsroot/syssrc/sys/netinet/igmp_var.h,v
retrieving revision 1.11
diff -u -r1.11 igmp_var.h
--- netinet/igmp_var.h	1999/11/19 10:41:42	1.11
+++ netinet/igmp_var.h	2002/05/11 17:38:04
@@ -75,6 +75,12 @@
  */
 #define	IGMP_RANDOM_DELAY(X)	(random() % (X) + 1)
 
+#ifdef __NO_STRICT_ALIGNMENT
+#define	IGMP_HDR_ALIGNED_P(ig)	1
+#else
+#define	IGMP_HDR_ALIGNED_P(ig)	((((vaddr_t) (ig)) & 3) == 0)
+#endif
+
 void	igmp_init __P((void));
 void	igmp_input __P((struct mbuf *, ...));
 void	igmp_joingroup __P((struct in_multi *));
Index: netinet/ip_icmp.c
===================================================================
RCS file: /cvsroot/syssrc/sys/netinet/ip_icmp.c,v
retrieving revision 1.66
diff -u -r1.66 ip_icmp.c
--- netinet/ip_icmp.c	2001/11/13 00:32:37	1.66
+++ netinet/ip_icmp.c	2002/05/11 17:38:05
@@ -419,6 +419,7 @@
 	m->m_len -= hlen;
 	m->m_data += hlen;
 	icp = mtod(m, struct icmp *);
+	/* Don't need to assert alignment, here. */
 	if (in_cksum(m, icmplen)) {
 		icmpstat.icps_checksum++;
 		goto freeit;
Index: netinet/ip_input.c
===================================================================
RCS file: /cvsroot/syssrc/sys/netinet/ip_input.c,v
retrieving revision 1.148
diff -u -r1.148 ip_input.c
--- netinet/ip_input.c	2002/05/07 02:59:38	1.148
+++ netinet/ip_input.c	2002/05/11 17:38:07
@@ -418,10 +418,24 @@
 	if (TAILQ_FIRST(&in_ifaddr) == 0)
 		goto bad;
 	ipstat.ips_total++;
-	if (m->m_len < sizeof (struct ip) &&
-	    (m = m_pullup(m, sizeof (struct ip))) == 0) {
-		ipstat.ips_toosmall++;
-		return;
+	/*
+	 * If the IP header is not aligned, slurp it up into a new
+	 * mbuf with space for link headers, in the event we forward
+	 * it.  Otherwise, if it is aligned, make sure the entire
+	 * base IP header is in the first mbuf of the chain.
+	 */
+	if (IP_HDR_ALIGNED_P(mtod(m, caddr_t)) == 0) {
+		if ((m = m_slurp(m, sizeof(struct ip),
+				 (max_linkhdr + 3) & ~3)) == NULL) {
+			/* XXXJRT new stat, please */
+			ipstat.ips_toosmall++;
+			return;
+		}
+	} else if (__predict_false(m->m_len < sizeof (struct ip))) {
+		if ((m = m_pullup(m, sizeof (struct ip))) == NULL) {
+			ipstat.ips_toosmall++;
+			return;
+		}
 	}
 	ip = mtod(m, struct ip *);
 	if (ip->ip_v != IPVERSION) {
@@ -474,8 +488,9 @@
 	default:
 		/* Must compute it ourselves. */
 		INET_CSUM_COUNTER_INCR(&ip_swcsum);
-		if (in_cksum(m, hlen) != 0)
+		if (in_cksum(m, hlen) != 0) {
 			goto bad;
+		}
 		break;
 	}
 
Index: netinet/ip_var.h
===================================================================
RCS file: /cvsroot/syssrc/sys/netinet/ip_var.h,v
retrieving revision 1.47
diff -u -r1.47 ip_var.h
--- netinet/ip_var.h	2002/05/07 02:59:38	1.47
+++ netinet/ip_var.h	2002/05/11 17:38:07
@@ -183,6 +183,12 @@
 #define	IP_ALLOWBROADCAST	SO_BROADCAST	/* can send broadcast packets */
 #define	IP_MTUDISC		0x0400		/* Path MTU Discovery; set DF */
 
+#ifdef __NO_STRICT_ALIGNMENT
+#define	IP_HDR_ALIGNED_P(ip)	1
+#else
+#define	IP_HDR_ALIGNED_P(ip)	((((vaddr_t) (ip)) & 3) == 0)
+#endif
+
 extern struct ipstat ipstat;		/* ip statistics */
 extern LIST_HEAD(ipqhead, ipq) ipq;	/* ip reass. queue */
 extern u_int16_t ip_id;			/* ip packet ctr, for ids */
Index: netinet/tcp_input.c
===================================================================
RCS file: /cvsroot/syssrc/sys/netinet/tcp_input.c,v
retrieving revision 1.141
diff -u -r1.141 tcp_input.c
--- netinet/tcp_input.c	2002/05/07 02:59:38	1.141
+++ netinet/tcp_input.c	2002/05/11 17:38:13
@@ -865,6 +865,8 @@
 		return;
 	}
 
+	KASSERT(TCP_HDR_ALIGNED_P(th));
+
 	/*
 	 * Check that TCP offset makes sense,
 	 * pull out TCP options and adjust length.		XXX
@@ -915,6 +917,7 @@
 		 * (as they're before toff) and we don't need to update those.
 		 */
 #endif
+		KASSERT(TCP_HDR_ALIGNED_P(th));
 		optlen = off - sizeof (struct tcphdr);
 		optp = ((caddr_t)th) + sizeof(struct tcphdr);
 		/* 
Index: netinet/tcp_var.h
===================================================================
RCS file: /cvsroot/syssrc/sys/netinet/tcp_var.h,v
retrieving revision 1.89
diff -u -r1.89 tcp_var.h
--- netinet/tcp_var.h	2002/03/15 09:25:42	1.89
+++ netinet/tcp_var.h	2002/05/11 17:38:22
@@ -675,6 +675,12 @@
 	{ 1, 0, &tcp_delack_ticks },		\
 }
 
+#ifdef __NO_STRICT_ALIGNMENT
+#define	TCP_HDR_ALIGNED_P(th)	1
+#else
+#define	TCP_HDR_ALIGNED_P(th)	((((vaddr_t) (th)) & 3) == 0)
+#endif
+
 int	 tcp_attach __P((struct socket *));
 void	 tcp_canceltimers __P((struct tcpcb *));
 struct tcpcb *
Index: netinet/udp_usrreq.c
===================================================================
RCS file: /cvsroot/syssrc/sys/netinet/udp_usrreq.c,v
retrieving revision 1.92
diff -u -r1.92 udp_usrreq.c
--- netinet/udp_usrreq.c	2001/12/21 02:51:08	1.92
+++ netinet/udp_usrreq.c	2002/05/11 17:38:23
@@ -259,6 +259,7 @@
 		return;
 	}
 #endif
+	KASSERT(UDP_HDR_ALIGNED_P(uh));
 
 	/* destination port of 0 is illegal, based on RFC768. */
 	if (uh->uh_dport == 0)
@@ -415,6 +416,7 @@
 		return IPPROTO_DONE;
 	}
 #endif
+	KASSERT(UDP_HDR_ALIGNED_P(uh));
 	ulen = ntohs((u_short)uh->uh_ulen);
 	/*
 	 * RFC2675 section 4: jumbograms will have 0 in the UDP header field,
Index: netinet/udp_var.h
===================================================================
RCS file: /cvsroot/syssrc/sys/netinet/udp_var.h,v
retrieving revision 1.17
diff -u -r1.17 udp_var.h
--- netinet/udp_var.h	1999/11/20 00:38:00	1.17
+++ netinet/udp_var.h	2002/05/11 17:38:25
@@ -88,6 +88,12 @@
 struct	inpcbtable udbtable;
 struct	udpstat udpstat;
 
+#ifdef __NO_STRICT_ALIGNMENT
+#define	UDP_HDR_ALIGNED_P(uh)	1
+#else
+#define	UDP_HDR_ALIGNED_P(uh)	((((vaddr_t) (uh)) & 3) == 0)
+#endif
+
 void	 *udp_ctlinput __P((int, struct sockaddr *, void *));
 void	 udp_init __P((void));
 void	 udp_input __P((struct mbuf *, ...));
Index: netinet6/icmp6.c
===================================================================
RCS file: /cvsroot/syssrc/sys/netinet6/icmp6.c,v
retrieving revision 1.75
diff -u -r1.75 icmp6.c
--- netinet6/icmp6.c	2002/03/05 08:13:56	1.75
+++ netinet6/icmp6.c	2002/05/11 17:38:29
@@ -470,6 +470,7 @@
 		return IPPROTO_DONE;
 	}
 #endif
+	KASSERT(IP6_HDR_ALIGNED_P(icmp6));
 	code = icmp6->icmp6_code;
 
 	if ((sum = in6_cksum(m, IPPROTO_ICMPV6, off, icmp6len)) != 0) {
Index: netinet6/ip6_input.c
===================================================================
RCS file: /cvsroot/syssrc/sys/netinet6/ip6_input.c,v
retrieving revision 1.51
diff -u -r1.51 ip6_input.c
--- netinet6/ip6_input.c	2001/12/22 01:40:03	1.51
+++ netinet6/ip6_input.c	2002/05/11 17:38:31
@@ -284,10 +284,28 @@
 	IP6_EXTHDR_CHECK(m, 0, sizeof(struct ip6_hdr), /*nothing*/);
 #endif
 
-	if (m->m_len < sizeof(struct ip6_hdr)) {
-		struct ifnet *inifp;
-		inifp = m->m_pkthdr.rcvif;
-		if ((m = m_pullup(m, sizeof(struct ip6_hdr))) == 0) {
+	/*
+	 * If the IPv6 header is not aligned, slurp it up into a new
+	 * mbuf with space for link headers, in the event we forward
+	 * it.  OTherwise, if it is aligned, make sure the entire base
+	 * IPv6 header is in the first mbuf of the chain.
+	 */
+	if (IP6_HDR_ALIGNED_P(mtod(m, caddr_t)) == 0) {
+		if ((m = m_slurp(m, sizeof(struct ip6_hdr),
+				 (max_linkhdr + 3) & ~3)) == NULL) {
+			struct ifnet *inifp;
+			inifp = m->m_pkthdr.rcvif;
+
+			/* XXXJRT new stat, please */
+			ip6stat.ip6s_toosmall++;
+			in6_ifstat_inc(inifp, ifs6_in_hdrerr);
+			return;
+		}
+	} else if (__predict_false(m->m_len < sizeof(struct ip6_hdr))) {
+		if ((m = m_pullup(m, sizeof(struct ip6_hdr))) == NULL) {
+			struct ifnet *inifp;
+			inifp = m->m_pkthdr.rcvif;
+
 			ip6stat.ip6s_toosmall++;
 			in6_ifstat_inc(inifp, ifs6_in_hdrerr);
 			return;
@@ -636,6 +654,7 @@
 			return;
 		}
 #endif
+		KASSERT(IP6_HDR_ALIGNED_P(hbh));
 		nxt = hbh->ip6h_nxt;
 
 		/*
@@ -800,6 +819,7 @@
 		return -1;
 	}
 #endif
+	KASSERT(IP6_HDR_ALIGNED_P(hbh));
 	off += hbhlen;
 	hbhlen -= sizeof(struct ip6_hbh);
 	opt = (u_int8_t *)hbh + sizeof(struct ip6_hbh);
@@ -1160,6 +1180,7 @@
 				return;
 			}
 #endif
+			KASSERT(IP6_HDR_ALIGNED_P(ip6e));
 
 			switch (nxt) {
 		        case IPPROTO_DSTOPTS:
Index: netinet6/ip6_var.h
===================================================================
RCS file: /cvsroot/syssrc/sys/netinet6/ip6_var.h,v
retrieving revision 1.18
diff -u -r1.18 ip6_var.h
--- netinet6/ip6_var.h	2001/12/21 03:58:15	1.18
+++ netinet6/ip6_var.h	2002/05/11 17:38:32
@@ -205,6 +205,12 @@
 #define	IPV6_FORWARDING		0x02	/* most of IPv6 header exists */
 #define	IPV6_MINMTU		0x04	/* use minimum MTU (IPV6_USE_MIN_MTU) */
 
+#ifdef __NO_STRICT_ALIGNMENT
+#define	IP6_HDR_ALIGNED_P(ip)	1
+#else
+#define	IP6_HDR_ALIGNED_P(ip)	((((vaddr_t) (ip)) & 3) == 0)
+#endif
+
 extern struct	ip6stat ip6stat;	/* statistics */
 extern u_int32_t ip6_id;		/* fragment identifier */
 extern int	ip6_defhlim;		/* default hop limit */

--i7KxW38SoMauyveo--