Subject: kern/11201: Missing ICMP redirects
To: None <gnats-bugs@gnats.netbsd.org>
From: Wolfgang Solfrank <ws@tools.de>
List: netbsd-bugs
Date: 10/12/2000 10:00:19
>Number:         11201
>Category:       kern
>Synopsis:       Missing ICMP redirects
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Oct 12 10:00:01 PDT 2000
>Closed-Date:
>Last-Modified:
>Originator:     Wolfgang Solfrank
>Release:        NetBSD-1.3.2
>Organization:
NetBSD Hackers
>Environment:
	
System: NetBSD adsl 1.3.2 (adsl) #33: Thu Oct 12 16:21:30 MEST 2000	ws@adsl:/files/NetBSD/src/sys/arch/i386/compile/adsl i386
Architecture: i386

>Description:
	On little-endian machines, the networking code doesn't always
	send ICMP redirects.

	The sequence of events is as follows:

	The IP packet is pulled off the input queue (in ipintr).
	Immediately after this, the fields in the IP header are converted
	to host byte order.

	After determining that the packet needs to be forwarded, it is
	handed to ip_forward, which quite early makes a copy of the start
	of the packet, which will eventually be used for an ICMP message.
	Now this copy process handles mbufs with M_EXT set by just pointing
	to the same external storage as the original mbuf (this is the
	culprit of the problem, see below).

	After that there is a check for whether to send an ICMP redirect,
	the outcome of which is remembered until later (i.e. the ICMP
	message isn't sent right here).

	Then the packet is handed to ip_output, which (among other things)
	converts the fields in the IP header back to network byte order.
	It's only after the return from ip_output that eventually the
	function icmp_error gets called in order to actually send the
	ICMP redirect determined above.

	Now remember that if the original packet was in a cluster (or
	otherwise external storage), the storage for the packet data
	handed to icmp_error is the same as that handed to ip_output
	previously.  Since ip_output swapped the bytes in the IP header,
	the header in this case isn't in the byte order which icmp_error
	expects!

	icmp_error tries to avoid sending messages in response to packets
	other than the first fragment of a larger sequence.  Therefore it
	checks the offset field in the IP header, of course masking off the
	"don't fragment" and "more fragments" bits.  However, since the
	field is in the wrong byteorder, it masks off the wrong bits.
	Especially if the original packet had the "don't fragment" bit
	set (which is quite common nowadays), the code in icmp_error
	incorrectly determines that the packet isn't the first fragment
	and happily discards it.

	Note that I actually analyzed this on a 1.3.2 box (don't ask)
	but it seems that the problem is still there with 1.5H.  Can
	anyone please confirm or deny that the problem still persists
	with -current?

>How-To-Repeat:
	Install a little-endian NetBSD box as a router on a net which also
	has other routers, cause a host which uses PMTU discovery to send
	packets to the NetBSD box and watch the traffic on the wire.  Be
	surprised that you don't see any ICMP redirects.

>Fix:
	The fix is to avoid pointing to the same storage when doing the
	copying of the packet.  Since the amount of data to copy for the
	ICMP packet is small, there is no need to have external storage
	for the copy.  So it should be ok to just copy the part of the
	data to be copied into the mbuf instead of referencing the
	external storage (note that the diff below is against -current
	as of 10/12/00):

	*** uipc_mbuf.c.old	Thu Oct 12 18:39:23 2000
	--- uipc_mbuf.c	Thu Oct 12 18:54:20 2000
	***************
	*** 424,429 ****
	--- 424,430 ----
	  {
	  	struct mbuf *n, **np;
	  	int off = off0;
	+ 	int blen;
	  	struct mbuf *top;
	  	int copyhdr = 0;
	  
	***************
	*** 451,456 ****
	--- 452,458 ----
	  		*np = n;
	  		if (n == 0)
	  			goto nospace;
	+ 		blen = MLEN;
	  		if (copyhdr) {
	  			M_COPY_PKTHDR(n, m);
	  			if (len == M_COPYALL)
	***************
	*** 458,466 ****
	  			else
	  				n->m_pkthdr.len = len;
	  			copyhdr = 0;
	  		}
	  		n->m_len = min(len, m->m_len - off);
	! 		if (m->m_flags & M_EXT) {
	  			if (!deep) {
	  				n->m_data = m->m_data + off;
	  				n->m_ext = m->m_ext;
	--- 460,473 ----
	  			else
	  				n->m_pkthdr.len = len;
	  			copyhdr = 0;
	+ 			blen = MHLEN;
	  		}
	  		n->m_len = min(len, m->m_len - off);
	! 		if (n->m_len > blen) {
	! #ifdef	DIAGNOSTIC
	! 			if (!(m->m_flags & M_EXT)) {
	! 				panic("m_copym: m->m_len too large for !M_EXT");
	! #endif
	  			if (!deep) {
	  				n->m_data = m->m_data + off;
	  				n->m_ext = m->m_ext;

	BTW, shouldn't the m_copym and m_dup functions be changed to
	macros?
>Release-Note:
>Audit-Trail:
>Unformatted: