Re: No buffer space available

To: Tom Ivar Helbekkmo <tih%hamartun.priv.no@localhost>
Subject: Re: No buffer space available
From: Erik Fair <fair%netbsd.org@localhost>
Date: Tue, 2 Sep 2014 22:28:46 -0700

Network hangs are insidious. [old fart story time]

The headscratcher for me was the one in the 1990s at apple.com (when apple.com 
was a DEC VAX-8650 running 4.3BSD) that led me to discover TCP SYN attacks and 
report them to the CERT two years before panix.com was attacked in the same 
way. Problem: a far too limited initial TCP SYN queue length (5!), and when the 
short queue was full, any new TCP connection attempts to that port failed with 
"connection timed out" (inbound SYN packet dropped because the queue for that 
port is full), despite ping (ICMP) working fine.

Imagine:

"telnet localhost 25" gives "connection timed out" (wait, what? How is that 
possible?)

kill sendmail (yeah, we used sendmail back then)

"telnet localhost 25" gives "connection refused" (OK, as expected)

restart sendmail

"telnet localhost 25" gives "connection timed out" (WTF?!!)

Rebooting the VAX didn't clear the problem either - same behavior afterwards.

That's when I went looking to our routers to see if anything was wrong with the 
rest of our connections to the Internet.

The source of my problem was warring "default" routes in a pair of our 
exterior-facing Cisco routers (round & round a class of outbound packets went 
until TTL exceeded), but because the routers carried about 2/3rds of the full 
"default-free" Internet routing table at the time, we didn't immediately notice 
that we couldn't talk to 1/3rd of the Internet. Of course, they could all still 
send packets to us ... which is how the TCP SYN queue got full: our SYN-ACKs 
weren't getting out to that 1/3rd, and with the SYN queue full (and a 
two-minute timeout on each entry), SMTP suddenly stopped accepting any other 
connection attempts.
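
To make the mechanism concrete, here is a toy model of that queue in Python. It 
is only a sketch: the class and method names are made up for illustration, and 
it assumes nothing about the 4.3BSD code beyond the two numbers above (a 
queue length of 5 and a two-minute timeout).

```python
class SynQueue:
    """Toy model of a fixed-length queue of half-open TCP connections."""

    def __init__(self, backlog=5, timeout=120.0):
        self.backlog = backlog
        self.timeout = timeout
        self.pending = {}  # source name -> time the SYN arrived

    def _expire(self, now):
        # Drop entries that have waited out the timeout without an ACK.
        self.pending = {src: t for src, t in self.pending.items()
                        if now - t < self.timeout}

    def syn(self, src, now):
        """Return True if the SYN is queued, False if it is silently dropped."""
        self._expire(now)
        if len(self.pending) >= self.backlog:
            return False
        self.pending[src] = now
        return True

    def ack(self, src):
        """Handshake completed: the entry leaves the queue."""
        return self.pending.pop(src, None) is not None


q = SynQueue()
# Five peers whose SYN-ACKs never reach them fill the queue at t=0.
for i in range(5):
    q.syn(f"unreachable-{i}", now=0.0)
print(q.syn("reachable", now=1.0))    # False: dropped, client sees a timeout
print(q.syn("reachable", now=121.0))  # True: the stale entries have expired
```

A reachable peer sends its ACK almost at once and frees its slot, which is why 
the queue only jams when the replies can't get out.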

Once I found the default route loop, I fixed it, and then watched the load on 
apple.com shoot up as the Internet started actually being able to speak to our 
SMTP server again.

My report to the CERT (then at CMU SEI) came out of first "how did this 
happen?", followed by, "wow, I could send five or six packets every two minutes 
with totally random non-responsive (nonexistent!) IP source addresses to any 
particular host/TCP port combination and stop that host from being able to 
respond on that port! I could shut down E-mail at AOL! Moo hah hah! Oh, and, 
yeah, just try to trace & stop me, I dare you." [the CERT did nothing with my 
report, alas. I quietly provided it to friends at SGI and a few other places]

I also sent a somewhat oblique message to the IETF mailing list, asserting that 
a class of one-way (bidirectional communication not required) attacks existed, 
and that ISP ingress filtering of customer IP source addresses was the only way 
we'd be able to both forestall them, and trace them. That's a BCP now, but Phil 
Karn flamed me at the time for wanting to break one mode of mobile-IP. I wasn't 
graphic or explicit because that list was public, and I didn't want to provide 
a recipe for any would-be attackers until both the ingress filtering was 
deployed, and the OS companies had fixed their TCP implementations.

This all got fixed a few years later after Panix.com was attacked (though 
nowhere near as elegantly - they were really massively flooded) with the TCP 
SYN queue system we now have in NetBSD and all other responsible OSes.

The Internet is a pretty hostile network.

[/old fart story time]

How this relates: as noted in PR/7285, we have a semantic problem with our 
errors from the networking code: ENOBUFS (55) is returned for BOTH mbuf 
exhaustion, AND for "network interface queue full" (see the IFQ_MAXLEN, 
IFQ_SET_MAXLEN(), IF_QFULL() macros in /usr/include/net/if.h, and then in the 
particular network interface driver you use).

TCP is well-behaved: it just backs off and retransmits when it hits a condition 
like that, and your application probably never hears about it - though it may 
experience the condition as a performance degradation as TCP backs off.

UDP, not so much.

If your UDP-based applications are reporting that error, they're probably not 
doing anything active/adaptive about it. Some human is expected to analyze the 
situation and "deal with it" somehow. Lucky you, human. It might be time for 
you to recapitulate the TCP congestion measurement and backoff algorithms in 
your UDP application (good luck with that well-trod path to tears). Or just 
convert to TCP. Or ... fix your network (stack? interface? media? switches?), 
if you can figure out what's actually wrong.

The bad part is that without a distinct error message for "queue full", I can't 
tell you whether you really are running out of mbufs or whether you're 
overrunning the network interface output queue limit, whatever that is. 
netstat -m will tell you if you've ever hit the mbuf limit, and netstat -s will 
tell you about some queues on a per-protocol basis, but I don't see counters 
for network interface queues in there, as there probably should be.

In both cases, your application should take such an error as a message to back 
off and retransmit "later" (like TCP does).
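
A minimal sketch of that in Python, for a UDP sender. The function name, the 
retry count and the delays are all assumptions picked for illustration, not 
tuned values, and this is nowhere near real congestion control: it only keeps 
the application from treating ENOBUFS as fatal.

```python
import errno
import random
import time


def send_with_backoff(sock, data, addr, max_tries=6, base_delay=0.01,
                      max_delay=1.0, sleep=time.sleep):
    """Send one datagram, retrying with exponential backoff on ENOBUFS.

    Returns the number of attempts used. Any error other than ENOBUFS is
    re-raised at once; ENOBUFS is re-raised after the last attempt.
    """
    delay = base_delay
    for attempt in range(1, max_tries + 1):
        try:
            sock.sendto(data, addr)
            return attempt
        except OSError as exc:
            if exc.errno != errno.ENOBUFS or attempt == max_tries:
                raise
            # Jitter keeps several senders from retrying in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))
            delay = min(delay * 2, max_delay)
```

Call it with an ordinary socket.socket(AF_INET, SOCK_DGRAM) wherever you 
would have called sendto() directly. The sleep parameter is there only so 
the delay can be replaced in tests.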

The trouble with a network interface output queue full error is that it could 
be that your application is just plain transmitting faster than the network 
interface can physically go (and good luck finding that datum from the Unix 
networking API), or your interface has been flow-controlled due to congestion 
(modern gigabit Ethernet switches do that now), or, worse, the driver really is 
hanging in some odd state "for a while" (missed interrupt, perhaps? other 
hardware hiccup?) and the packets are piling up until the queue is full.

You seem to think it's that last, and it could well be - but I think you're 
going to have to instrument some code to catch it in the act to be able to 
really figure this out and be sure of your analysis.

We really should fix PR/7285 properly with the required API change: a new error 
code allocated at least amongst the BSDs, though we ought to get Linux on 
board, too (I haven't looked, but I bet they have the same problem).

An aside: one of my favorite network heartbeat monitoring tools is Network Time 
Protocol (NTP), because it (politely) polls its peers, and keeps very careful 
track of both packet losses, and transit times. Just looking at an "NTP 
billboard" (ntpq -p) can tell you quite a lot about the health of your network, 
depending upon which peers/servers you configure.
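
One way to read packet loss off that billboard is the "reach" column, which 
ntpq prints as the octal value of an 8-bit shift register, one bit per recent 
poll (377 means the last eight polls were all answered). A small Python sketch 
of decoding it follows; the function name is made up for illustration, and a 
peer that has been polled fewer than eight times will look lossier than it is.

```python
def polls_missed(reach):
    """Count unanswered polls among the last eight, from ntpq's reach column.

    `reach` is the octal string ntpq prints, e.g. "377".
    """
    register = int(reach, 8) & 0xFF
    return 8 - bin(register).count("1")


print(polls_missed("377"))  # 0: all of the last eight polls were answered
print(polls_missed("357"))  # 1: one poll in the last eight went unanswered
print(polls_missed("0"))    # 8: nothing heard from this peer
```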

        I hope this is of some use towards solving your problem,

        Erik <fair%netbsd.org@localhost>