tech-net: Re: NetBSD in BSD Router / Firewall Testing

Subject: Re: NetBSD in BSD Router / Firewall Testing
To: Mike Tancsa <mike@sentex.net>
From: Jonathan Stone <jonathan@Pescadero.dsg.stanford.edu>
List: tech-net
Date: 12/01/2006 10:49:30
As sometime principal maintainer of NetBSD's bge(4) driver, and the
author of many of the changes and chip-variant support subsequently
folded into OpenBSD's bge(4) by brad@openbsd.org, I'd like to speak
to a couple of points here.

First point is Thor's comment about variance in framesize due to inserting,
or not inserting, VLAN tags. I've always quietly assumed that
full-duplex Ethernet packets obey the original 10Mbit CSMA/CD minimum
packet length: in case, for example, a small frame is switched onto a
half-duplex link, such as a 100Mbit hub, or 10Mbit coax.

I believe the UDP packets in Mike's tests are all so small that, even
with a VLAN tag added, the Ethernet payload (IPv4 header, UDP header,
10 bytes UDP payload), plus 14-byte Ethernet header, plus 4-byte CRC,
is still less than the ETHER_MIN_MTU.  If so, I don't see how
framesize is a factor, since the packets will be padded to the minimum
valid Ethernet payload in any case. OTOH, switch forwarding PPS may
well show a marginal degradation due to VLAN insertion; but we're
still 2 or 3 orders of magnitude away from those limits.
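
A quick check of the frame-size arithmetic above, as a sketch only. It
assumes an IPv4 header without options (20 bytes), the classic 64-byte
minimum frame including header and CRC, and that the same minimum
applies to tagged frames; the function and constant names are made up
for illustration.

```python
# All sizes are in bytes. Names here are illustrative, not kernel constants.
ETHER_HDR = 14   # destination MAC + source MAC + EtherType
ETHER_CRC = 4    # frame check sequence
VLAN_TAG = 4     # 802.1Q tag inserted after the source MAC
MIN_FRAME = 64   # assumed minimum frame length, header and CRC included
IPV4_HDR = 20    # assumes no IP options
UDP_HDR = 8


def frame_len(udp_payload, tagged):
    """Return (unpadded, on_wire) frame length for one UDP/IPv4 packet."""
    unpadded = (ETHER_HDR + (VLAN_TAG if tagged else 0)
                + IPV4_HDR + UDP_HDR + udp_payload + ETHER_CRC)
    # Short frames are padded up to the minimum before transmission.
    return unpadded, max(unpadded, MIN_FRAME)


for tagged in (False, True):
    unpadded, on_wire = frame_len(10, tagged)
    print(f"tagged={tagged}: unpadded={unpadded} on_wire={on_wire}")
```

Under those assumptions both the tagged and untagged 10-byte-payload
frames fall below the minimum and occupy the same length on the wire.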

Second point: NetBSD's bge(4) driver includes support for runtime
manual tuning of interrupt mitigation.  I chose the tuning values
based on empirical measurements of large TCP flows on bcm5700s and bcm5704s.

If my (dimming) memory serves, the default value of 0 yields
thresholds close to Bill Paul's original FreeBSD driver. A value of
1 yields a bge interrupt for every two full-sized Ethernet
frames. Each increment of the sysctl knob will, roughly, halve receive
interrupt rate, up to a maximum of 5, which interrupts about every 30
to 40 full-sized TCP segments.
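
The scaling described above can be modelled roughly as follows. This is
a sketch of the description only, not the driver's actual coalescing
table: it assumes level 1 means exactly two full-sized frames per
interrupt and that each further level exactly doubles that. The
function name is hypothetical, and level 0 (the default) is left out
because its thresholds are not given here.

```python
def frames_per_interrupt(level):
    """Approximate full-sized frames per receive interrupt at a given level.

    Idealised model: 2 frames at level 1, doubling with each increment,
    up to the maximum level of 5.
    """
    if not 1 <= level <= 5:
        raise ValueError("model covers mitigation levels 1 through 5 only")
    return 2 ** level


for level in range(1, 6):
    print(level, frames_per_interrupt(level))
```

In this idealised model level 5 comes out at 32 frames per interrupt,
which sits inside the "30 to 40" range quoted above.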

I personally haven't done peak packet-rate measurements with bge(4) in
years.  *However*, I can state for a fact that for ttcp-like
workloads, the NetBSD-style interrupt mitigation gives superior
throughput and lower CPU utilization than FreeBSD-6.1.  (I have discussed
various measurements privately with Robert Watson, Andre, and Sam Leffler
at length).

I therefore see very, very good grounds to expect that NetBSD would
show much better performance if you increase bge interrupt mitigation.
However, as interrupt mitigation increases, the lengths of
per-interrupt bursts of packets hitting ipintrq build up by a factor
of 2 for each increment in interrupt level.  I typically run ttcp with
BGE interrupt mitigation at 4 or 5, and an ipintrq depth of 512 per
interface (2048 for 4 interfaces).  NetBSD-3.1 on a 2.4GHz Opteron can
handle at least 320,000 packets/sec of receive TCP traffic, including
delivering the TCP traffic to userspace.  For a tinygram stream, I'd
expect you would need to make ipintrq even deeper.

On a related note: each setting of the bge interrupt-mitigation "knob"
has two values, one for per-packet limits and one for DMA-segment
limits (essentially, bytes).  I'd not be surprised if the per-packet
limits are suboptimal for traffic consisting solely of tinygrams.


That said: I see a very strong philosophical design difference between
FreeBSD's polling machinery, and the interrupt-mitigation approaches
variously implemented by Jason Thorpe in wm(4) and by myself in
bge(4).  For the workloads I care about, the design-point tradeoffs in
FreeBSD-4's polling are simply not acceptable.  I *want* kernel
softint processing to pre-empt userspace processes, and even
kthreads. I acknowledge that my needs are, perhaps, unusual.

Even so, I'd be glad to work on improving bge(4) tuning for workloads
dominated by tinygrams.  The same packet rate as ttcp (over
400kpacket/sec on a 2.4GHz Opteron) seems like an achievable target
--- unless there's a whole lot of CPU processing going on inside
IP-forwarding that I'm wholly unaware of.

At a receive rate of 123Mbyte/sec per bge interface, I see roughly
5,000 interrupts per bge per second.  What interrupt rates are you
seeing for each bge device in your tests?
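
For comparison, the figures above translate into packets per interrupt
roughly as follows. This is a back-of-envelope sketch: it assumes
"Mbyte" means 10^6 bytes, that the 123Mbyte/sec is TCP payload, and
that each segment carries 1460 bytes (the usual MSS on a 1500-byte MTU
without TCP options). None of those assumptions is stated in this
message, and the function name is made up.

```python
def packets_per_interrupt(bytes_per_sec, bytes_per_packet, interrupts_per_sec):
    """Average number of packets handled per interrupt."""
    packets_per_sec = bytes_per_sec / bytes_per_packet
    return packets_per_sec / interrupts_per_sec


# 123 Mbyte/sec and 5,000 interrupts/sec are from the text;
# the 1460-byte segment size is an assumption.
estimate = packets_per_interrupt(123_000_000, 1460, 5_000)
print(f"about {estimate:.1f} full-sized segments per interrupt")
```

Run the same calculation with a measured interrupt rate and the
tinygram packet rate to see how many packets each interrupt delivers
in the forwarding tests.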


>NetBSD 4.99.4 (ROUTER) #1: Thu Nov 30 19:23:52 EST 2006

[snip dmesg showing Broadcom 5750 NICs; see original for details]


>The best I can get is about 125Kpps
>
>
>However, if I switch to the 2 bge nics (ie NON trunked mode), I get 
>close to 600 Kpps on the one stream and a max of 360Kpps when I have 
>the stream in the opposite direction going.  This is comparable to 
>the other boxes.  However, the driver did wedge and I had to ifconfig 
>down/up it to recover once during testing.


>Nov 30 19:36:21 r2-netbsd /netbsd: bge1: pcie mode=0x105000
>Nov 30 19:38:00 r2-netbsd /netbsd: bge2: pcie mode=0x105000

Oops. Those messages were for my own verification and shouldn't be in
normal builds.

>Nov 30 19:54:18 r2-netbsd /netbsd: bge: failed on len 52?
>Nov 30 19:54:49 r2-netbsd last message repeated 10930 times
>Nov 30 19:55:55 r2-netbsd last message repeated 14526 times
>Nov 30 19:56:11 r2-netbsd /netbsd: ed on len 52?
>Nov 30 19:56:11 r2-netbsd /netbsd: bge: failed on len 52?
>Nov 30 19:56:12 r2-netbsd last message repeated 719 times
>Nov 30 19:56:20 r2-netbsd /netbsd: ed on len 52?
>Nov 30 19:56:20 r2-netbsd /netbsd: bge: failed on len 52?
>Nov 30 19:56:21 r2-netbsd last message repeated 717 times

I've never seen that particular bug. I don't believe I have any actual
5750 chips to try to reproduce it.  I do have access to: 5700, 5701,
5705, 5704, 5721, 5752, 5714, 5715, 5780. (I have one machine with one
5752; and the 5780 is one-dualport-per HT-2000 chip, which means one
per motherboard. But for most people's purposes, the 5780/5714/5715
are indistinguishable).

I wonder, does this problem go away if you crank up interrupt mitigation?