Subject: Re: NetBSD in BSD Router / Firewall Testing
To: Jonathan Stone <jonathan@Pescadero.dsg.stanford.edu>
From: Mike Tancsa <mike@sentex.net>
List: tech-net
Date: 12/01/2006 15:34:21
At 01:49 PM 12/1/2006, Jonathan Stone wrote:

>As sometime principal maintainer of NetBSD's bge(4) driver, and the
>author of many of the changes and chip-variant support subsequently
>folded into OpenBSD's bge(4) by brad@openbsd.org, I'd like to speak
>to a couple of points here.

First off, thanks for the extended insights!  This has been a most 
interesting exercise for me.



>I believe the UDP packets in Mike's tests are all so small that, even
>with a VLAN tag added, the Ethernet payload (IPv4 header, UDP header,
>10 bytes UDP payload), plus 14-byte Ethernet header, plus 4-byte CRC,
>is still less than the ETHER_MIN_MTU.  If so, I don't see how
>framesize is a factor, since the packets will be padded to the minimum
>valid Ethernet payload in any case. OTOH, switch forwarding PPS may
>well show a marginal degradation due to VLAN insertion; but we're
>still 2 or 3 orders of magnitude away from those limits.
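
Just to sanity-check that arithmetic on my end (using the standard
header sizes), my understanding is that the payload works out to:

   IPv4 header        20 bytes
   UDP header          8 bytes
   UDP payload        10 bytes
   ---------------------------
   Ethernet payload   38 bytes   (42 with the 802.1Q tag counted in)

which is under the 46-byte minimum either way, so the frames get padded
out to the minimum frame size on the wire regardless of tagging.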

Unfortunately, my budget is not so high that I can afford to have a
high-end gigE switch in my test area.  I started off with a Linksys,
which I managed to hang under moderately high loads.  I had an
opportunity to test the Netgear and it was a pretty reasonable price
(~$650 USD) for what it claims it's capable of (17Mpps).  It certainly
hasn't locked up, and I tried putting a bunch of boxes online and
forwarding packets as fast as all 8 of the boxes could, and there
didn't seem to be any ill effects on the switch.  Similarly, trunking,
although a bit wonky to configure (I am far more used to Cisco land),
at least works and doesn't seem to degrade overall performance.


>Second point: NetBSD's bge(4) driver includes support for runtime
>manual tuning of interrupt mitigation.  I chose the tuning values
>based on empirical measurements of large TCP flows on bcm5700s and bcm5704s.
>
>If my (dimming) memory serves, the default value of 0 yields
>thresholds close to Bill Paul's original FreeBSD driver. A value of
>1 yields a bge interrupt for every two full-sized Ethernet
>frames. Each increment of the sysctl knob will, roughly, halve the receive
>interrupt rate, up to a maximum of 5, which interrupts about once every
>30 to 40 full-sized TCP segments.

I take it this is it:
# sysctl -d hw.bge.rx_lvl
hw.bge.rx_lvl: BGE receive interrupt mitigation level
# sysctl hw.bge.rx_lvl
hw.bge.rx_lvl = 0
#
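
For each level I just set it at runtime before re-running the test,
i.e. something like

# sysctl -w hw.bge.rx_lvl=3

(and I assume dropping hw.bge.rx_lvl=N into /etc/sysctl.conf would make
it stick across reboots).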

With ipf enabled and 10 poorly written rules:

rx_lvl      pps

0           219,181
1           229,334
2           280,508
3           328,896
4           333,585
5           346,974
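
So just turning the knob from 0 up to 5 buys roughly a 58% improvement
in forwarded pps (346,974 vs 219,181) with the same ruleset loaded.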


Blasting for 10 seconds with the value set to 5, here are the before
and after outputs of netstat -i and netstat -q from running:
[4600X2-88-176]# ./netblast 192.168.44.1 500 10 10

start:             1165001022.659075049
finish:            1165001032.659352738
send calls:        5976399
send errors:       0
approx send rate:  597639
approx error rate: 0
[4600X2-88-176]#
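
(If I'm reading the netrate netblast arguments right, that's 10-byte
UDP payloads to 192.168.44.1 port 500 for 10 seconds, so the "approx
send rate" above is in packets per second: 5,976,399 sends / 10s ~=
597,639 pps offered.)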


# netstat -q
arpintrq:
         queue length: 0
         maximum queue length: 50
         packets dropped: 153
ipintrq:
         queue length: 0
         maximum queue length: 256
         packets dropped: 180561075
ip6intrq:
         queue length: 0
         maximum queue length: 256
         packets dropped: 0
atintrq1:
         queue length: 0
         maximum queue length: 256
         packets dropped: 0
atintrq2:
         queue length: 0
         maximum queue length: 256
         packets dropped: 0
clnlintrq:
         queue length: 0
         maximum queue length: 256
         packets dropped: 0
ppoediscinq:
         queue length: 0
         maximum queue length: 256
         packets dropped: 0
ppoeinq:
         queue length: 0
         maximum queue length: 256
         packets dropped: 0
# netstat -i
Name  Mtu   Network       Address              Ipkts Ierrs    Opkts Oerrs Colls
nfe0  1500  <Link>        00:13:d4:ae:9b:6b    38392   584     5517     0     0
nfe0  1500  fe80::/64     fe80::213:d4ff:fe    38392   584     5517     0     0
nfe0  1500  192.168.43/24 192.168.43.222       38392   584     5517     0     0
bge0* 1500  <Link>        00:10:18:14:15:12        0     0        0     0     0
bge1  1500  <Link>        00:10:18:14:27:d5 46026021 489390 213541721     0     0
bge1  1500  192.168.44/24 192.168.44.223    46026021 489390 213541721     0     0
bge1  1500  fe80::/64     fe80::210:18ff:fe 46026021 489390 213541721     0     0
bge2  1500  <Link>        00:10:18:14:38:d2 354347890 255587 19537142     0     0
bge2  1500  192.168.88/24 192.168.88.223    354347890 255587 19537142     0     0
bge2  1500  fe80::/64     fe80::210:18ff:fe 354347890 255587 19537142     0     0
wm0   1500  <Link>        00:15:17:0b:70:98 17816154    72       31     0     0
wm0   1500  fe80::/64     fe80::215:17ff:fe 17816154    72       31     0     0
wm1   1500  <Link>        00:15:17:0b:70:99     1528     0  2967696     0     0
wm1   1500  fe80::/64     fe80::215:17ff:fe     1528     0  2967696     0     0
lo0   33192 <Link>                                 3     0        3     0     0
lo0   33192 127/8         localhost                3     0        3     0     0
lo0   33192 localhost/128 ::1                      3     0        3     0     0
lo0   33192 fe80::/64     fe80::1                  3     0        3     0     0
# netstat -q
arpintrq:
         queue length: 0
         maximum queue length: 50
         packets dropped: 153
ipintrq:
         queue length: 0
         maximum queue length: 256
         packets dropped: 183066795
ip6intrq:
         queue length: 0
         maximum queue length: 256
         packets dropped: 0
atintrq1:
         queue length: 0
         maximum queue length: 256
         packets dropped: 0
atintrq2:
         queue length: 0
         maximum queue length: 256
         packets dropped: 0
clnlintrq:
         queue length: 0
         maximum queue length: 256
         packets dropped: 0
ppoediscinq:
         queue length: 0
         maximum queue length: 256
         packets dropped: 0
ppoeinq:
         queue length: 0
         maximum queue length: 256
         packets dropped: 0
# netstat -i
Name  Mtu   Network       Address              Ipkts Ierrs    Opkts Oerrs Colls
nfe0  1500  <Link>        00:13:d4:ae:9b:6b    38497   585     5596     0     0
nfe0  1500  fe80::/64     fe80::213:d4ff:fe    38497   585     5596     0     0
nfe0  1500  192.168.43/24 192.168.43.222       38497   585     5596     0     0
bge0* 1500  <Link>        00:10:18:14:15:12        0     0        0     0     0
bge1  1500  <Link>        00:10:18:14:27:d5 46026057 489390 217012400     0     0
bge1  1500  192.168.44/24 192.168.44.223    46026057 489390 217012400     0     0
bge1  1500  fe80::/64     fe80::210:18ff:fe 46026057 489390 217012400     0     0
bge2  1500  <Link>        00:10:18:14:38:d2 360324326 255587 19537143     0     0
bge2  1500  192.168.88/24 192.168.88.223    360324326 255587 19537143     0     0
bge2  1500  fe80::/64     fe80::210:18ff:fe 360324326 255587 19537143     0     0
wm0   1500  <Link>        00:15:17:0b:70:98 17816195    72       31     0     0
wm0   1500  fe80::/64     fe80::215:17ff:fe 17816195    72       31     0     0
wm1   1500  <Link>        00:15:17:0b:70:99     1528     0  2967696     0     0
wm1   1500  fe80::/64     fe80::215:17ff:fe     1528     0  2967696     0     0
lo0   33192 <Link>                                 3     0        3     0     0
lo0   33192 127/8         localhost                3     0        3     0     0
lo0   33192 localhost/128 ::1                      3     0        3     0     0
lo0   33192 fe80::/64     fe80::1                  3     0        3     0     0
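
Working the deltas between the two snapshots (assuming bge2 is the
receive side and bge1 the transmit side here):

   bge2 Ipkts:    360,324,326 - 354,347,890 = 5,976,436  (~= the 5,976,399 send calls)
   ipintrq drops: 183,066,795 - 180,561,075 = 2,505,720
   bge1 Opkts:    217,012,400 - 213,541,721 = 3,470,679  (~347Kpps forwarded)

and 3,470,679 + 2,505,720 = 5,976,399, so essentially every packet that
came in was either forwarded or dropped at ipintrq, and the forwarded
rate matches the 346,974 pps figure in the table above.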



>I therefore see very, very good grounds to expect that NetBSD would
>show much better performance if you increase bge interrupt mitigation.

Yup, it certainly seems so!



>That said: I see a very strong philosophical design difference between
>FreeBSD's polling machinery, and the interrupt-mitigation approaches
>variously implemented by Jason Thorpe in wm(4) and by myself in
>bge(4).  For the workloads I care about, the design-point tradeoffs in
>FreeBSD-4's polling are simply not acceptable.  I *want* kernel
>softint processing to pre-empt userspace processes, and even
>kthreads. I acknowledge that my needs are, perhaps, unusual.

There are certainly tradeoffs.  I guess for me, in a firewall capacity,
I want to be able to get into the box OOB when it's under
attack.  1Mpps is still considered a medium to heavy attack right
now, but with more and more botnets out there, it's only going to get
more commonplace :(  I guess I would like the best of both worlds: a
way to give priority to OOB access, be that the serial console or another
interface... but I don't see a way of doing that right now with the
interrupt-driven approach.





>Even so, I'd be glad to work on improving bge(4) tuning for workloads
>dominated by tinygrams.  The same packet rate as ttcp (over
>400 kpackets/sec on a 2.4GHz Opteron) seems like an achievable target
>--- unless there's a whole lot of CPU processing going on inside
>IP-forwarding that I'm wholly unaware of.

The AMD I am testing on is just a 3800 X2, so ~2.0GHz.



>At a receive rate of 123Mbyte/sec per bge interface, I see roughly
>5,000 interrupts per bge per second.  What interrupt rates are you
>seeing for each bge device in your tests?


After 10 seconds of blasting:

# vmstat -i
interrupt                                     total     rate
cpu0 softclock                              5142870       98
cpu0 softnet                                1288284       24
cpu0 softserial                                 697        0
cpu0 timer                                  5197361      100
cpu0 FPU synch IPI                                5        0
cpu0 TLB shootdown IPI                          373        0
cpu1 timer                                  5185327       99
cpu1 FPU synch IPI                                2        0
cpu1 TLB shootdown IPI                         1290        0
ioapic0 pin 14                                 1659        0
ioapic0 pin 15                                   30        0
ioapic0 pin 3                                 44586        0
ioapic0 pin 10                              2596838       49
ioapic0 pin 5                              11767286      226
ioapic0 pin 7                                 64269        1
ioapic0 pin 4                                   697        0
Total                                      31291574      602

# vmstat -i
interrupt                                     total     rate
cpu0 softclock                              5145604       98
cpu0 softnet                                1288376       24
cpu0 softserial                                 697        0
cpu0 timer                                  5201094      100
cpu0 FPU synch IPI                                5        0
cpu0 TLB shootdown IPI                          373        0
cpu1 timer                                  5189060       99
cpu1 FPU synch IPI                                2        0
cpu1 TLB shootdown IPI                         1291        0
ioapic0 pin 14                                 1659        0
ioapic0 pin 15                                   30        0
ioapic0 pin 3                                 44664        0
ioapic0 pin 10                              2596865       49
ioapic0 pin 5                              11873637      228
ioapic0 pin 7                                 64294        1
ioapic0 pin 4                                   697        0
Total                                      31408348      603

That was with hw.bge.rx_lvl=5.
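
Diffing the two vmstat -i snapshots and using the 100Hz cpu0 timer as a
clock (and assuming ioapic0 pin 5 is the busy bge, since it is the only
counter that really moves during the blast):

   elapsed:   (5,201,094 - 5,197,361) / 100  = ~37 seconds
   pin 5:      11,873,637 - 11,767,286       = 106,351 interrupts

which averages out to ~2,900 interrupts/sec over the whole interval, or
on the order of 10K/sec if essentially all of them landed inside the
10 second blast itself.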


>I've never seen that particular bug. I don't believe I have any actual
>5750 chips to try to reproduce it.  I do have access to: 5700, 5701,
>5705, 5704, 5721, 5752, 5714, 5715, 5780. (I have one machine with one
>5752; and the 5780 is one dual-port per HT-2000 chip, which means one
>per motherboard. But for most people's purposes, the 5780/5714/5715
>are indistinguishable).
>
>I wonder, does this problem go away if you crank up interrupt mitigation?

It's hard to reproduce, but if I use 2 generators to blast in one
direction, it seems to trigger it even with the value at 5:

Dec  1 10:21:29 r2-netbsd /netbsd: bge: failed on len 142?
Dec  1 10:21:29 r2-netbsd /netbsd: bge: failed on len 52?
Dec  1 10:21:29 r2-netbsd /netbsd: bge: failed on len 52?
Dec  1 10:21:29 r2-netbsd /netbsd: bge: failed on len 142?
Dec  1 10:21:29 r2-netbsd last message repeated 2 times
Dec  1 10:21:29 r2-netbsd /netbsd: bge: failed on len 52?
Dec  1 10:21:29 r2-netbsd /netbsd: bge: failed on len 142?
Dec  1 10:21:29 r2-netbsd /netbsd: bge: failed on len 52?
Dec  1 10:21:29 r2-netbsd /netbsd: bge: failed on len 142?
Dec  1 10:21:29 r2-netbsd /netbsd: bge: failed on len 52?
Dec  1 10:21:29 r2-netbsd /netbsd: bge: failed on len 52?
Dec  1 10:21:29 r2-netbsd /netbsd: bge: failed on len 142?
Dec  1 10:21:29 r2-netbsd /netbsd: bge: failed on len 52?
Dec  1 10:21:29 r2-netbsd last message repeated 3 times
Dec  1 10:21:29 r2-netbsd /netbsd: bge: failed on len 142?
Dec  1 10:21:29 r2-netbsd /netbsd: bge: failed on len 142?
Dec  1 10:21:29 r2-netbsd /netbsd: bge: failed on len 52?
Dec  1 10:21:29 r2-netbsd last message repeated 2 times
Dec  1 10:21:29 r2-netbsd /netbsd: bge: failed on len 142?
Dec  1 10:21:29 r2-netbsd /netbsd: bge: failed on len 52?
Dec  1 10:21:30 r2-netbsd last message repeated 2365 times


With ipfilter disabled, I am able to get about 680Kpps through the
box using 2 streams in one direction.  (As a comparison, RELENG_4 was
able to do 950Kpps, and with a faster CPU (AMD 4600), about 1.2Mpps.)

Note that with all these tests, the NetBSD box is essentially locked up
servicing interrupts.


         ---Mike