Subject: Re: NetBSD in BSD Router / Firewall Testing
To: Mike Tancsa <mike@sentex.net>
From: None <jonathan@dsg.stanford.edu>
List: tech-net
Date: 12/01/2006 14:31:31
In message <200612012036.kB1KaJJK053936@lava.sentex.ca>Mike Tancsa writes
>At 01:49 PM 12/1/2006, Jonathan Stone wrote:
>
>>As sometime principal maintainer of NetBSD's bge(4) driver, and the
>>author of many of the changes and chip-variant support subsequently
>>folded into OpenBSD's bge(4) by brad@openbsd.org, I'd like to speak
>>to a couple of points here.
>
>First off, thanks for the extended insights!  This has been a most 
>interesting exercise for me.

You're most welcome. (And thank you in turn for giving me a periodic
reminder that I really should write some text about interrupt
mitigation for NetBSD's bge(4) manpage.)

[Jonathan comments that we're 2 or 3 orders of magnitude away
from where switch VLAN insertion should matter].

>Unfortunately, my budget is not so high that I can afford to have a 
>high end gigE switch in my test area.  I started off with a linksys, 
>which I managed to hang under moderately high loads.  I had an 
>opportunity to test the Netgear and it was a pretty reasonable price 
>(~$650 USD) for what it claims its capable of (17Mpps). 

Hmm, so 17Mpps versus some 0.45 Mpps is a factor of about 38; let's call
it one and a half orders of magnitude :-/.

> Similarly, trunking, 
>although a bit wonky to configure (I am far more used to Cisco land) 
>at least works and doesnt seem to degrade overall performance.

"Trunking" is overloaded: it can be used to mean either link aggregation
or VLAN tagging.  I have found "trunking" causes enough
misunderstandings that I avoid using the term.  I assume here you mean
insertion of VLAN tags, as e.g., commonly used for switch-to-switch
links?


>>Second point: NetBSD's bge(4) driver includes support for runtime
>>manual tuning of interrupt mitigation.  I chose the tuning values
>>based on empirical measurements of large TCP flows on bcm5700s and bcm5704s.

[....]

>hw.bge.rx_lvl = 0

Yes.  I can never remember if it's a global or per-device-instance.
(My original code was global, others have asked for per-instance).

Snipping the following...

>#
>
>With ipf enabled and 10 poorly written rules.
>
>rx_lvl      pps
>
>0           219,181
>1           229,334
>2           280,508
>3           328,896
>4           333,585
>5           346,974

I believe the following were before-and-after stats for a 10-second
run:


>ipintrq:
>         queue length: 0
>         maximum queue length: 256
>         packets dropped: 180561075



>ipintrq:
>         queue length: 0
>         maximum queue length: 256
>         packets dropped: 183066795

Hmm. That indicates ipintrq dropped 2505720 packets during your
10-second run. Call it 250k packet drops/sec. Can you repeat your test
after increasing ipintrq via (as root)

	sysctl -w net.inet.ip.ifq.maxlen=1024

Or even increase to 2048? As I mentioned earlier, even for TCP traffic
(bidirectional ttcp streams have 1 ACK every 2 packets, or a 2:1 ratio
of full-size frames to minimum-size frames), I need to configure about
512 ipintrq entries per interface. The default value of 256 isn't
really appropriate for multiple GbE interfaces using interrupt
moderation; but it is at least better than the former [ex-CSRG]
default of 50 which dated back to 10Mbit Ethernet. (Or even 3Mbit?)
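
For anyone who wants to redo the arithmetic on their own counters, here
is a minimal sketch. The function names are made up for illustration,
and scaling the 512-entries-per-interface figure linearly with the
number of interfaces is only my rule of thumb, not anything the kernel
computes for you.

```python
def drop_rate(before, after, seconds):
    """Average packets dropped per second between two counter samples."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (after - before) / seconds


def maxlen_for(interfaces, per_interface=512):
    """Rule-of-thumb ipintrq depth: about 512 entries per GbE interface.

    The linear scaling is an assumption made for this sketch.
    """
    return interfaces * per_interface


if __name__ == "__main__":
    # "packets dropped" before and after the 10-second run quoted above.
    print(drop_rate(180561075, 183066795, 10))  # 250572.0
    # Two interfaces at 512 entries each.
    print(maxlen_for(2))                        # 1024
```

The resulting number is what you would hand to net.inet.ip.ifq.maxlen.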



>>I therefore see very, very good grounds to expect that NetBSD would
>>show much better performance if you increase bge interrupt mitigation.
>
>Yup, it certainly seems so!

I would hope NetBSD can do even better again, after attention to
runtime tunables; but see below.

>There are certainly tradeoffs. I guess for me in a firewall capacity, 
>I want to be able to get into the box OOB when its under 
>attack.  1Mpps is still considered a medium to heavy attack right 
>now, but with more and more botnets out there, its only going to get 
>more common place :(  I guess I would like the best of both worlds, a 
>way to give priority for OOB access, be that serial console or other 
>interface... But I dont see a way of doing that right now via Interrupt method.

Oh, it's doable, given patience; I've done it. The first step is to
mitigate hardware interrupts to a level where the CPU can keep up with
hardware interrupt servicing of a minimal-length traffic stream, with
CPU to spare. The second step is to tweak (or fine-tune) ipintrq max
depth to where ipintrq overflows *just* enough that processing the
non-overflowed packets (done at spl[soft]net) doesn't leave you
livelocked.  On the other hand, any fastpath forwarding that bypasses
ipintrq makes that approach impossible :).
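
To illustrate the second step, here is a toy model, not kernel code. It
assumes each mitigated hardware interrupt delivers one burst of packets
to a tail-drop queue, and that the soft interrupt then drains whatever
was queued. The function name and the burst sizes are invented for the
example. The point is only that the queue's maximum depth caps the
amount of softnet work per interrupt, and everything beyond it is
dropped cheaply at enqueue time.

```python
from collections import deque


def simulate(bursts, maxlen):
    """Return (processed, dropped) for a tail-drop queue of depth maxlen.

    bursts: packets delivered by each (mitigated) hardware interrupt.
    """
    queue = deque()
    processed = 0
    dropped = 0
    for burst in bursts:
        # Hardware interrupt: enqueue the burst, dropping on overflow.
        for pkt in range(burst):
            if len(queue) >= maxlen:
                dropped += 1
            else:
                queue.append(pkt)
        # Soft interrupt: drain everything that made it into the queue.
        while queue:
            queue.popleft()
            processed += 1
    return processed, dropped


if __name__ == "__main__":
    # Ten bursts of 1000 packets against a 256-entry queue.
    print(simulate([1000] * 10, 256))  # (2560, 7440)
```

In this model the per-interrupt softnet work never exceeds maxlen
packets, which is the knob you are tuning against available CPU.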


>>Even so, I'd be glad to work on improving bge(4) tuning for workloads
>>dominated by tinygrams.  The same packet rate as ttcp (over
>>400kpacket/sec on a 2.4Ghz Opteron) seems like an achievable target
>>--- unless there's a whole lot of CPU processing going on inside
>>IP-forwarding that I'm wholly unaware of.
>
>The AMD I am testing on is just a 3800 X2 so ~ 2.0Ghz.

Hmm. I can probably attempt to set up two bcm5721s in a similar box;
I'd have to look into load-generation.

>>At a receive rate of 123Mbyte/sec per bge interface, I see roughly
>>5,000 interrupts per bge per second.  What interrupt rates are you
>>seeing for each bge device in your tests?
>

[...]
>
>That was with hw.bge.rx_lvl=5

Sorry, I didn't keep your dmesg. Which interrupts were the bge devices?



>Its hard to reproduce, but if I use 2 generators to blast in one 
>direction, it seems to trigger it even with the value at 5
>
>Dec  1 10:21:29 r2-netbsd /netbsd: bge: failed on len 142?

If I'm reading -current correctly, the message indicates that the
hardware Tx queue filled up, and therefore an outbound packet was put
onto the software queue and IFF_OACTIVE was set, in the hope that the
packet will be picked up later when the Tx queue has space available.
But for that to work, bge_start() should return whenever it's called with
IFF_OACTIVE set.  bge_start() lacks that check. bge_intr() has a check before
it calls bge_start(), but the other calls to bge_start() (bge_tick())
don't do that. (Some calls check for ifq_snd non-NULL, but that may be
a hangover from Christos' initial import of Bill Paul's original code.)
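
Here is a toy model of why the missing check matters. This is not the
driver source: ToyIf, toy_start() and toy_txeof() are hypothetical
stand-ins, the flag value is illustrative, and the "failed" counter
stands in for the printf(). With no completions in between, every
unguarded call to the start routine hits the full ring again, while the
guarded version fails once and then waits for Tx space.

```python
from collections import deque

IFF_OACTIVE = 0x400  # illustrative flag bit for this model


class ToyIf:
    """Hypothetical stand-in for an interface with a small Tx ring."""

    def __init__(self, ring_slots):
        self.flags = 0
        self.snd = deque()           # software send queue
        self.ring_free = ring_slots  # free hardware Tx descriptors
        self.failed = 0              # times the ring was found full


def toy_start(ifp, check_oactive):
    """Move packets from the software queue to the Tx ring."""
    if check_oactive and (ifp.flags & IFF_OACTIVE):
        return  # ring known to be full; wait for a Tx completion
    while ifp.snd:
        if ifp.ring_free == 0:
            ifp.failed += 1          # the "failed on len" case
            ifp.flags |= IFF_OACTIVE
            return                   # packet stays on the software queue
        ifp.snd.popleft()
        ifp.ring_free -= 1


def toy_txeof(ifp, completed):
    """Tx completion: free descriptors and clear IFF_OACTIVE."""
    ifp.ring_free += completed
    if completed:
        ifp.flags &= ~IFF_OACTIVE


def run(check_oactive):
    ifp = ToyIf(ring_slots=2)
    ifp.snd.extend(range(5))
    for _ in range(3):               # three calls, no completions between
        toy_start(ifp, check_oactive)
    return ifp.failed


if __name__ == "__main__":
    print(run(check_oactive=False))  # 3
    print(run(check_oactive=True))   # 1
```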

Let's talk about that offline.  If nothing else, you could try ifdef'ing
out the printf().