Subject: Re: NetBSD and large pps
To: None <tls@rek.tjls.com>
From: Jonathan Stone <jonathan@dsg.stanford.edu>
List: tech-net
Date: 12/03/2004 11:39:15
In message <20041203150350.GA2746@panix.com>, Thor Lancelot Simon writes:

>On Fri, Dec 03, 2004 at 11:02:22AM +0200, Mihai CHELARU wrote:
>> 
>> Software tweaks:
>> 	- HZ 1000
>> 
>> So, I have 4000 IRQs/sec generated by scheduler. Rest of IRQs/sec is 
>
>I'm a little confused by this.  If you've set HZ=1000 (which is a very
>bad thing to set it to; the table for doing quick time computations based
>on HZ has an entry for 1024, but not for 1000), why are you getting *4000*
>interrupts per second?

Just a WAG but maybe circa 2048 interrupts on each of two CPUs?
If so, that smells like a bug to me.


Thor continues:

>Interrupt pacing or coalescing just means that the card buffers packets
>internally and only generates one interrupt every N packets, usually with
>a timer so that it generates an interrupt at least every N microseconds
>(this puts an upper bound on latency).  [...]

>Polling just ignores network interrupts completely, and enforces a strict
>latency/throughput trade-off by reading from the network device according
>to a timer.  This avoids interrupt-service overhead, at the expense of
>significant software complexity and of always making the _worst-case_
>latency decision, rather than treating the increased latency as an upper
>bound.
>
>Basically, if we knew how to set the coalescing thresholds and timers
>automatically, and we could get our interrupt code efficient enough,
>the first approach would always win, given cards that support it.  [...]

Yep. The bge driver provides a six-notch knob, one notch of which is (IIRC)
the original values from Bill Paul's FreeBSD driver.  The actual
values associated with the other 5 notches were, as I said elsewhere,
chosen by fairly unscientific, rough-and-ready means to (very roughly)
double packets-per-interrupt.  You will ask: why are the values there?

I've had _very_ good results from private, ugly, and proprietary code
implementing a _very_ simple feedback loop, that does "bang-bang"
control -- full on, or full off -- on the bge knobs, based on current
CPU load.  In practice, it's worked very well.  Well enough that I
never bothered with a tri-state control, one that either leaves the
knob alone, increases it by one notch, or decreases it by one notch,
based on recent idle-CPU.  If anyone tries that, I'd be interested to
hear how it works.
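
To make the two control schemes concrete, here is a rough userland
sketch in C.  It is not the private code mentioned above, and none of
the names come from the bge driver: the function names, the 0..5 notch
numbering, the assumption that a higher notch means more packets per
interrupt, and the idle-percentage thresholds are all made up for
illustration.

```c
#include <assert.h>

/* Hypothetical notch range; the post only says the knob has six notches. */
enum { NOTCH_MIN = 0, NOTCH_MAX = 5 };

/*
 * Bang-bang control: full on or full off.
 * Assumes a higher notch coalesces more packets per interrupt.
 * idle_pct is recent idle CPU in percent; low_idle_pct is a made-up
 * threshold below which the CPU counts as loaded.
 */
int
bangbang_notch(int idle_pct, int low_idle_pct)
{
	assert(idle_pct >= 0 && idle_pct <= 100);
	return (idle_pct < low_idle_pct) ? NOTCH_MAX : NOTCH_MIN;
}

/*
 * Tri-state control: leave the knob alone, go up one notch,
 * or go down one notch, based on recent idle CPU.
 */
int
tristate_notch(int cur, int idle_pct, int low_idle_pct, int high_idle_pct)
{
	assert(cur >= NOTCH_MIN && cur <= NOTCH_MAX);
	assert(low_idle_pct <= high_idle_pct);

	if (idle_pct < low_idle_pct && cur < NOTCH_MAX)
		return cur + 1;		/* CPU busy: coalesce harder */
	if (idle_pct > high_idle_pct && cur > NOTCH_MIN)
		return cur - 1;		/* CPU mostly idle: favour latency */
	return cur;			/* in between, or already at a limit */
}
```

The gap between the two thresholds in the tri-state version is there
so the knob does not flap between adjacent notches when idle CPU
hovers around a single cut-off.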

If you try this, you probably want to tweak the settings periodically
from hardclock(). Heavy NIC input load can, as Mihai observed, block
out userspace, and even softclock(), for extended time periods.
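
A rough sketch of the periodic part, again in plain userland C.  The
struct, the function name, the 20% threshold and the 0/5 notch values
are hypothetical; in a kernel the per-tick call would sit in or near
hardclock() for the reason given above (softclock() and userspace may
not get to run under heavy input load), and the actual write to the
NIC's coalescing registers is left out.

```c
/* Hypothetical tuning state; none of these names are from NetBSD. */
struct tune_state {
	unsigned ticks;		/* clock ticks since the last adjustment */
	unsigned idle_ticks;	/* of those, ticks that found the CPU idle */
	unsigned period;	/* re-evaluate the knob every 'period' ticks */
	int notch;		/* current knob position, assumed 0..5 */
};

/*
 * Call once per clock tick.  Every 'period' ticks, compute the idle
 * percentage over the window and apply bang-bang control.
 * Returns 1 if the notch changed (the caller would then reprogram
 * the card), 0 otherwise.
 */
int
tune_tick(struct tune_state *ts, int cpu_was_idle)
{
	unsigned idle_pct;
	int new_notch, changed;

	ts->ticks++;
	if (cpu_was_idle)
		ts->idle_ticks++;
	if (ts->ticks < ts->period)
		return 0;

	/* ts->ticks is at least 1 here, so the division is safe. */
	idle_pct = 100 * ts->idle_ticks / ts->ticks;

	/* Made-up threshold: under 20% idle counts as loaded. */
	new_notch = (idle_pct < 20) ? 5 : 0;

	changed = (new_notch != ts->notch);
	ts->notch = new_notch;
	ts->ticks = 0;
	ts->idle_ticks = 0;
	return changed;
}
```

With period set to HZ this re-evaluates the knob about once a second;
a shorter period reacts faster at the cost of a noisier idle estimate.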