Subject: Re: SMP re-entrancy in kernel drivers/"bottom half?"
To: None <ragge@ludd.luth.se>
From: Jonathan Stone <jonathan@dsg.stanford.edu>
List: tech-kern
Date: 02/22/2005 13:28:11
In message <200502221408.PAA22904@father.ludd.luth.se>, ragge@ludd.luth.se writes:
>> 

[Network stack: CPU time in hardware interrupts versus software interrupts]

>During my tests a while ago, the receiving machine CPU were 
>
>- 20% HW interrupt
>- 30% SW interrupt
>- 50% copyout()
>
>when the receiving machine was at 100% load.  This varied a lot
>depending of socket buffer sizes, MTU sizes etc.  

Sure. But thanks very much for the numbers; they are in the ballpark I
expected.

If I had my 'druthers, I'd like to see our TCP restructured along the
same lines as the TCP in IRIX. Ten years ago, I understand IRIX could
handle 10 gigabits (GSN, basically HIPPI on steroids) with one CPU
handling interrupts and packet demux, a second running the IP/TCP
code, and a third doing the bcopy()s to userspace.
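
Very roughly, the shape I have in mind is something like the sketch
below.  All the names are invented (pktq_*, nic_rx() and so on are not
existing interfaces), it uses pthread-style threads purely for
illustration, and the real thing would of course be mbufs and kernel
threads with proper locking; it is only meant to show the hand-off
between the three CPUs.

#include <pthread.h>

struct pkt;                                     /* opaque packet handle */
struct pktq;                                    /* bounded hand-off queue */

/* Hypothetical queue primitives and stage hooks, shown for shape only. */
extern void        pktq_put(struct pktq *, struct pkt *);
extern struct pkt *pktq_get(struct pktq *);     /* blocks until work arrives */
extern struct pkt *nic_rx(void);                /* next received frame */
extern void        tcp_input(struct pkt *);     /* IP/TCP input processing */
extern void        sb_copyout(struct pkt *);    /* copy payload to the user buffer */

static struct pktq *q_proto, *q_copy;

/* CPU 0: take the interrupt and classify the frame; no TCP work here. */
static void *
demux_stage(void *arg)
{
        for (;;)
                pktq_put(q_proto, nic_rx());
}

/* CPU 1: run IP/TCP input on whatever the demux CPU queued. */
static void *
tcpip_stage(void *arg)
{
        for (;;) {
                struct pkt *p = pktq_get(q_proto);
                tcp_input(p);
                pktq_put(q_copy, p);
        }
}

/* CPU 2: the copyout() half -- move payload into the receiving process. */
static void *
copyout_stage(void *arg)
{
        for (;;)
                sb_copyout(pktq_get(q_copy));
}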

FreeBSD-5 has a fine-grained TCP, but with a `run to completion' model,
where each invocation of the networking stack processes a packet from
the NIC all the way to the top of the protocol stack. I don't know,
but I'd guess that was done to amortize the cost of context-switches
to dedicated interrupt-handling threads. Perhaps someone who knows
more would care to comment?
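
For contrast, my (possibly wrong) understanding of the run-to-completion
shape, using the same hypothetical hooks as the sketch above: everything
happens in one call chain on the CPU that took the interrupt, so there
is no hand-off cost, but also no pipelining.

extern void sb_append(struct pkt *);    /* hypothetical: queue data on the
                                           socket and wake the reader; the
                                           copyout happens later, in the
                                           reader's own context */

/* One call chain, one CPU: demux, protocol input, socket wakeup. */
static void
nic_rx_to_completion(void)
{
        struct pkt *p;

        while ((p = nic_rx()) != NULL) {
                tcp_input(p);
                sb_append(p);
        }
}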

That's a very defensible choice, but probably not my first choice for
spreading 10-Gigabit traffic across two- to eight-way SMP systems.  I
expect 10GbE and 2- to 8-way SMP to be commonplace in 6 to 12 months,
as dual-core CPUs reach the x86 mass server market (as if a "server"
niche in the mass market isn't a contradiction in terms).

There are some important caveats about the relative cost of context
switches on modern CPU architectures with address-space IDs (ASIDs) or
equivalent, versus the x86 "blow away the entire TLB/cache" approach,
but let's leave those for now.

Anyway: your numbers suggest we could get maybe 25% more throughput by
moving the 20% of hardware-interrupt time to a second CPU.  Maybe a
little more, if we also move the actual output and register-banging to
that second CPU (rework the send routines to just enqueue packets on a
software queue, and drain that queue on the CPU handling the receive
interrupts).  Maybe a little more from splitting the cache footprint
across the two CPUs' caches.  Less the lock-contention overhead, of
course.
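
A minimal sketch of that send-side split, again with invented names
(none of this is the existing ifnet or driver code, and the real
version would use mbufs, IFQ-style queues and kernel locks rather than
a pthread mutex): the protocol CPU only appends to a software transmit
queue, and the CPU that already owns the NIC's interrupts drains it and
touches the hardware.

#include <pthread.h>

struct mbuf;                                    /* stand-in for the real mbuf */

extern void hw_tx_kick(struct mbuf *);          /* hypothetical: fill descriptors,
                                                   poke the transmit doorbell */

#define SOFTTXQ_LEN     512

static pthread_mutex_t  txq_lock = PTHREAD_MUTEX_INITIALIZER;
static struct mbuf     *txq[SOFTTXQ_LEN];
static unsigned         txq_prod, txq_cons;

/*
 * Called from the protocol CPU's output path: just enqueue.
 * No register access here, so no fighting over the device.
 */
static int
softtx_enqueue(struct mbuf *m)
{
        int ok;

        pthread_mutex_lock(&txq_lock);
        ok = (txq_prod - txq_cons) < SOFTTXQ_LEN;
        if (ok)
                txq[txq_prod++ % SOFTTXQ_LEN] = m;
        pthread_mutex_unlock(&txq_lock);
        return ok;                      /* caller drops or backs off on overflow */
}

/*
 * Called on the CPU already handling receive interrupts, e.g. at the
 * tail of the rx handler: drain the queue and bang the registers here.
 */
static void
softtx_drain(void)
{
        struct mbuf *m;

        for (;;) {
                pthread_mutex_lock(&txq_lock);
                m = (txq_cons != txq_prod) ? txq[txq_cons++ % SOFTTXQ_LEN] : NULL;
                pthread_mutex_unlock(&txq_lock);
                if (m == NULL)
                        break;
                hw_tx_kick(m);
        }
}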


I find that a 2.4GHz Opteron with DDR400 can sustain somewhere around
330 Mbyte/sec[*], with standard MTU and heavy interrupt mitigation
(via the bge(4) sysctls), before the CPU is saturated.  Not
encouraging, given that 10GbE is dropping in price, and that both
Intel and AMD are promising dual-core CPUs rather than
exponentially-faster CPUs.  I am really curious to see how much
Yamamoto-san's patch can gain on a dual Opteron.

[*] On my mutant kernel-source tree. May not reflect stock NetBSD-2.x. YMMV.


>Note that the HW interrupts also includes transmit interrupt,
>and that the transmitted ACKs are embedded in the SW stuff.

Mmm. If I could get hardware designers to do what I want[*], there
wouldn't _be_ a transmit-done interrupt in the normal case. Either
poll during Rx-done interrupts, or periodically during output calls
for fresh transmissions.  Add a watchdog-like timer (0.5 Hz or so) for
periodic transmit-only traffic (non-flood pings, UDP queries, etc.)
and you're done.
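
In driver terms, the arrangement I have in mind looks roughly like the
sketch below.  The function names are made up; this is not what wm(4)
or bge(4) actually do today.  The idea is just to reclaim completed
transmit descriptors opportunistically from the receive interrupt and
the start routine, and let a slow callout catch the transmit-only case.

extern int  hw_txdesc_done(void);       /* hypothetical: completed descriptors pending? */
extern void hw_txdesc_reclaim(void);    /* hypothetical: free completed descriptors */
extern void txreclaim_rearm(void);      /* hypothetical: reschedule the ~2-second callout */

/* Piggy-back reclaim on the receive interrupt we're taking anyway. */
static void
drv_rx_intr(void)
{
        /* ... process received frames ... */
        if (hw_txdesc_done())
                hw_txdesc_reclaim();
}

/* And on every fresh transmission, before filling new descriptors. */
static void
drv_start(void)
{
        if (hw_txdesc_done())
                hw_txdesc_reclaim();
        /* ... map mbufs, fill descriptors, kick the chip ... */
}

/*
 * Slow watchdog-style callout (0.5 Hz or so) catches transmit-only
 * traffic -- UDP queries, non-flood pings -- where no receive
 * interrupt will come along to trigger a reclaim.
 */
static void
drv_txreclaim_tick(void *arg)
{
        if (hw_txdesc_done())
                hw_txdesc_reclaim();
        txreclaim_rearm();
}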

ISTR Jason once said the wm(4) hardware was designed to work rather
like that, though I dunno if our wm(4) driver goes as far as that.
And if (dim) memory serves, certain Big Iron Unices handled their Big
Iron NICs that way, too.


[*] That was the one big advantage of building bleeding-edge hardware
ten years ago, for what is now called RDMA.  Though back then, the
paper reviewers said basically: ``Not interesting. Who needs TCP to
run at memory speeds?''.  So it goes...