Subject: Re: SMP re-entrancy in kernel drivers/"bottom half?"
To: David Laight <david@l8s.co.uk>
From: Jonathan Stone <jonathan@dsg.stanford.edu>
List: tech-kern
Date: 02/22/2005 16:07:23
In message <20050222235717.GL4454@snowdrop.l8s.co.uk>,
David Laight writes:

>My 'gut feel' also is that passing data between cpus is a cost hit.
>Doing all the RX processing on a single cpu probably isn't a problem,
>well not until we are talking about 8+ way systems that are doing nothing
>else.

David, I find that really, *really* frustrating, given that I've said
explicitly that receive processing is already a bottleneck for me, on
a single-CPU system.

I fully expect to have 10Gbit interfaces at work this year, and at
home next year (I had fibre gigabit NICs at home at a similar
pricepoint).  But on a single-CPU Opteron I cannot push more than some
330-odd megabytes/sec, which is about half the potential (PCI-X
limited) bandwidth of a this-year PCI-X 10GbE NIC, never mind next
year's PCI-express 10Gbit NIC.

As for 8-way systems "doing nothing else": by end of year, that will
be a 4-socket Opteron motherboard with four dual-core CPUs. When I
have such a system, darn right it'll be doing nothing else.  Are we
talking past each other due to different assumptions about workload?



>OTOH getting the TX side to run concurrently through the stack is somewhat
>easier, and gives greater benefit - since you should be able to do all the
>work in the context of the caller.  For a 'ring based' ethernet controller
>this isn't actually that hard.

David (no sarcasm here): huh? I must be misunderstanding you, because
that seems bizarre.  How on earth do you get two CPUs running in the
upper-half context of the same process "easily" and "with greater
benefit", starting from where NetBSD is now?

The kernel profile data I have shows that the stalls from banging bits
on "same-day-service" 32-bit PCI register reads and writes are bad
enough that I'd consider burning a CPU to do that portion of it.  The
wins are very modest, compared to restructuring the stack to allow one
CPU to do the copyin()s and another to run the copied-in data down the
stack; but the smaller fruit is hanging much, much lower.
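
To make the "one CPU does the copyin()s, another runs the copied-in
data down the stack" split concrete, here is a minimal userland sketch
of that two-stage shape.  It is an illustration only, not NetBSD kernel
code: POSIX threads stand in for the two CPUs, memcpy() stands in for
copyin(), and a byte sum stands in for protocol processing.  Every
name in it (struct pipeline, copy_stage, process_stage, run_pipeline,
CHUNK_SIZE, RING_SLOTS) is hypothetical.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE 1024  /* hypothetical per-chunk copy size */
#define RING_SLOTS 8     /* hypothetical ring depth */

struct chunk {
    unsigned char data[CHUNK_SIZE];
    size_t len;
};

struct pipeline {
    struct chunk slots[RING_SLOTS];
    unsigned head;           /* next slot to fill; touched only by the copier */
    unsigned tail;           /* next slot to drain; touched only by the processor */
    unsigned count;          /* filled slots; protected by lock */
    int done;                /* copier has finished; protected by lock */
    pthread_mutex_t lock;
    pthread_cond_t changed;  /* signalled whenever count or done changes */
    const unsigned char *src;
    size_t src_len;
    uint64_t sum;            /* written only by the processor */
};

/* Stage 1: stands in for the CPU doing the copyin()s. */
static void *copy_stage(void *arg)
{
    struct pipeline *p = arg;
    size_t off = 0;

    while (off < p->src_len) {
        size_t n = p->src_len - off;
        if (n > CHUNK_SIZE)
            n = CHUNK_SIZE;

        /* Wait for a free slot, then copy with the lock dropped. */
        pthread_mutex_lock(&p->lock);
        while (p->count == RING_SLOTS)
            pthread_cond_wait(&p->changed, &p->lock);
        pthread_mutex_unlock(&p->lock);

        struct chunk *c = &p->slots[p->head];
        memcpy(c->data, p->src + off, n);
        c->len = n;
        off += n;

        /* Publish the filled slot to the other stage. */
        pthread_mutex_lock(&p->lock);
        p->head = (p->head + 1) % RING_SLOTS;
        p->count++;
        pthread_cond_signal(&p->changed);
        pthread_mutex_unlock(&p->lock);
    }

    pthread_mutex_lock(&p->lock);
    p->done = 1;
    pthread_cond_signal(&p->changed);
    pthread_mutex_unlock(&p->lock);
    return NULL;
}

/* Stage 2: stands in for the CPU running data down the stack. */
static void *process_stage(void *arg)
{
    struct pipeline *p = arg;

    for (;;) {
        pthread_mutex_lock(&p->lock);
        while (p->count == 0 && !p->done)
            pthread_cond_wait(&p->changed, &p->lock);
        if (p->count == 0) {  /* ring drained and copier finished */
            pthread_mutex_unlock(&p->lock);
            break;
        }
        pthread_mutex_unlock(&p->lock);

        /* Process the oldest filled slot with the lock dropped. */
        struct chunk *c = &p->slots[p->tail];
        for (size_t i = 0; i < c->len; i++)
            p->sum += c->data[i];

        /* Hand the slot back to the copier. */
        pthread_mutex_lock(&p->lock);
        p->tail = (p->tail + 1) % RING_SLOTS;
        p->count--;
        pthread_cond_signal(&p->changed);
        pthread_mutex_unlock(&p->lock);
    }
    return NULL;
}

/* Runs both stages over src; returns 0 on success, -1 on failure. */
static int run_pipeline(const unsigned char *src, size_t len, uint64_t *sum_out)
{
    struct pipeline p;
    pthread_t copier, processor;

    memset(&p, 0, sizeof(p));
    pthread_mutex_init(&p.lock, NULL);
    pthread_cond_init(&p.changed, NULL);
    p.src = src;
    p.src_len = len;

    int rc = -1;
    if (pthread_create(&processor, NULL, process_stage, &p) == 0) {
        if (pthread_create(&copier, NULL, copy_stage, &p) == 0) {
            pthread_join(copier, NULL);
            rc = 0;
        } else {
            /* No copier: tell the processor to stop waiting. */
            pthread_mutex_lock(&p.lock);
            p.done = 1;
            pthread_cond_signal(&p.changed);
            pthread_mutex_unlock(&p.lock);
        }
        pthread_join(processor, NULL);
    }
    pthread_cond_destroy(&p.changed);
    pthread_mutex_destroy(&p.lock);
    if (rc == 0)
        *sum_out = p.sum;
    return rc;
}
```

Calling run_pipeline(buf, len, &sum) pushes buf through both stages.
Each stage holds the lock only to claim or publish a slot, so the copy
and the processing can overlap.  Note that every chunk still crosses
from one thread's cache to the other's, which is the inter-CPU cost
David is worried about; the sketch shows the structure and says
nothing about whether it wins.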