Subject: Re: splx() optimization [was Re: SMP re-entrancy in "bottom half" drivers]
To: Bill Studenmund <wrstuden@netbsd.org>
From: Jonathan Stone <jonathan@dsg.stanford.edu>
List: tech-kern
Date: 06/09/2005 20:59:44
In message <20050610030657.GA10903@netbsd.org>,
Bill Studenmund writes:

>On Wed, Jun 08, 2005 at 09:14:39AM +0900, YAMAMOTO Takashi wrote:
[...]

>> Jonathan, i can understand your frustration.
>> however, i don't think there is a sane shortcut.
>
>I've been trying to think of one. While I agree it'd be best to do things
>cleanly, I'd really really like to let Jonathan get working on SMP network
>processing.
>
>I _think_ things would work if we carved out a perimeter in the network
>stack area, and required a second lock to access it. Thus either biglock
>or driver-specific code could grab it and, as I understand the issues, we
>won't run into "you can get interrupted by biglock" concerns. I realize
>that establishing such a perimeter would be a chunk of the SMP work...
>
>However I then thought about splvm.

Yep, that too.  I was more worried about the ppp code and interactions
with spltty().  

I also had a phone conversation with Stephan Uphoff. I hope to
summarize the high points of that tomorrow. Meantime, I think a
quantitative description would be useful, so here goes...

My (first) target is a unidirectional TCP stream running at line rate
on 10 Gbit Ethernet.  The payload side of that stream is 800,000
packets/sec of 1500 bytes each.  The non-data, ACK-only side is at
least another 400,000 packets/sec.  That's an aggregate packet rate
of 1.2 million packets/sec.

Let's work through some numbers for the obvious, unsophisticated
approach where we add explicit locks to existing queues, replacing
splXXX()/splx() one-for-one, and add similar lock/unlock around driver
entrypoints.  In other words, basically as in Yamamoto-san's
experimental patch.  (I'll leave out receive-side driver costs for
now, solely to avoid vagaries due to different interrupt-mitigation
rates.)

We will receive 0.8 million packets per second.  For each inbound
packet, the "simplistic" SMP-safe approach will acquire the ip_input
queue lock, append that one packet, then release the lock.  Softint
processing will again acquire the lock, extract one packet, then
release the lock.
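
To make the cost pattern concrete, here is a sketch of what that
one-for-one replacement looks like.  The names (pkt_queue, xcpu_lock_*,
pktq_*) are made up purely for illustration, not real NetBSD symbols;
the shape is what matters: four lock operations per packet just to get
it through the ip_input queue.

/*
 * Illustrative sketch only: pkt_queue and xcpu_lock_* are made-up
 * names, not real NetBSD code.  The point is the cost pattern: two
 * inter-CPU synchronization events to put a packet on the queue, and
 * two more to take it off again.
 */
#include <sys/mbuf.h>

struct xcpu_lock {                      /* stand-in for whatever     */
        volatile unsigned int xl_word;  /* inter-CPU lock we pick    */
};
void    xcpu_lock_acquire(struct xcpu_lock *);
void    xcpu_lock_release(struct xcpu_lock *);

struct pkt_queue {
        struct xcpu_lock pq_lock;       /* replaces splnet()/splx()  */
        struct mbuf     *pq_head;
        struct mbuf     *pq_tail;
};

/* Driver receive path: one lock round-trip per inbound packet. */
void
pktq_enqueue(struct pkt_queue *pq, struct mbuf *m)
{
        xcpu_lock_acquire(&pq->pq_lock);        /* sync event #1 */
        m->m_nextpkt = NULL;
        if (pq->pq_tail != NULL)
                pq->pq_tail->m_nextpkt = m;
        else
                pq->pq_head = m;
        pq->pq_tail = m;
        xcpu_lock_release(&pq->pq_lock);        /* sync event #2 */
}

/* Softint path: another lock round-trip to pull one packet back out. */
struct mbuf *
pktq_dequeue(struct pkt_queue *pq)
{
        struct mbuf *m;

        xcpu_lock_acquire(&pq->pq_lock);        /* sync event #3 */
        if ((m = pq->pq_head) != NULL) {
                pq->pq_head = m->m_nextpkt;
                if (pq->pq_head == NULL)
                        pq->pq_tail = NULL;
                m->m_nextpkt = NULL;
        }
        xcpu_lock_release(&pq->pq_lock);        /* sync event #4 */
        return m;
}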

Each lock acquisition or release requires an interprocessor
synchronization event.  Those are expensive, irrespective of whether
it's a lockmgr lock, a newlock-branch lock, a lighter-weight lock, or
a raw spinlock.  I haven't measured the cost (I plan to do so over the
next few days), but let's assume it's hundreds of cycles.

That's a total of 4 * 0.8 million = 3.2 million synchronization
operations per second.  Acquiring and releasing a driver lock in order
to call an SMP-safe if_output routine for each of 0.4 million ACKs
incurs another 0.4 * 2 = 0.8 million synchronization events.

Running total is already 4 million synchronization events, and I
haven't even begun to consider memory allocation or deallocation:
again, a simplistic lock-per-object and an acquire/release per
allocation, and we're doubling the number of synchronization events.

Let's say roughly 8 million synchronization operations per second.

Plug in whatever cost (in CPU cycles) you choose.  100 cycles is a
nice round number. 100 * 8 million is 0.8 billion, or 40% of a 2GHz
CPU.
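
Pulling the pieces together into one back-of-the-envelope tally
(assuming, as above, 100 cycles per synchronization event and a 2GHz
CPU):

  ip_input enqueue + dequeue:   4 ops/pkt * 0.8M pkt/s  =  3.2M ops/s
  driver lock for ACK output:   2 ops/pkt * 0.4M pkt/s  =  0.8M ops/s
  mbuf alloc/free (lock each):  roughly doubles the above: 4.0M ops/s
                                                   total ~= 8.0M ops/s

  8M ops/s * 100 cycles/op = 0.8 billion cycles/s = 40% of a 2GHz CPU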

By now one should see that it's _pointless_ to pursue a strategy of
replacing existing splfoo()/splx() synchronization with SMP-safe
locks, *of any kind*, on a one-for-one basis: the cost is exorbitant.
A much better strategy is to use CPU-local synchronization, most of
the time, and resort to inter-CPU synchronization only when necessary
(e.g., to grow per-CPU pools of mbufs, or what-have-you).
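
As a sketch of what I mean by CPU-local (illustrative names only, not
the real pool(9) or mbuf machinery, and assuming callers are pinned to
the CPU, e.g. by interrupt context or splvm, while touching the
cache):

/*
 * Illustrative per-CPU mbuf cache: the common-case allocation touches
 * only CPU-local state; the inter-CPU lock is paid once per
 * PCPU_CACHE_BATCH mbufs, when the cache runs dry.
 */
#include <sys/mbuf.h>

#define PCPU_CACHE_BATCH        32

struct pcpu_mbuf_cache {
        struct mbuf     *pmc_head;      /* CPU-local free list, lock-free */
        int              pmc_count;
};

/* Made-up helper: takes the global (inter-CPU) lock once and returns
 * a chain of 'n' free mbufs linked through m_nextpkt. */
struct mbuf     *global_pool_grab_batch(int n);

struct mbuf *
pcpu_mbuf_get(struct pcpu_mbuf_cache *pmc)
{
        struct mbuf *m;

        if (pmc->pmc_head == NULL) {
                /* The only place we pay an inter-CPU synchronization
                 * event; its cost is amortized over the whole batch. */
                pmc->pmc_head = global_pool_grab_batch(PCPU_CACHE_BATCH);
                if (pmc->pmc_head == NULL)
                        return NULL;            /* global pool exhausted */
                pmc->pmc_count = PCPU_CACHE_BATCH;
        }
        m = pmc->pmc_head;
        pmc->pmc_head = m->m_nextpkt;
        m->m_nextpkt = NULL;
        pmc->pmc_count--;
        return m;                       /* common case: no lock at all */
}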

Or, where we can't do that, then at least batch the individual objects
being moved from queue to queue (or whatever) into substantial
sub-lists, and move the entire sublist in one atomic hit, to amortize
the synchronization cost over multiple objects.
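
Reusing the made-up names from the earlier sketch, the batched version
of the softint side would look roughly like this; the two
synchronization events are amortized over however many packets arrived
since the last pass:

/*
 * Batched dequeue: grab the queue lock once, steal the entire list,
 * then process it without any further inter-CPU synchronization.
 * pkt_queue/xcpu_lock_* are the illustrative names from the earlier
 * sketch, not real NetBSD symbols.
 */
void    ip_input(struct mbuf *);                /* real entry point */

void
pktq_drain_batched(struct pkt_queue *pq)
{
        struct mbuf *m, *list;

        xcpu_lock_acquire(&pq->pq_lock);        /* one acquire ...          */
        list = pq->pq_head;                     /* ... takes the whole list */
        pq->pq_head = pq->pq_tail = NULL;
        xcpu_lock_release(&pq->pq_lock);        /* ... and one release      */

        while ((m = list) != NULL) {            /* now entirely CPU-local   */
                list = m->m_nextpkt;
                m->m_nextpkt = NULL;
                ip_input(m);                    /* per-packet work, unlocked */
        }
}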

(Some of you will remember me making this case a half-dozen times over
the past 7 or 8 years; others will have heard it from me in person, by
email, or by phone, in just the past month or so.)

One person I earbashed with this argument suggested measuring the cost
of a simple atomic operation: add one atomic op on a "junk" variable
to splx(), then measure the delta. I can do that and report back next
week.
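
For concreteness, that measurement hack would look something like this
(i386 flavor shown; the junk variable, the macro, and the conditional
compilation are purely illustrative, not a proposed change to the real
splx()):

/*
 * Measuring the marginal cost of one inter-CPU atomic operation: add
 * a locked read-modify-write on a throw-away shared variable to
 * splx(), then compare cycle counts / throughput with and without it.
 */
static volatile unsigned long splx_junk;        /* shared "junk" variable */

#ifdef MEASURE_ATOMIC_COST
#define SPLX_JUNK_OP()  \
        __asm volatile("lock; incl %0" : "+m" (splx_junk))
#else
#define SPLX_JUNK_OP()  /* nothing */
#endif

void
splx(int s)
{
        SPLX_JUNK_OP();         /* the one extra inter-CPU event */

        /* ... existing splx() body: restore the IPL saved in 's' ... */
}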

(To be honest, when I replied to Yamamoto-san's "please don't" about
lock-per-IPL, I was already thinking of a tight atomic (recursive)
spinlock, not a lockmgr lock. But one of the useful things about that
'stopgap' is to help figure out which operations we do enough that we
really, really want them to be CPU-local, to avoid paying
inter-CPU-lock penalties. I would be *really* surprised if that list
matched exactly with IPL order, but who knows...)


>I think memory allocation will need to be sorted out before interrupt
>handlers can allocate memory, or at least some sort of fix will be needed
>(like something allocating at low IPL and feeding them to a pool for the
>driver at IP_NET).

Bill, I think I see what you're getting at. But if I'm reading between
the lines correctly, I think it's not going far enough.  (tho' I could
be misunderstanding you, and you're already proposing something on the
scale I see as necessary).


>To be honest, if it were only moving IP_AUDIO around, I'd say add defines
>to the main tree to do it, so we could all choose to run the code
>Jonathan's working on. Unfortunately splvm makes it more than that.

Well, the more I think about it, the more I agree with Stephan's
observation that the first item to attack is getting a device
interrupt on CPU 0 to schedule a softint on a _different_ CPU.

Otherwise (as I think Jason also commented today), since we take all
interrupts on the first CPU, and (as Stephan observed) the biglock
implies we run softints on the same CPU as the hard interrupt which
triggered them... all we're likely to do is pay dramatically more
overhead, yet most of the time we'll still run the networking code
(hardints and softints) on the one CPU.