Subject: Re: lock-free data structures
To: Garrett D'Amore <garrett_damore@tadpole.com>
From: jonathan@dsg.stanford.edu
List: tech-kern
Date: 01/04/2006 17:02:19
In message <43BC46FD.5090800@tadpole.com>,
"Garrett D'Amore" writes:

>All this conversation really seems to beg one question, which may be
>obvious to everyone else, but I figure I ought to ask it anyway.
>
>Is there a plan to move from the biglock to fine-grained locking?

I have (more than once) tried to outline a roadmap for moving
receive-side network processing out from under the big-lock.  I see
that as a non-negotiable requirement if NetBSD is to be a viable
contender against other open-source OSes for certain
performance-intensive applications.

As Steven M. Bellovin outlined some months ago, the driving factor is
that, for about the last 3 years, CPUs haven't been getting
exponentially faster; instead, we're getting more (possibly
exponentially more, possibly not) cores on individual chips.

Consider 10GbE, or better still, multiple 10GbE links.  For the sake
of discussion, let's disregard TCP offload engines (TOEs).  Using
standard 1500-byte packets, a 10GbE link can carry on the close order
of 800,000 1500-byte packets per second, plus (for TCP, with the
standard ACK-every-other-segment policy) another 400,000 ACK-only
packets per second.

That's unidirectional; when using full-duplex, or sinking and
retransmitting a half-duplex stream, one has to handle roughly 2.4
million packets/sec.

If one considers the falling cost of multi-core CPUs, then within a
couple of years, a viable OS will need to support a parallelized TCP
stack, in which one CPU handles interrupts, another CPU runs the TCP
state machine, and yet another CPU handles the copying to or from
user-space.  That's hardly rocket science; IRIX could do that
(with 6.4GBit/sec HIPPI and 64k frames) nearly 10 years ago.

You don't have to be a rocket scientist, or an LLNL atomic scientist,
to conclude that raising and lowering IPL some 5 million times per
second (one raise/lower to allocate and one to free, for each of those
~2.4 million packets) is a losing proposition.

I know how _I'd_ start to work toward that: 

1. Multiple per-CPU listheads for allocating and freeing mbufs and
clusters: one for use when already at IPL_NET, and one for use below
IPL_NET.  Separate allocators (mget(), mget_iplnet()) and deallocators
(mfree(), mfree_iplnet()); a sketch follows the list.  Mbuf allocation
and deallocation then become synchronization-free when called from
inside network drivers at IPL_NET, except in the (very rare) case of
having to acquire the big-lock to create new mbufs from a pool or
shuffle from a global (biglock-protected, at first) mbuf queue.

2. A new data structure (call it an mbq), with head and tail pointers
to a chain of mbuf packets, and a per-CPU mbq of pending newly-arrived
packets (a sketch of the mbq, and of the one-shot handoff into an ifq,
follows the list);

3a. A way to move all packets in an mbq into an ifq (e.g., ipintrq)
cheaply, in one synchronization operation, and

3b. A (per-processor?) mbq onto which low-level IPL_NET code can
temporarily enqueue packets, for later deposition onto a layer-2 input
ifq every K packets or when returning from an IPL_NET interrupt;

4. An interprocessor communication mechanism (interprocessor interrupt
or other) whereby one CPU busy handling hardware interrupts can wake up
a different CPU, and force that second CPU to start running our
current softint network-processing code (e.g., ipintr()); again, a
rough sketch follows the list.
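
To make item 1 concrete, here is the sort of thing I have in mind.
It's only a sketch with made-up names (mb_percpu, mget_refill_global()
and friends are not real code), assuming the caller hands in its own
per-CPU structure and the global pool stays biglock-protected for now:

#include <sys/param.h>
#include <sys/mbuf.h>

/*
 * Two per-CPU free lists: one touched only at IPL_NET, one touched
 * only below IPL_NET, so the common alloc/free paths need neither
 * an spl dance nor a lock.
 */
struct mb_percpu {
	struct mbuf *mpc_free_iplnet;	/* only ever touched at IPL_NET */
	struct mbuf *mpc_free_low;	/* only ever touched below IPL_NET */
};

/* Hypothetical slow path: refill from the global, biglock'd pool. */
struct mbuf *mget_refill_global(void);

/* Fast path for drivers already running at IPL_NET. */
static inline struct mbuf *
mget_iplnet(struct mb_percpu *mpc)
{
	struct mbuf *m;

	if ((m = mpc->mpc_free_iplnet) != NULL) {
		mpc->mpc_free_iplnet = m->m_next;
		m->m_next = NULL;
		return m;
	}
	/* Rare case: fall back to the global pool (this synchronizes). */
	return mget_refill_global();
}

/* Free path for drivers at IPL_NET: push onto the local list. */
static inline void
mfree_iplnet(struct mb_percpu *mpc, struct mbuf *m)
{
	m->m_next = mpc->mpc_free_iplnet;
	mpc->mpc_free_iplnet = m;
}

mget()/mfree() would look the same, just using mpc_free_low; the only
synchronization left is the rare refill or drain against the global
queue.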
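
For items 2, 3a and 3b, roughly this.  Again a sketch: struct mbq and
its functions are invented here; struct ifqueue, m_nextpkt, splnet()
and splx() are the existing ones, and no ifq_maxlen/drop accounting
is shown:

#include <sys/param.h>
#include <sys/mbuf.h>
#include <net/if.h>

/* Item 2: a trivial packet queue, linked through m_nextpkt. */
struct mbq {
	struct mbuf *mbq_head;
	struct mbuf *mbq_tail;
	u_int	     mbq_len;
};

/* Append one packet.  A per-CPU mbq has a single user, so no lock. */
static inline void
mbq_enqueue(struct mbq *q, struct mbuf *m)
{
	m->m_nextpkt = NULL;
	if (q->mbq_tail == NULL)
		q->mbq_head = m;
	else
		q->mbq_tail->m_nextpkt = m;
	q->mbq_tail = m;
	q->mbq_len++;
}

/*
 * Item 3a: splice the whole chain onto an ifq (e.g. ipintrq) in one
 * critical section, instead of one spl round-trip per packet.
 */
static void
mbq_flush_to_ifq(struct mbq *q, struct ifqueue *ifq)
{
	int s;

	if (q->mbq_head == NULL)
		return;

	s = splnet();			/* later, a per-ifq lock */
	if (ifq->ifq_tail == NULL)
		ifq->ifq_head = q->mbq_head;
	else
		ifq->ifq_tail->m_nextpkt = q->mbq_head;
	ifq->ifq_tail = q->mbq_tail;
	ifq->ifq_len += q->mbq_len;
	splx(s);

	q->mbq_head = q->mbq_tail = NULL;
	q->mbq_len = 0;
}

Item 3b then becomes a driver convention: the receive interrupt
handler mbq_enqueue()s onto its per-CPU mbq, and calls
mbq_flush_to_ifq() every K packets and once more before returning
from the interrupt.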
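
Item 4 is where the MI/MD trouble starts.  Purely as a statement of
the hook I'd want (nothing below exists; both cpu_* names are made up,
standing in for MD code each port would have to supply):

#include <sys/types.h>

void	cpu_set_softnet_pending(u_int cpuno);	/* made up: mark softnet runnable there */
void	cpu_send_softnet_ipi(u_int cpuno);	/* made up: deliver the IPI */

/*
 * Called by the CPU fielding hardware interrupts to make another
 * CPU run the softint network code (ipintr() and friends).
 */
void
netisr_kick(u_int cpuno)
{
	cpu_set_softnet_pending(cpuno);
	cpu_send_softnet_ipi(cpuno);
}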

Unfortunately, even starting to commit _that_ much keeps bogging
down in details, and in people who insist on making progress elsewhere
before touching the networking code.

Oh, and then Bill observed recently that IPIs are not well-ordered in
a machine-independent way, relative to other interrupts (in other
words, the ordering is machine-DEpendent).  Which makes the idea of
parallelizing the networking code somewhat moot...