Subject: Re: lock-free data structures
To: David Laight <david@l8s.co.uk>
From: jonathan@dsg.stanford.edu
List: tech-kern
Date: 01/05/2006 17:41:51
In message <20060105222457.GJ12021@snowdrop.l8s.co.uk>,
David Laight writes:

[..]

>> Apologies if I misremember the exact details, it's been... what, a decade?
>
>What I intended to say is that the rmw cycle done by an exchange instruction
>(or a compare-and-exchange) is much the same expensive operation as
>acquiring a lock (given you expect it to be uncontended).
>
>The lock release is typically free - since it gets snooped by the next
>request.
>
>There is also the issue that with expensive (ie slow) hardware rmw cycles
>the lock memory location itself becomes a 'hotspot' so you need finer
>grain locking.

Well, but here you are taking as axiomatic that the atomic operation
*is* implemented as a hardware lock on the memory subsystem.  Whereas
ll/sc, or CAS/CAS2, aren't specified that way and typically don't
actually work that way.
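
For concreteness, here is a minimal sketch (C11 atomics purely for
illustration; the kernel's own atomic primitives would stand in here)
of a lock-free LIFO push: when the compare-and-swap loses a race it
simply retries, and at no point is the memory subsystem held locked
across the read-modify-write.

/*
 * Illustration only, not kernel code: a lock-free LIFO push built on
 * compare-and-swap.  If another CPU wins the race the CAS fails and
 * the loop retries with the freshly observed head.
 */
#include <stdatomic.h>

struct node {
	struct node *next;
};

static _Atomic(struct node *) head;

static void
lifo_push(struct node *n)
{
	struct node *old = atomic_load_explicit(&head, memory_order_relaxed);

	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak_explicit(&head, &old, n,
	    memory_order_release, memory_order_relaxed));
}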

>But yes we need per-cpu free lists [1], per-cpu stats.
>I'd go for ditching all these separate pools we currently have for
>a decent SMP allocator.

"Decent" is in the eye of the beholder. Baed on the numbers I posted
earlier, I specifically want multiple {allocator, deallocator} pairs
for the same (type-stable) kind of underlying object, viz, mbufs, mbuf
clusers, and (mbuf header plus cluster) pairs. 

A patch is probably shorter than text at this point....
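
Very roughly, the shape is something like this (every name here is made
up for illustration; it is a sketch of the idea, not the patch):

/*
 * Hypothetical sketch: one per-CPU cache per type-stable object class
 * (mbuf, cluster, mbuf-plus-cluster pair), so the common-case allocate
 * and free on a given CPU never take a shared lock; only the
 * refill/spill path touches the global pool.
 */
#include <sys/types.h>		/* size_t, u_int */

#define	SKETCH_MAXCPUS	64	/* placeholder; real code would size by ncpu */

struct pcpu_cache {
	void	*pc_freelist;	/* per-CPU LIFO of free objects */
	u_int	 pc_count;	/* objects currently cached here */
	u_int	 pc_limit;	/* spill back to global pool above this */
};

struct obj_class {
	size_t			oc_size;			/* object size */
	struct pcpu_cache	oc_cache[SKETCH_MAXCPUS];	/* per-CPU caches */
	/* global, locked, backing pool goes here */
};

enum { OC_MBUF, OC_CLUSTER, OC_MBUF_CLUSTER, OC_NCLASSES };

The point being that mbufs, clusters, and (header plus cluster) pairs
each get their own per-CPU cache, so the hot allocate/free paths stay
CPU-local and only draining or refilling a cache takes the global lock.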



>For TCP is it probably enough to get data for separate connections
>running in parallel....

Do you mean multiple 10GbE streams in parallel? That'd be nice, but
seeing as we currently can't handle more than about 500 Mbyte/sec, we
have a long way to go to get there.

Conversely, if what you're trying to say is something like

  ``multiple single 1GbE streams running in parallel, using at most
   1 CPU per stream, is probably enough''

then I disagree. Down that path lies the same well-known [1] set of
naive [2] design choices, which are well known to yield a stack in which
single-stream performance is slower, considerably slower, than the
status quo ante: namely, well-tuned, biglock, spl-synchronized code.

[1, 2]: as seen by people who have prior experience with parallelising
high-performance TCP/IP stacks for SMP hardware, for single flows
fatter than a single CPU can handle.  Think IRIX.


>	David
>
>[1] and being aware of the code that tends to 'pump' free items from
>one cpu to another.....