Subject: Re: lock-free data structures
To: None <jonathan@dsg.stanford.edu>
From: David Laight <david@l8s.co.uk>
List: tech-kern
Date: 01/05/2006 22:24:57
On Thu, Jan 05, 2006 at 02:08:54PM -0800, jonathan@dsg.stanford.edu wrote:
> 
> ... I think what you are trying to say is something like this:
> 
> Peterson's algorithm works only on CPUs with strong memory
> ordering. One example of systems where this assumption does not
> hold is SPARC multiprocessor configurations with store buffers.

Sounds good....
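
For reference, a minimal sketch of Peterson's algorithm for two
threads - the comment marks the spot where a store buffer breaks it:

	/*
	 * Peterson's algorithm for threads 0 and 1.  On a CPU with
	 * store buffers the store to flag[id] can still be sitting in
	 * the buffer when flag[other] is read, so both threads can get
	 * past the test and enter the critical section together.
	 */
	static volatile int flag[2];
	static volatile int turn;

	void
	peterson_lock(int id)
	{
		int other = 1 - id;

		flag[id] = 1;
		turn = other;
		/* a store/load barrier is needed here on weak ordering */
		while (flag[other] && turn == other)
			continue;
	}

	void
	peterson_unlock(int id)
	{
		flag[id] = 0;
	}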

> >The thing is that for any sort of compare+exchange instruction to be
> >useful it must do a locked bus cycle 
> 
> I'm boggled. David, have you never heard of ll/sc, or are you trying
> to say that an atomic compare+exchange synthesized from ll/sc
> instructions is not _an_ (i.e., one) instruction?
> 
> >- so has the same basic cost
> >as acquiring a lock.
> 
> No, that's just wrong. You are conflating the costs of a _software_
> lock, with the costs of a hardware atomic operation, or attempted
> atomic operation.  On some systems (like mips or alpha, which combine
> an ll/sc pair, which may fail, with a software loop to retry), there
> isn't a 1:1 match between the two.
> 
> For a more sophisticated model, you might consider the total costs on
> the memory system, which (may) impact *ALL* processors within a UMA
> domain.  When considering the r4k implementation (which, if memory
> serves, allowed one issued LL per CPU, and one comparator per CPU
> dedicated to LL by snooping the cache-coherency protocol for writes to
> the cacheline holding the LL address: the following SC fails if the
> comparator saw a remote CPU frob the LL-pending line), that model
> gives a much more useful intuition about the costs of synchronization.

Sounds like what I had in mind...
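
To make the ll/sc point concrete, the usual synthesis of a
compare+exchange looks something like this - ll() and sc() stand in
for the load-linked/store-conditional instructions (mips ll/sc,
alpha ldl_l/stl_c), they are not real C functions:

	/*
	 * Compare-and-swap built from ll/sc.  ll() loads the word and
	 * arms the link monitor; sc() stores only if no other CPU has
	 * written the line since the ll, and reports success.
	 */
	int
	cas(volatile int *p, int old, int new)
	{
		int val;

		do {
			val = ll(p);
			if (val != old)
				return 0;	/* mismatch: fail, no write */
		} while (!sc(p, new));		/* link broken: retry */
		return 1;
	}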

> Apologies if I misremember the exact details; it's been... what, a decade?

What I intended to say is that the rmw cycle done by an exchange
instruction (or a compare-and-exchange) is much the same expensive
operation as acquiring a lock (given you expect the lock to be
uncontended).

The lock release is typically free - since it gets snooped by the next
request.
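
A sketch of what I mean, using gcc's __atomic builtins to spell the
hardware exchange - the uncontended acquire is the one rmw cycle, and
the release is just a store:

	typedef volatile int spinlock_t;

	void
	spin_lock(spinlock_t *l)
	{
		/* one atomic exchange - same rmw cost as a lone cas */
		while (__atomic_exchange_n(l, 1, __ATOMIC_ACQUIRE) != 0) {
			while (*l != 0)		/* spin in-cache, no rmw */
				continue;
		}
	}

	void
	spin_unlock(spinlock_t *l)
	{
		/* plain store: the next acquirer's snoop picks it up */
		__atomic_store_n(l, 0, __ATOMIC_RELEASE);
	}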

There is also the issue that with expensive (i.e. slow) hardware rmw
cycles the lock memory location itself becomes a 'hotspot', so you need
finer-grained locking.
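
One standard shape for that, with made-up names throughout (NLOCKS,
the 64-byte line size and obj_lock() are all illustrative):

	#include <stdint.h>

	#define NLOCKS	16

	typedef volatile int spinlock_t;	/* as above */

	/*
	 * Hashed lock table: spread the rmw traffic over NLOCKS
	 * separate cache lines instead of hammering a single word.
	 */
	static struct {
		spinlock_t	lock;
		char		pad[64 - sizeof(spinlock_t)];
	} locktab[NLOCKS];

	static inline spinlock_t *
	obj_lock(const void *obj)
	{
		return &locktab[((uintptr_t)obj >> 6) % NLOCKS].lock;
	}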

But yes, we need per-cpu free lists [1] and per-cpu stats.
I'd go for ditching all these separate pools we currently have in
favour of a decent SMP allocator.
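
The shape I have in mind, with every name invented for the sketch
(MAXCPUS, struct item, slow_alloc()):

	#include <stdint.h>

	#define MAXCPUS	64			/* illustrative */

	struct item {
		struct item	*next;
	};

	struct pcpu_cache {
		struct item	*freelist;
		uint64_t	nalloc;		/* per-cpu stat, unshared */
	};

	static struct pcpu_cache pcache[MAXCPUS];

	struct item *slow_alloc(int);		/* shared-pool refill, not
						 * shown - the path that
						 * 'pumps' items between
						 * cpus, see [1] */

	/*
	 * The common path touches only the local cpu's list: no atomic
	 * rmw at all, just protection from preemption on that cpu.
	 */
	struct item *
	item_alloc(int cpu)
	{
		struct pcpu_cache *pc = &pcache[cpu];
		struct item *it = pc->freelist;

		if (it == NULL)
			return slow_alloc(cpu);
		pc->freelist = it->next;
		pc->nalloc++;
		return it;
	}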

For TCP it is probably enough to get data for separate connections
running in parallel....

	David

[1] and being aware of the code that tends to 'pump' free items from
one cpu to another.....

-- 
David Laight: david@l8s.co.uk