tech-kern: Re: lock-free data structures

Subject: Re: lock-free data structures
To: David Laight <david@l8s.co.uk>
From: None <jonathan@dsg.stanford.edu>
List: tech-kern
Date: 01/05/2006 14:08:54
In message <20060105213558.GD12021@snowdrop.l8s.co.uk>,
David Laight writes:

>On Tue, Jan 03, 2006 at 10:12:34PM +1030, Brett Lymn wrote:
>> On Mon, Jan 02, 2006 at 06:03:11PM -0800, Garrett D'Amore wrote:
>> > 
>> > Isn't this the "normal" way of handling locking on SMP class
>> > machines? 
>> >
>> 
>> Well "normal" for handling locking - you have a basic instruction that
>> can perform an "atomic" instruction.  You can use a thing called
>> Peterson's Algorithm to perform the atomic lock without requiring
>> underlying hardware support.
>
>If that is the algorithm I think it might be, it doesn't work on all
>architectures.  Specificly it doesn't work on most sparc cpus.

... I think what you are trying to say is something like this:

Peterson's algorithm works only on CPUs with strong memory
ordering. One example of systems where this assumption does not hold
is SPARC systems where Peterson's algorithm doesn't work are
multiprocessor configurations with store buffers.

For an authoritative disscussion of the SPARC issues, see:
http://docs.sun.com/app/docs/doc/816-5137/6mba5vpl5?a=view

>The thing is that for any sort of compare+exchange instruction to be
>useful it must do a locked bus cycle 

I'm boggled. David, Have you never heard of ll/sc, or are you trying
to say that an atoomic compare+exchange synthesized from ll/sc
instructions is not _an_ (i.e., one) instruction?

>- so has the same basic cost
>at acquiring a lock.  

No, that's just wrong. You are conflating the costs of a _software_
lock, with the costs of a hardware atomic operation, or attempted
atomic operation.  On some systems (like mips or alpha, which combine
an ll/sc pair, which may fail, with a software loop to retry), there
isn't a 1:1 match between the two.

For a more sophisticated model, you might consider the total costs on
the memory system, which (may) impact *ALL* processors within a UMA
domain.  When considering the r4k implementation (which, if memory
serves, allowed one issued LL per CPU, and one comparator per CPU
dedicated to LL by snooping the cache-coherency porotcol for writes to
the cacheline hondling the LL address: the following SC fails if the
comparator saw a remote CPU frob the LL-pending line), that model
gives a much more useful intution about the costs of synchronization.

Apologies if I misremember the exact details, its been... what, a decade?