Subject: Re: lock-free data structures
To: None <jonathan@dsg.stanford.edu>
From: David Laight <david@l8s.co.uk>
List: tech-kern
Date: 01/05/2006 22:24:57
On Thu, Jan 05, 2006 at 02:08:54PM -0800, jonathan@dsg.stanford.edu wrote:
>
> ... I think what you are trying to say is something like this:
>
> Peterson's algorithm works only on CPUs with strong memory
> ordering. One example of systems where this assumption does not hold
> is a multiprocessor SPARC configuration with store buffers.
Sounds good....
> >The thing is that for any sort of compare+exchange instruction to be
> >useful it must do a locked bus cycle
>
> I'm boggled. David, have you never heard of ll/sc, or are you trying
> to say that an atomic compare+exchange synthesized from ll/sc
> instructions is not _an_ (i.e., one) instruction?
>
> >- so has the same basic cost
> >at acquiring a lock.
>
> No, that's just wrong. You are conflating the costs of a _software_
> lock, with the costs of a hardware atomic operation, or attempted
> atomic operation. On some systems (like mips or alpha, which combine
> an ll/sc pair, which may fail, with a software loop to retry), there
> isn't a 1:1 match between the two.
>
> For a more sophisticated model, you might consider the total costs on
> the memory system, which (may) impact *ALL* processors within a UMA
> domain. When considering the r4k implementation (which, if memory
> serves, allowed one issued LL per CPU, and one comparator per CPU
> dedicated to LL by snooping the cache-coherency protocol for writes to
> the cacheline holding the LL address: the following SC fails if the
> comparator saw a remote CPU frob the LL-pending line), that model
> gives a much more useful intuition about the costs of synchronization.
Sounds like what I had in mind...
> Apologies if I misremember the exact details; it's been... what, a decade?
What I intended to say is that the rmw cycle done by an exchange instruction
(or a compare-and-exchange) is much the same expensive operation as
acquiring a lock (assuming the lock is uncontended).
The lock release is typically free - since it gets snooped by the next
request.
There is also the issue that with expensive (i.e. slow) hardware rmw cycles
the lock memory location itself becomes a 'hotspot', so you need
finer-grained locking.
But yes we need per-cpu free lists [1], per-cpu stats.
I'd go for ditching all these separate pools we currently have for
a decent SMP allocator.
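A sketch of the per-cpu free list idea, with all names and the fixed CPU
count invented for illustration: the common alloc/free path touches only
this CPU's list, so it takes no shared rmw cycle and no list head becomes
a cross-cpu hotspot.

```c
#include <stdlib.h>

#define NCPU 4          /* assumed fixed CPU count for the sketch */

struct item { struct item *next; };

/* One free list per CPU; only that CPU touches its own list, so no
 * locking is needed on the fast path (the caller must not migrate
 * CPUs mid-operation, e.g. by running preemption-disabled). */
static struct { struct item *head; } percpu_free[NCPU];

static void *alloc_item(int cpu)
{
    struct item *it = percpu_free[cpu].head;
    if (it != NULL) {
        percpu_free[cpu].head = it->next;
        return it;
    }
    /* Local list empty: fall back to the shared (locked) allocator. */
    return malloc(sizeof(struct item));
}

static void free_item(int cpu, void *p)
{
    struct item *it = p;
    it->next = percpu_free[cpu].head;
    percpu_free[cpu].head = it;
}
```

Producer/consumer patterns that allocate on one CPU and free on another
will 'pump' items from one list to the other, which is the caveat in [1];
a real allocator would periodically rebalance overfull lists back to a
shared pool.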
For TCP it is probably enough to get data for separate connections
running in parallel....
David
[1] and being aware of the code that tends to 'pump' free items from
one cpu to another.....
--
David Laight: david@l8s.co.uk