Subject: Re: lock-free data structures
To: Ignatios Souvatzis <ignatios@cs.uni-bonn.de>
From: None <jonathan@dsg.stanford.edu>
List: tech-kern
Date: 01/04/2006 09:38:20
In message <20060104170125.GA9548@cs.uni-bonn.de>,
Ignatios Souvatzis writes:

>On Wed, Jan 04, 2006 at 04:08:13PM +0000, Michael van Elst wrote:
>> ignatios@cs.uni-bonn.de (Ignatios Souvatzis) writes:
>> 
>> >So... as the Amiga is a single-CPU machine (well, as supported by
>> >NetBSD) would it just work?
>> 
>> I guess so. But using BSET/BCLR instead will definitely work
>> while CAS could possibly fail.
>
>Hm.
>
>The problem is - with bset/bclr you can't implement list
>insertion/deletion directly, which was what triggered this thread
>above.  Well, you can indirectly by using bset/bclr to implement a
>lock.
>
>(Btw, I remember an old paper about using CAS2 for more complicated
>structure manipulations - this would be slow on 68060 machines, as CAS2
>(and unaligned CAS) trap on the '060 and have to be emulated.)

You're probably remembering either Massalin and Pu (Synthesis kernel),
or work by Michael Greenwald:

	Greenwald, M. Non-Blocking Synchronization and System Design.
	PhD thesis, Stanford University Technical Report
	STAN-CS-TR-99-1624, Stanford, CA 94305.

"Slow" was an understatement.  Michael's implementation (and the V++
kernel/cachekernel) ran on 25MHz 68040s. CAS2 stalled the entire
pipeline, dithered internally for hundreds of cycles, issued the
memory ops, then dithered internally for more cycles.

After a little thought, it should become obvious that CAS is simply
not adequate for efficient lock-free synchronization on data
structures more complicated than a singly-linked list. (Yes, solutions
like Herlihy's work exist, but they are not sufficiently efficient for
use in an OS kernel.)

For practical kernel use, one will soon find one *needs* a DCAS/CAS2,
or some suitable approximation. 

One approximation that's moderately well-known by practitioners in the
field is to do a ptrdiff_t-sized CAS, where the ptrdiff_t is
interpreted _not_ as an address, but as an index into some type-stable
array of actual pointers (or data structures).
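
Roughly, the shape of that approximation is the following (a sketch of
mine, just for illustration: the pool and index names are made up, and
C11 atomics stand in for whatever word-sized CAS the platform actually
provides):

	/*
	 * Sketch only: a lock-free stack whose head is a ptrdiff_t index
	 * into a type-stable node array, so the CAS operates on a small
	 * integer rather than on a raw pointer.
	 */
	#include <stdatomic.h>
	#include <stddef.h>

	#define POOL_SIZE 1024
	#define NIL ((ptrdiff_t)-1)

	struct node {
		int       value;
		ptrdiff_t next;		/* index of next node, or NIL */
	};

	static struct node node_pool[POOL_SIZE];   /* type-stable storage */
	static _Atomic ptrdiff_t head_idx = NIL;   /* CAS target is an index */

	static void
	push(ptrdiff_t n)		/* n is an index into node_pool[] */
	{
		ptrdiff_t old;

		do {
			old = atomic_load(&head_idx);
			node_pool[n].next = old;
		} while (!atomic_compare_exchange_weak(&head_idx, &old, n));
	}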

Given a double-address, double-operand CAS, one can bundle up ``more
work'', as described in, e.g., Greenwald's dissertation.
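
For concreteness, here is the sort of thing ``bundling up more work''
buys you; this is only a sketch, with a hypothetical dcas() primitive,
not code from the dissertation.  One DCAS swings a list head and bumps
a generation count in the same atomic step, which a lone CAS on the
head cannot do without running into ABA:

	/*
	 * Sketch only.  dcas() is hypothetical: atomically, if *a1 == o1
	 * and *a2 == o2, store n1 and n2 and return 1; else return 0.
	 */
	#include <stdint.h>

	extern int dcas(volatile uintptr_t *a1, uintptr_t o1, uintptr_t n1,
	                volatile uintptr_t *a2, uintptr_t o2, uintptr_t n2);

	struct node {
		struct node *next;
		int          value;
	};

	static volatile uintptr_t list_head;	/* really a struct node * */
	static volatile uintptr_t list_gen;	/* bumped on every pop */

	static struct node *
	pop(void)
	{
		struct node *old;
		uintptr_t    gen;

		for (;;) {
			old = (struct node *)list_head;
			gen = list_gen;
			if (old == NULL)
				return NULL;
			/* Swing the head *and* bump the generation count
			 * in one atomic step; the count is the bundled
			 * ``extra work'' that defeats ABA. */
			if (dcas(&list_head, (uintptr_t)old,
			    (uintptr_t)old->next,
			    &list_gen, gen, gen + 1))
				return old;
		}
	}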


One more part of that body of work was my invention of LLP/SCP:

	ll ...
	llp
	scp
	sc

The llp and scp are "pipelined" load-linked/store-conditional: either
both scp and sc atomically succeed, or both fail. This architectural
extension allows one to synthesize a DCAS [IBM-speak] or CAS2 [68k].
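
In rough pseudo-C, with ll()/llp()/scp()/sc() as hypothetical stand-ins
for the four instructions above, and assuming only the property just
stated (scp and sc commit together or not at all), the synthesized DCAS
looks like:

	/* Sketch only; the four ``intrinsics'' are hypothetical. */
	#include <stdint.h>

	extern uintptr_t ll(volatile uintptr_t *a);	/* load-linked */
	extern uintptr_t llp(volatile uintptr_t *a);	/* load-linked, pipelined */
	extern void      scp(volatile uintptr_t *a, uintptr_t v);
	extern int       sc(volatile uintptr_t *a, uintptr_t v); /* 1 = both stores hit */

	static int
	dcas(volatile uintptr_t *a1, uintptr_t o1, uintptr_t n1,
	     volatile uintptr_t *a2, uintptr_t o2, uintptr_t n2)
	{
		for (;;) {
			if (ll(a1) != o1)
				return 0;	/* first compare failed */
			if (llp(a2) != o2)
				return 0;	/* second compare failed */
			scp(a2, n2);		/* queued; commits with sc */
			if (sc(a1, n1))
				return 1;	/* both stores landed atomically */
			/* lost the reservation: retry */
		}
	}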

Michael and I gave a presentation on this idea to SGI, who intended to
implement it in the (then-called) r20k; but after that project was
cancelled it never saw the light of day.  An abstract of the talk is
still available at, um,
http://www-dsg.stanford.edu/michaelg/abstracts/97sgi.abstract

Greg Chesson and his team used a similar trick in IRIX. That trick
relied on non-architecturally-mandated aspects of the LL/SC
implementation on the r4k and later chips (viz., that the per-CPU
indivisibility of those synchronization instructions applied to entire
cache lines, not to the strict address). Essentially, that trick
combined LL/SC with modification of another part of the same cache
line inside the ll/sc loop, *knowing* that the ll/sc would effectively
protect other writes to the same cacheline.
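
Reconstructed as a sketch (not IRIX source; ll() and sc() are
hypothetical wrappers for the MIPS instructions, and the whole thing
leans on the non-architectural cache-line behaviour just described),
the trick looks roughly like:

	/*
	 * Sketch only: assumes sc fails iff some other CPU touched the
	 * reserved line, and that the local store to aux below does not
	 * clear the reservation -- neither of which the architecture
	 * promises.
	 */
	#include <stdint.h>

	extern uintptr_t ll(volatile uintptr_t *a);
	extern int       sc(volatile uintptr_t *a, uintptr_t v);

	struct covered {
		volatile uintptr_t word;	/* the ll/sc target */
		volatile uintptr_t aux;		/* same cache line by layout */
	} __attribute__((aligned(64)));

	static void
	bump(struct covered *c, uintptr_t stamp)
	{
		uintptr_t v;

		do {
			v = ll(&c->word);	/* start the reservation */
			c->aux = stamp;		/* plain store to the same line */
		} while (!sc(&c->word, v + 1));	/* commit the counter bump */
		/* If the sc succeeded, the assumption is that nobody else
		 * wrote the line since the ll, so word and aux changed
		 * ``together'' from any other CPU's point of view. */
	}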

(Burying such tricks deep inside the OS is the kind of thing which
causes subsequent microprocessor architects heart trouble, which is
one reason the LLP/SCP was well-received by [then] MIPS implementors.)

Oh, one more thing: to suggest using ras(9) is proof-positive that the
suggester simply doesn't understand the problem domain.  Bershad's
work on restartable atomic sequences, of which ras(9) is a
re-implementation, gives atomic semantics to *USERSPACE PROCESSES*,
implemented by adding "check for RAS collision during trap-to-OS"
hooks and PC-unwind in one (or a very, very small integer number) of
codepaths. Thus, RAS is not suitable for use *inside* a kernel.

As a historical note, the main motivation for Brian Bershad's RAS work
was the availability of (at the time) fast, inexpensive R3K-based
machines, which had no support for atomic instructions at *all*.  RAS
was a big win for userspace threading or synchronization, compared to
the alternative of trapping into the kernel to execute an atomic
operation with interrupts blocked.