Subject: Re: lock-free data structures
To: None <tech-kern@netbsd.org>
From: Joe Seigh <jseigh_02@xemaps.com>
List: tech-kern
Date: 01/05/2006 19:41:44
(2nd attempt to post this)

zvrba@globalnet.hr wrote:

>
> Sorry if I'm a bit off-topic, but this is the only kernel-developers
> mailing list that I'm subscribed to. The question is a practical one and
> I believe that kernel developers can give the most informed answer.
>
> Recently I've come across many papers describing the construction of
> lock-free data structures using compare&exchange instructions. Such
> implementations do not require explicit locks (mutexes) in an MP
> environment. The advantages are more (no explicit locking needed) or
> less (performance gains) obvious, so the question arises: why aren't
> such data structures used more widely (in general and in the kernel
> code?)
>
> I know that Linux has adopted RCU, but again the example code for the
> linked list traversal calls rcu_read_lock() / rcu_read_unlock() (thus,
> not fulfilling the promise of lock-free code). Why didn't BSDs adopt RCU?
> [ Please, I'm not trying to start a Linux vs. BSD war! This is a simple and
> honest question related to my general inquiry. ]


rcu_read_lock() / rcu_read_unlock() are AFAIK dummy macro's, in the
non-preemptive kernel anyway, for documentation/code maintenance purposes.
The "lock" is in name only.

As mentioned elsewhere, the problem with RCU is that it is patented
except for one lapsed patent, 4,809,168.  You can implement RCU
using that but you'd have to be careful as IBM's lawyers are likely
to interpret the claims on the lapsed patent narrowly and the claims
of their active patents broadly.  And IBM's lawyers are bigger than
your lawyers.

>
> Bottom line: what's the "catch" with lock-free data structures?
>
> An immediate drawback is that not all architectures support the cmpxchg
> instruction. What are others? Is the possible performance gain negligible
> compared to the required effort of recoding a large volume of code?
>

For some lock-free algorithms you need interlocked instructions.  For
some like RCU you only need atomic stores and loads with dependent
acquire and release memory semantics.

There's some lock-free stuff I did here
http://atomic-ptr-plus.sourceforge.net/
There's fastsmr, an RCU+SMR userspace implementation, which runs on
32bit x86 Linux, 32 bit ppc OSX, and 32bit sparc Solaris.  Besides
the patent problems with RCU, there's a patent application in on
SMR hazard pointers.

The rcpc, reference counted proxy collector, is probably better since
it has no patent entanglements that I'm aware of.  It needs double wide
compare and swap (e.g. cmpxchg8b or cmpxchg16b), load locked/store conditional,
or 64 bit or larger single wide compare and swap, which probably
covers every architecture that matters except for some embedded
processors which are probably running on a single processor where
you can simulate interlocked instructions if you are running disabled.

RCU and the proxy stuff are reader lock-free.  You use them for what
I call PCOW (partial copy on write) and you need conventional mechanisms
for write access.  They're really forms of GC and are used the same
way you'd use reader lock-free in Java with JSR-133 volatile supplying
the acquire and release memory semantics.  You can use PCOW for things
besides linked lists, e.g. hash tables or trees.  I have some lock-free
b-tree code I prototyped in Java so it is possible.

-- 
Joe Seigh