tech-kern archive


Re: Adaptive RW locks



 On 3/10/20, Andrew Doran <ad%netbsd.org@localhost> wrote:
> Following up again..  It turns out the deadlocks I saw with this were down
> to a problem with kernel_lock that has since been solved.
>
> I made a couple more tweaks to it: up the max number of tracked holds to 8
> because vmobjlock became a rwlock, and because aiodoned is gone now be more
> selective giving up and going to sleep in the softint case (walk down the
> list of pinned LWPs on curcpu and see if any of them holds the lock that's
> wanted, if not held on curcpu then some other CPU has it, and it's OK to
> spin).

I think the proposed patch avoidably adds overhead to the most common use
case in order to cover a corner case. Sensibly bounded spinning on readers
can be done speculatively, i.e., without actively tracking whether there
are threads off CPU.

FreeBSD is not completely there yet, but the general idea is pretty
simple: once a writer shows up, it sets a bit (RW_SPINNER or similar) and
then observes whether the read owners are draining, with a pre-defined
upper bound on how long it is willing to wait. There is no spinning if
there are already blocked would-be writers. This may waste some cycles
initially, but the waste is bounded.

As a side note, while waiting for readers to drain, the reader count can
be used to estimate how long to wait before re-reading the lock word,
instead of re-reading it on every pause.
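A minimal sketch of the idea in C11 atomics, with a made-up lock word
layout (RW_SPINNER, a write-owned bit, a reader count; none of the
constants or names come from the patch): the writer announces itself,
spins within a fixed budget while backing off proportionally to the
observed reader count, and gives up once the budget is exhausted.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical lock word layout: low bits are flags, the rest a
 * reader count. */
#define RW_SPINNER	((uint32_t)0x1)	/* a writer is spin-waiting */
#define RW_WRLOCKED	((uint32_t)0x2)	/* write-owned */
#define RW_READ_INCR	((uint32_t)0x4)	/* one reader */

#define SPIN_BUDGET	4096		/* pre-defined upper bound */

static bool
rw_write_spin(_Atomic uint32_t *lock)
{
	uint32_t v = atomic_load_explicit(lock, memory_order_relaxed);

	/* No spinning if a writer already announced itself. */
	if (v & (RW_SPINNER | RW_WRLOCKED))
		return false;
	if (!atomic_compare_exchange_strong(lock, &v, v | RW_SPINNER))
		return false;

	for (int budget = SPIN_BUDGET; budget > 0; budget--) {
		v = atomic_load_explicit(lock, memory_order_relaxed);
		uint32_t readers = v / RW_READ_INCR;
		if (readers == 0) {
			/* Readers drained: claim write ownership and
			 * clear the spinner bit in one go. */
			if (atomic_compare_exchange_strong(lock, &v,
			    RW_WRLOCKED))
				return true;
			continue;
		}
		/* Back off proportionally to the observed reader count
		 * before re-reading the lock word. */
		for (uint32_t i = 0; i < readers && budget > 1; i++)
			budget--;	/* stand-in for cpu_pause() */
	}
	/* Bound hit: stop advertising the spin and let the caller block. */
	atomic_fetch_and(lock, ~RW_SPINNER);
	return false;
}
```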

>  void
> +rw_enter(krwlock_t *rw, const krw_t op)
> +{
> [..]
> +
> +     /*
> +      * Read the lock owner field.  If the need-to-wait
> +      * indicator is clear, then try to acquire the lock.
> +      */
> +     owner = rw->rw_owner;
> +     if (__predict_true((owner & need_wait) == 0)) {
> +             next = rw_cas(rw, owner, (owner + incr) & mask);
> +             if (__predict_true(next == owner)) {
> +                     /* Got it! */
> +                     RW_LOCKED(rw, op);
> +                     KPREEMPT_ENABLE(l);
> +                     RW_MEMBAR_ENTER();
> +                     return;
> +             }
> +     }

Both this and the unlock path lose throughput when the lock is heavily
used by readers.

The original amd64 fast path has a tight loop which immediately retries
the cmpxchg, while the patched variant bails out after the first failure.

Assume the lock is only used by readers. The patched rw_enter will always
conclude it can rw_cas on it. However, the cas can easily fail when
racing against another rw_enter or rw_exit, in which case the code falls
through to the slow path, which starts with a re-read and then enters a
tight loop.

Whether a strict tight loop is a good idea is highly workload-dependent;
I believe it is the best default on amd64, while other architectures may
want some form of pause-equivalent. Nonetheless, the fallback to a
re-read is definitely pessimal and will keep happening under load.
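A sketch of what I mean, with a hypothetical lock word layout
(RW_NEED_WAIT and RW_READ_INCR are made up, not the NetBSD encoding):
on cas failure the observed value is refreshed and the attempt retried in
place, only bailing to the slow path when a writer or waiter is actually
present.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: writer/waiter bits low, reader count above. */
#define RW_NEED_WAIT	((uintptr_t)0x3)	/* write-owned or waiters */
#define RW_READ_INCR	((uintptr_t)0x4)	/* one reader */

static bool
rw_read_enter(_Atomic uintptr_t *lock)
{
	uintptr_t v = atomic_load_explicit(lock, memory_order_relaxed);

	for (;;) {
		if (v & RW_NEED_WAIT)
			return false;	/* genuinely needs the slow path */
		/* On failure the cas refreshes v; retry immediately
		 * rather than punting to a slow-path re-read. */
		if (atomic_compare_exchange_weak_explicit(lock, &v,
		    v + RW_READ_INCR, memory_order_acquire,
		    memory_order_relaxed))
			return true;
	}
}
```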

I don't know if it is worth the effort at this stage, but this will
eventually have to morph into an xadd-based lock. As noted elsewhere,
cmpxchg-based loops degrade under load to a significantly higher degree
(of course everything is already pretty bad if you have a heavily bounced
lock in the first place). Note that the conversion would also take care
of the pause problem.
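For illustration, a minimal xadd-style read acquire under an assumed
layout (RW_WRITER and RW_READ_INCR are made up): the reader increments
unconditionally and only backs out if a writer was present, so contending
readers never fail against each other the way cmpxchg loops do.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: a writer bit low, reader count above it. */
#define RW_WRITER	((uint32_t)0x1)
#define RW_READ_INCR	((uint32_t)0x4)

static bool
rw_read_enter_xadd(_Atomic uint32_t *lock)
{
	/* Unconditionally add a reader, then inspect the previous value.
	 * The fetch-add always makes forward progress under contention,
	 * unlike a cmpxchg retry loop. */
	uint32_t old = atomic_fetch_add_explicit(lock, RW_READ_INCR,
	    memory_order_acquire);
	if ((old & RW_WRITER) == 0)
		return true;
	/* A writer holds the lock: back the increment out and take the
	 * slow path. */
	atomic_fetch_sub_explicit(lock, RW_READ_INCR, memory_order_release);
	return false;
}
```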

I was looking at doing this in FreeBSD, but there are some constraints
which pose important design questions, and to my understanding your code
shares most of them (if not all).

Blindly adding into the lock word means there is no room to store the
entire thread pointer (an add could partially overwrite the address), but
the owner still needs to be findable in some manner in order to lend it
priority.

Assuming 32-bit architectures can stick to the current code, there are
some options that avoid growing the lock.

One variant is "compressing" the pointer: if all thread structures are
guaranteed to come from a pre-defined VA range, the lock need only store
the relevant portion of the address, which can be quite small.
Alternatively, there can be a lockless hash of threads keyed by thread
id. Or a variant of the off-CPU marking you implemented here can be used,
except by the writer: it can denote in the lock word that it went off CPU
with the relevant turnstile lock held, and then everyone waiting can look
up the owner based on the lock.
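To illustrate the compression variant, here is a sketch under assumed
parameters (the VA base, alignment, and flag width are all made up, not
taken from either kernel): only the scaled offset into the thread VA
range is stored, leaving the low bits of the word free for flags and the
high bits free to absorb reader-count adds.

```c
#include <stdint.h>

/* Hypothetical parameters: thread structures are allocated from a fixed
 * VA range with at least 64-byte alignment, so the lock word only needs
 * the scaled offset. */
#define THREAD_VA_BASE	((uintptr_t)0x10000000)
#define THREAD_ALIGN	((uintptr_t)64)
#define OWNER_SHIFT	6			/* low 6 bits left for flags */
#define OWNER_FLAGS	((uintptr_t)0x3f)

static inline uintptr_t
owner_encode(uintptr_t td, uintptr_t flags)
{
	/* Scale the offset down by the alignment, then shift it above
	 * the flag bits. */
	return (((td - THREAD_VA_BASE) / THREAD_ALIGN) << OWNER_SHIFT) |
	    flags;
}

static inline uintptr_t
owner_decode(uintptr_t word)
{
	/* Recover the full thread address from the compressed offset. */
	return THREAD_VA_BASE + (word >> OWNER_SHIFT) * THREAD_ALIGN;
}
```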

Of course one can also "give up" and add another word for the pointer, at
which point this mostly degrades to the usual read-write lock problems.

-- 
Mateusz Guzik <mjguzik gmail.com>


