NetBSD-Bugs archive


kern/57240: Missing store-before-store barriers in cpu_switchto on non-TSO MP platforms



>Number:         57240
>Category:       kern
>Synopsis:       Missing store-before-store barriers in cpu_switchto on non-TSO MP platforms
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Feb 20 20:20:00 +0000 2023
>Originator:     Taylor R Campbell
>Release:        current
>Organization:
The NetBSD Foundation
>Environment:
slowly suffocating on carbon gases from humanity's intoxication with fossil fuel
>Description:
The mutex(9) implementation is designed to let mutex_exit run without atomic read/modify/write (CAS, LL/SC), as long as the sequence that tests for waiters and stores to release the lock can't be preempted.

This avoids a substantial cost in uncontended mutex_exit, but requires some care both in the sleeping path of mutex_enter (specifically, mutex_vector_enter) and in the cpu_switchto path which changes which thread is running on a CPU.
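
For orientation, the uncontended release this design aims for looks roughly like the following C sketch (a sketch only, not the actual code; the real fast path lives in the per-machine mutex_exit stubs and also checks ownership):

	/*
	 * Sketch: release with a plain store, provided nothing can
	 * preempt us between the test for waiters and the store.
	 */
	if ((mtx->mtx_owner & MUTEX_BIT_WAITERS) == 0)
		mtx->mtx_owner = 0;		/* no waiters: plain store releases */
	else
		mutex_vector_exit(mtx);		/* waiters: take the slow path */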

The sleeping path of mutex_vector_enter relies on being able to tell whether the owner of the CPU will definitely notice that there are waiters, in which case sleeping is safe; or whether the owner may be on the CPU, may have failed to notice the waiters, and may have clobbered the waiters bit in releasing the lock, in which case sleeping may deadlock and another iteration of spinning is necessary.

Specifically, mutex_vector_enter will set the waiters bit (with an atomic read/modify/write), and then try to discern which of the six cases the owner may be in, around a possible thread-switch followed by mutex_exit (case 1) or mutex_exit followed by thread-switch (case 2):

1. (a) [cpu_switchto] ci->ci_curlwp = owner; (b) [mutex_exit] mtx->mtx_owner = 0
2. (a) [mutex_exit] mtx->mtx_owner = 0; (b) [cpu_switchto] ci->ci_curlwp = otherlwp

The six cases are before 1(a), between 1(a) and 1(b), or after 1(b); or before 2(a), between 2(a) and 2(b), or after 2(b).

mutex_vector_enter loads:

1. owner := load(&mtx->mtx_owner) & MUTEX_THREAD
2. oncpu := load(&owner->l_cpu->ci_curlwp) == owner
3. haswaiters := load(&mtx->mtx_owner) & MUTEX_BIT_WAITERS

These loads are ordered with membar_consumer (load-before-load).  The corresponding stores, in mutex_exit and cpu_switchto, must be ordered by membar_producer or equivalent (store-before-store).
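
In rough C terms (a sketch, not kern_mutex.c verbatim), the waiter's side is:

	struct lwp *owner;
	bool oncpu, haswaiters;

	owner = (struct lwp *)(mtx->mtx_owner & MUTEX_THREAD);		/* load 1 */
	membar_consumer();						/* load-before-load */
	oncpu = (owner->l_cpu->ci_curlwp == owner);			/* load 2 */
	membar_consumer();						/* load-before-load */
	haswaiters = (mtx->mtx_owner & MUTEX_BIT_WAITERS) != 0;	/* load 3 */

Each membar_consumer here must pair with a store-before-store barrier between the matching stores on the owner's side; those stores live in mutex_exit and cpu_switchto, and the missing pairings are what this report is about.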

On x86 and any other architecture with TSO (which includes all sparc, as far as I'm aware), no explicit barriers are needed: TSO already provides load-before-load and store-before-store ordering, and the only ordering it does not provide without an explicit barrier, store-before-load, is not one this scheme relies on.

On arm32 and aarch64, the barrier between 1(a) and 1(b) is currently in cpu_switchto; all other non-TSO MP architectures appear to be missing it.

On aarch64, it turns out that there is no need for an explicit barrier between 1(a) and 1(b), because the store to set mtx->mtx_owner = 0 only happens in one of three ways:

- with stlxr, which implies the necessary ordering anyway in a single nonpreemptible instruction that simultaneously serves as a barrier and a store;
- at splhigh (and therefore nonpreemptible) in MUTEX_RELEASE, which issues membar_release (enough for store-before-store) and then stores mtx->mtx_owner = 0 (sketched after this list); or
- under the mutex's turnstile lock (and therefore serialized with mutex_vector_enter) in MUTEX_RELEASE.
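
The second of those paths is roughly the following (a sketch of the intent, not the literal MUTEX_RELEASE/mutex_vector_exit code):

	int s;

	s = splhigh();		/* no preemption between barrier and store */
	membar_release();	/* orders earlier stores, including cpu_switchto's
				 * store to ci_curlwp, before the releasing store */
	mtx->mtx_owner = 0;	/* release the lock */
	splx(s);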

Finally, all non-TSO MP architectures (aarch64, alpha, arm32, ia64, mips, powerpc, riscv) appear to be missing the barrier between 2(a) and 2(b).

P.S.  On all MP architectures except x86, the mutex_exit stub uses an atomic CAS or LL/SC anyway.  (x86 uses CMPXCHG, but not LOCK CMPXCHG, so it can't be interrupted in the middle but isn't an atomic r/m/w as witnessed by other CPUs.)  And normally, mutex_vector_exit only does MUTEX_RELEASE under the mutex's turnstile lock.  So the only path actually affected by the missing barriers is the path in mutex_vector_exit under #ifdef LOCKDEBUG, which uses splhigh to avoid preemption instead of taking the mutex's turnstile lock (which would presumably sprout large volumes of useless lockdebug data).

Perhaps we could get a performance win by teaching these architectures to use RAS instead of atomic CAS or LL/SC in their mutex_exit stubs.  We could alternatively nix the splhigh/MUTEX_RELEASE path under #ifdef LOCKDEBUG in mutex_vector_exit, but that might severely damage the already-abysmal performance of LOCKDEBUG; or we could put the barriers in cpu_switchto under #ifdef LOCKDEBUG only, but further deviation between LOCKDEBUG and !LOCKDEBUG kernels might damage the value of LOCKDEBUG as a diagnostic tool, and would prevent a potential RAS optimization in mutex_exit, which is generally a much hotter path than cpu_switchto.
>How-To-Repeat:
something about running /usr/games/worms for hours under LOCKDEBUG, according to mlelstv@, who I can only conclude must be very bored
>Fix:
Change each cpu_switchto routine so that, when it sets the local CPU's struct cpu_info::ci_curlwp, it issues store-before-store barriers on both sides (sketched after the list):

1. membar_producer();
2. ci->ci_curlwp = newlwp;
3. membar_producer();
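
In C-level terms the intent is roughly the following (a sketch only; each MD cpu_switchto does this in assembly):

	membar_producer();	/* (1) order the old lwp's mutex_exit store
				 * (case 2(a)) before the ci_curlwp store */
	ci->ci_curlwp = newlwp;	/* (2) the store cases 1(a) and 2(b) refer to */
	membar_producer();	/* (3) order the ci_curlwp store before any
				 * mutex_exit store by newlwp (case 1(b)) */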

Add comments explaining what's up and cross-referencing kern_mutex.c in more than the cryptic level of detail they currently have.

Exceptions for optimization:
- When entering a softint for the first time, (3) may be unnecessary because no mutexes can be held by that softint lwp, and softint lwps can't (currently?) be preempted (at most, they can only voluntarily sleep on locks).
- On aarch64, no barrier may be needed at (3) because mutex_exit always issues a barrier and store nonpreemptibly, either with an STLXR instruction, at splhigh, or with the mutex's turnstile lock held.

Update the comments in kern_mutex.c so that they are written in terms of program ordering semantics and matching load-before-load and store-before-store barriers, not in terms of obscure implementation details like acquiring and releasing cache lines, and so that they cross-reference cpu_switchto.  Any exceptions should be documented both where the barrier would have gone and where the logic lives that renders that barrier unnecessary.

Memorialize this scheme in the wiki or something and add links to it in the code.


