tech-kern archive
Re: membar_enter semantics
> Date: Tue, 15 Feb 2022 12:54:09 +0100
> From: Edgar Fuß <ef%math.uni-bonn.de@localhost>
>
> If there's a widely adopted terminology, one should probably stick
> to it even if the wording is counter-intuitive or misleading (but
> note that fact in the documentation). After all, Simple Groups are
> not easy at all and you need to know about Galois Theory to
> understand why Solvable Groups are named that way.
The C11 terminology -- which is widely used in the literature these
days -- is store-release and load-acquire operations:
	store-release(p, v) is roughly *p = v, and
	u = load-acquire(p) is roughly u = *p,
with the additional property that if
1. one thread does some memory operations A, and then
2. issues store-release(p, v); and
3. another thread doing u = load-acquire(p) witnesses that store, and
4. proceeds to do some other memory operations B,
then all the memory operations A happen before all the memory
operations in B -- even if they were in different threads, on
different CPUs, with crazy relaxed memory models like SPARC RMO or
Alpha.
Store-release is like a load/store-before-store barrier and then a
store, and load-acquire is like a load and then load-before-load/store
barrier. Except, because store-release and load-acquire only impose
ordering relative to a _specific_ store or load, they may be cheaper
than that (e.g., on Arm, there are LDAR and STLR instructions for
individual load-acquire and store-release operations on a specific
address).
C11 introduces atomic operations with an argument specifying memory
ordering, so you can write, e.g.,
	use_stuff(obj->data);
	/* obj->refcnt += -1 */
	atomic_fetch_add_explicit(&obj->refcnt, -1,
	    memory_order_release);
in one thread to drop a reference count as a store-release; then,
another thread watching to see if it has gone to zero can use a
load-acquire to make sure all memory operations done by users of the
object happen before all memory operations used to free resources
connected to it:
	if (atomic_load_explicit(&obj->refcnt, memory_order_acquire) == 0) {
		clear(obj->data);
		free_resources(obj->data);
	}
In these fragments, if the store is observed by the load, then we're
guaranteed that
	use_stuff(obj->data);
happens before
	clear(obj->data);
	free_resources(obj->data);
(These fragments are just to illustrate reasoning about ordering --
there's generally going to be a lot more to a real-world program doing
reference-counting, like a lock and a condvar around dropping the last
reference or something.)
If I were naming NetBSD's membars today, I might choose the names
membar_acquire for load-before-load/store and membar_release for
load/store-before-store.
But we've already had the names membar_enter and membar_exit for a
while, and they've gathered a lot of definitions and uses...and it
_turns out_ that, despite the word `store' in the man page,
membar_enter was always implemented (except on riscv, which has never
shipped) _and_ used (except in one place, where no barrier was needed
anyway) as load-before-load/store, not as store-before-load/store.
(Of course, atomic-r/m/w-then-load/store is ordered equally by
load-before-load/store or store-before-load/store, but on every
CPU architecture I know with the distinction, load-before-load/store
is cheaper than store-before-load/store -- and load-before-load/store
is _useful_ for non-atomic-r/m/w uses, while store-before-load/store
is not.)
So I think we should just fix the man page -- and spend any
_engineering_ effort on adopting C11 atomics.
> When I read frozz_enter() and frozz_exit() in code, my expectation
> is that every call to enter is paired with a call to exit _in the
> control flow_, i.e., there's no (other than panic) code path that
> goes through one of them, but not the other.
This is the case for the memory barriers issued by mutexes:
mutex_enter essentially implies membar_enter, and mutex_exit
essentially implies membar_exit. So if you write another kind of
lock, say with atomic_cas, that's what you'll want to use:
	myspinlock_enter()
	{
		while (atomic_cas(&lock, 0, 1) != 0)
			continue
		membar_enter()
	}

	myspinlock_exit()
	{
		membar_exit()
		lock = 0
	}
But the important thing about memory barriers is not so much that
_one_ sequential fragment of code looks like this (if you look under
the hood of what mutex_spin_enter/exit essentially do):
	while (atomic_cas(&lock, 0, 1) != 0)
		continue
	membar_enter()
	...memory operations on data...
	membar_exit()
	lock = 0
Rather, what's really important is that the code running on _two CPUs_
looks like this:
	// cpu0				// cpu1
	...memory operations A...
	membar_exit()
	lock = 0
					while (atomic_cas(&lock, 0, 1) != 0)
						continue
					membar_enter()
					...memory operations B...
In other words, memory barriers aren't really about ordering _two_
memory operations in _one_ thread -- they're about ordering _four_
operations in _two_ threads:
1. memory operations A, on cpu0
2. lock = 0, on cpu0
3. atomic_cas(&lock, 0, 1), on cpu1
4. memory operations B, on cpu1
Our real goal here (other than the mutual exclusion part of a mutex)
is to ensure that memory operations A and memory operations B happen
in that order. And it's only when _both threads_ issue matching
memory barriers that everything works out.
> Would it make sense to call the intended-usage aliases something like
> push/pull, provide/consume or publish/whatever?
Well, what we have now are:
	membar_enter and membar_exit
	membar_consumer and membar_producer
I think these are pretty good pairs, as far as English words go. My
only issue with them right now is the text about membar_enter in the
man page -- it disagrees with 15 years of practice in NetBSD _and_
with sensible algorithm design.
(There's also membar_sync, of which there are (as far as I can tell)
zero legitimate uses in-tree, because almost nothing ever needs
store-before-load ordering except exotic kludges like Dekker's
algorithm -- almost everything is satisfied by release/acquire
ordering or less.)