tech-kern archive
Re: membar_enter semantics
> Date: Fri, 11 Feb 2022 15:47:01 -0800
> From: Jason Thorpe <thorpej%me.com@localhost>
>
> My beef with the membar_enter definitional change is with the word
> "acquire". I.e. you want to give it what is called today "acquire"
> semantics. My beef is with how "acquire" is defined, as
> load-before-load/store.
Whatever the name is, do you contend that store-before-load/store is
_useful_? Can you show why? And, can you show an architecture where
it's actually cheaper than membar_sync?
(I can show plenty of examples of where load-before-load/store is
useful -- heck, just search for membar_enter and you'll find some!)
I would rather avoid introducing a proliferation of membar names,
because the more there are, the more confusing the choice is. Having
nicely paired names helps: if you see `membar_exit', that's a hint you
should see a corresponding `membar_enter' -- and if you don't, that
should raise alarm bells in your head.
We could add membar_acquire/release, but `membar_exit' is already
appropriate here. Semantically, generally load/store-before-store
(membar_exit) is appropriately paired with load-before-load/store to
make a happens-before relation that makes programs easy to reason
about.
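To make the pairing concrete, here's a minimal publish/consume sketch; the C11 fences stand in for membar_exit and membar_enter (under the proposed load-before-load/store definition), and the function names are just for illustration:

```c
#include <stdatomic.h>

static int payload;		/* plain data, published via the flag */
static atomic_int ready;

/* Producer side: load/store-before-store (membar_exit) before the
 * publishing store; a C11 release fence is the portable stand-in. */
static void
publish(int v)
{
	payload = v;
	atomic_thread_fence(memory_order_release);	/* ~ membar_exit */
	atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Consumer side: load-before-load/store (membar_enter as proposed)
 * after the observing load; an acquire fence stands in here. */
static int
consume(int *out)
{
	if (!atomic_load_explicit(&ready, memory_order_relaxed))
		return 0;
	atomic_thread_fence(memory_order_acquire);	/* ~ membar_enter */
	*out = payload;
	return 1;
}
```

If the consumer's fence were store-before-load/store instead, nothing would order the load of `ready' before the load of `payload', and the pairing with the producer's release barrier would be broken.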
But store-before-load/store? Raises alarm bells of an incoherent
design or terrible choice like Dekker's algorithm. I contend that
store-before-load/store is not worth naming -- except possibly for the
never-released riscv, we have _zero_ definitions that are cheaper than
membar_sync (and I'm not sure fence w,rw is actually cheaper than
fence rw,rw on any real hardware -- likely isn't), and _zero_ uses.
> v9-PSO -- Because Atomic load-stores ("ldstub" and "casx") are not
> ordered with respect to stores, you would need "membar #StoreStore"
> (in PSO mode, Atomic load-stores are already strongly ordered with
> respect to other loads).
This is not accurate. There is no need for `membar #StoreStore' here,
because, from the other part you quoted about PSO:
Each load and atomic load-store instruction behaves as if it were
followed by MEMBAR with a mask value of 05_16.
LoadLoad = 0x01, LoadStore = 0x04, so LoadLoad|LoadStore = 0x05 or
`05_16'; in other words, this is load-before-load/store. (Confirmed
in Appendix D.5, which spells it out as MEMBAR #LoadLoad|LoadStore.)
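For reference, the V9 mmask bit assignments, spelled here as a C enum for illustration (the MEMBAR_* identifiers are invented; the values are from the spec):

```c
/* SPARC V9 MEMBAR mmask bits; the MEMBAR_* names here are
 * invented for illustration, the values are from the manual. */
enum {
	MEMBAR_LOADLOAD   = 0x01,
	MEMBAR_STORELOAD  = 0x02,
	MEMBAR_LOADSTORE  = 0x04,
	MEMBAR_STORESTORE = 0x08,
};
/* LoadLoad|LoadStore = 0x01|0x04 = 0x05, i.e. the `05_16' mask
 * implied after every load and atomic load-store in PSO mode. */
```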
> Now, because in PSO mode, Atomic load-stores are not strongly
> ordered with respect to stores, in order for the following code to
> work:
>
> mutex_enter();
> *foo = 0;
> result = *bar;
> mutex_exit();
>
> ...then you need to issue a "membar #StoreStore" because the
> ordering of the lock acquisition and the store through *foo is not
> guaranteed without it. But you can also issue a "membar #StoreLoad
> | #StoreStore", which also works in RMO mode.
No membar needed here in PSO because the CAS or LDSTUB in
mutex_enter already implies MEMBAR #LoadLoad|LoadStore without any
explicit instruction. So the CAS/LDSTUB inside mutex_enter
happens-before all loads and stores afterward, namely *foo = 0 and
result = *bar.
In PSO you _do_ need MEMBAR #StoreStore in mutex_exit, even if
mutex_exit uses an atomic r/m/w to unlock the mutex, because the store
*foo = 0 could be delayed until after the atomic r/m/w inside
mutex_exit. That's why, as you said, `atomic load-stores are not
ordered with respect to stores' -- they can be reordered _in one
direction_, which is relevant to mutex_exit but not to mutex_enter.
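In other words, the barrier belongs on the way out, not the way in. A sketch of that unlock path, with a C11 release fence standing in for MEMBAR #StoreStore (function and variable names are illustrative only):

```c
#include <stdatomic.h>

static atomic_uint lock_cell;	/* 1 = held, 0 = free */
static int shared_data;

/* Sketch of a PSO-mode unlock path: the critical-section stores
 * must be ordered before the unlocking store.  The release fence
 * stands in for MEMBAR #StoreStore (it is stronger than needed,
 * but the StoreStore half is the part PSO requires). */
static void
unlock_sketch(void)
{
	shared_data = 123;	/* last store of the critical section */
	atomic_thread_fence(memory_order_release); /* ~ MEMBAR #StoreStore */
	atomic_store_explicit(&lock_cell, 0, memory_order_relaxed);
}
```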
> In other words, it's the **store into the lock cell** that actually
> performs the acquisition of the lock.
No, it's the atomic r/m/w operation as a unit. The operation is
atomic; there's no meaningful separation between the parts.
Even with LL/SC, the only way you can elicit a semantic difference
between the two choices of memory barrier in
ll
...other logic...
sc (repeat if failed)
membar load-before-load/store vs store-before-load/store
is by issuing a load or store in `...other logic...' that is ordered
differently by the barrier. The LL/SC itself functions as a single
atomic memory operation with both a load and a store, and so is
equally ordered by load-before-load/store or store-before-load/store
here.
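The same point in C11 terms: a CAS-based lock acquire is one atomic r/m/w followed by the barrier, and the barrier orders the whole operation against the critical section, not some "store half" of it (sketch only; the acquire fence stands in for membar_enter under the load-before-load/store definition):

```c
#include <stdatomic.h>

static atomic_uint lock_word;	/* 0 = free, 1 = held */

/* Sketch of a CAS-based lock acquire: the compare-and-swap is a
 * single atomic r/m/w, so load-before-load/store after it orders
 * the entire operation before the critical section's memory
 * accesses -- there is no separable load or store to split out. */
static void
lock_sketch(atomic_uint *l)
{
	unsigned expected;
	do {
		expected = 0;
	} while (!atomic_compare_exchange_weak_explicit(l, &expected, 1,
	    memory_order_relaxed, memory_order_relaxed));
	atomic_thread_fence(memory_order_acquire);	/* ~ membar_enter */
}
```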
> In addition to being true on
> platforms that have Atomic load-store (like SPARC), it is also true
> on platforms that have LL/SC semantics (the load in that case
> doesn't mean jack-squat, and the ordering guarantees that the LL has
> are specifically with respect to the paired SC).
[citation needed]
Can you exhibit a program using LL/SC on one of the architectures you
have in mind, such that it behaves differently depending on which
barrier you issue -- and without cheating by using an intermediate
load or store in `...other logic...' that vacuously makes the
difference independent of the LL/SC?
If not, this is all a distinction without a difference -- any
difference boils down to how membar_enter affects memory operations
that _aren't_ atomic r/m/w (or, equivalently, LL/SC). Which brings us
back to: What utility does store-before-load/store have? Very little
in NetBSD, it seems!
Store-before-load ordering is generally needed only in exotic
schemes like Dekker's algorithm, which you don't want to use in
practice anyway, or in early CPU spinup with a busy loop that is
perfectly adequately served by membar_sync or DELAY(). But
load-before-load/store, in contrast, is ubiquitous and important in
performance-critical code.
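For completeness, here's the Dekker-style entry where store-before-load actually matters, as a sketch (the seq_cst fence plays the role of membar_sync, or MEMBAR #StoreLoad on sparc64; names are illustrative):

```c
#include <stdatomic.h>

static atomic_int flag0, flag1;

/* Dekker-style entry for thread 0: store my flag, then load the
 * other thread's.  Without a store-before-load barrier, both
 * threads' loads can be satisfied before either store becomes
 * visible, and both enter the critical section.  Sketch only. */
static int
try_enter_0(void)
{
	atomic_store_explicit(&flag0, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst); /* store-before-load */
	return atomic_load_explicit(&flag1, memory_order_relaxed) == 0;
}
```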