tech-kern archive


Re: membar_enter semantics



> Date: Fri, 11 Feb 2022 15:47:01 -0800
> From: Jason Thorpe <thorpej%me.com@localhost>
> 
> My beef with the membar_enter definitional change is with the word
> "acquire".  I.e. you want to give it what is called today "acquire"
> semantics.  My beef is with how "acquire" is defined, as
> load-before-load/store.

Whatever the name is, do you contend that store-before-load/store is
_useful_?  Can you show why?  And, can you show an architecture where
it's actually cheaper than membar_sync?

(I can show plenty of examples of where load-before-load/store is
useful -- heck, just search for membar_enter and you'll find some!)

I would rather avoid introducing a proliferation of membar names,
because the more there are, the more confusing the choice is.  Having
nicely paired names helps: if you see `membar_exit', that's a hint you
should see a corresponding `membar_enter' -- and if you don't, that
should raise alarm bells in your head.

We could add membar_acquire/release, but `membar_exit' is already
appropriate here.  Semantically, generally load/store-before-store
(membar_exit) is appropriately paired with load-before-load/store to
make a happens-before relation that makes programs easy to reason
about.

But store-before-load/store?  Raises alarm bells of an incoherent
design or terrible choice like Dekker's algorithm.  I contend that
store-before-load/store is not worth naming -- except possibly for the
never-released riscv, we have _zero_ definitions that are cheaper than
membar_sync (and I'm not sure `fence w,rw' is actually cheaper than
`fence rw,rw' on any real hardware -- likely isn't), and _zero_ uses.

> v9-PSO -- Because Atomic load-stores ("ldstub" and "casx") are not
> ordered with respect to stores, you would need "membar #StoreStore"
> (in PSO mode, Atomic load-stores are already strongly ordered with
> respect to other loads).

This is not accurate.  There is no need for `membar #StoreStore' here,
because, from the other part you quoted about PSO:

   Each load and atomic load-store instruction behaves as if it were
   followed by MEMBAR with a mask value of 05_16.

LoadLoad = 0x01, LoadStore = 0x04, so LoadLoad|LoadStore = 0x05 or
`05_16'; in other words, this is load-before-load/store.  (Confirmed
in Appendix D.5, which spells it out as MEMBAR #LoadLoad|LoadStore.)

> Now, because in PSO mode, Atomic load-stores are not strongly
> ordered with respect to stores, in order for the following code to
> work:
> 
> 	mutex_enter();
> 	*foo = 0;
> 	result = *bar;
> 	mutex_exit();
> 
> ...then you need to issue a "membar #StoreStore" because the
> ordering of the lock acquisition and the store through *foo is not
> guaranteed without it.  But you can also issue a "membar #StoreLoad
> | #StoreStore", which also works in RMO mode.

No membar needed here in PSO because the CAS or LDSTUB in
mutex_enter already implies MEMBAR #LoadLoad|LoadStore without any
explicit instruction.  So the CAS/LDSTUB inside mutex_enter
happens-before all loads and stores afterward, namely *foo = 0 and
result = *bar.

In PSO you _do_ need MEMBAR #StoreStore in mutex_exit, even if
mutex_exit uses an atomic r/m/w to unlock the mutex, because the store
*foo = 0 could be delayed until after the atomic r/m/w inside
mutex_exit.  That's why, as you said, `atomic load-stores are not
ordered with respect to stores' -- they can be reordered _in one
direction_, which is relevant to mutex_exit but not to mutex_enter.

> In other words, it's the **store into the lock cell** that actually
> performs the acquisition of the lock.

No, it's the atomic r/m/w operation as a unit.  The operation is
atomic; there's no meaningful separation between the parts.

Even with LL/SC, the only way you can elicit a semantic difference
between the two choices of memory barrier in

   ll
   ...other logic...
   sc (repeat if failed)
   membar load-before-load/store vs store-before-load/store

is by issuing a load or store in `...other logic...' that is ordered
differently by the barrier.  The LL/SC itself functions as a single
atomic memory operation with both a load and a store, and so is
equally ordered by load-before-load/store or store-before-load/store
here.

>                                        In addition to being true on
> platforms that have Atomic load-store (like SPARC), it is also true
> on platforms that have LL/SC semantics (the load in that case
> doesn't mean jack-squat, and the ordering guarantees that the LL has
> are specifically with respect to the paired SC).

[citation needed]

Can you exhibit a program using LL/SC on one of the architectures you
have in mind, such that it behaves differently depending on which
barrier you issue -- and without cheating by using an intermediate
load or store in `...other logic...' that vacuously makes the
difference independent of the LL/SC?

If not, this is all a distinction without a difference -- any
difference boils down to how membar_enter affects memory operations
that _aren't_ atomic r/m/w (or, equivalently, LL/SC).  Which brings us
back to: What utility does store-before-load/store have?  Very little
in NetBSD, it seems!

Store-before-load ordering is generally only needed in weird exotic
schemes like Dekker's algorithm, which you generally don't want to use
in practice, or in early CPU spinup with a busy loop that is perfectly
well served by membar_sync or DELAY().  Load-before-load/store, in
contrast, is ubiquitous and important in performance-critical code.

