tech-kern archive
Re: __{read,write}_once
> Date: Sun, 24 Nov 2019 19:25:52 +0000
> From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
>
> This thread is not converging on consensus, so we're discussing the
> semantics and naming of these operations as core and will come back
> with a decision by the end of the week.
We (core) carefully read the thread, and discussed this and the
related Linux READ_ONCE/WRITE_ONCE macros as well as the C11 atomic
API.
For maxv: Please add conditional definitions in <sys/atomic.h>
according to what KCSAN needs, and use atomic_load/store_relaxed
for counters and other integer objects in the rest of your patch.
(I didn't see any pointer loads there.) For uvm's lossy counters,
please use atomic_store_relaxed(p, 1 + atomic_load_relaxed(p)) and
not an __add_once macro -- since these should really be per-CPU
counters, we don't want to endorse this pattern by making it
pretty.
* Summary
We added a few macros to <sys/atomic.h> for this purpose:
atomic_load_<ordering>(p) and atomic_store_<ordering>(p, v). The
orderings are relaxed, acquire, consume, and release, and are intended
to match C11 semantics. See the new atomic_loadstore(9) man page for
reference.
Currently they are defined in terms of volatile loads and stores, but
we should eventually use the C11 atomic API instead in order to
provide the intended atomicity guarantees under all compilers without
having to rely on the folklore interpretations of volatile.
* Details
There are four main properties involved in the operations under
discussion:
1. No tearing. A 32-bit write can't be split into two separate 16-bit
writes, for instance.
* In _some_ cases, namely aligned pointers to sufficiently small
objects, Linux READ_ONCE/WRITE_ONCE guarantee no tearing.
* C11 atomic_load/store guarantees no tearing -- although on large
objects it may involve locks, requiring the C11 type qualifier
_Atomic and changing the ABI.
This was the primary motivation for maxv's original question.
2. No fusing. Consecutive writes can't be combined into one, for
instance, and a read that follows a write can't be elided to simply
return the value that was just written.
* Linux's READ_ONCE/WRITE_ONCE and C11's atomic_load/store
guarantee no fusing.
3. Data-dependent memory ordering. If you read a pointer, and then
dereference the pointer (maybe plus some offset), the reads happen
in that order.
* Linux's READ_ONCE guarantees this by issuing the analogue of
membar_datadep_consumer on DEC Alpha, and nothing on other CPUs.
* C11's atomic_load guarantees this with seq_cst, acquire, or
consume memory ordering.
4. Cost. There's no need to incur cost of read/modify/write atomic
operations, and for many purposes, no need to incur cost of
memory-ordering barriers.
To express these, we've decided to add a few macros that are similar
to Linux's READ_ONCE/WRITE_ONCE and C11's atomic_load/store_explicit
but are less error-prone and less cumbersome:
#include <sys/atomic.h>
- atomic_load_relaxed(p) is like *p, but guarantees no tearing and no
fusing. No ordering relative to memory operations on other objects
is guaranteed.
- atomic_store_relaxed(p, v) is like *p = v, but guarantees no tearing
and no fusing. No ordering relative to memory operations on other
objects is guaranteed.
- atomic_store_release(p, v) and atomic_load_acquire(p) are,
respectively, like *p = v and *p, but guarantee no tearing and no
fusing. They _also_ guarantee for logic like
	Thread A			Thread B
	--------			--------
	stuff();
	atomic_store_release(p, v);
					u = atomic_load_acquire(p);
					things();
that _if_ the atomic_load_acquire(p) in thread B witnesses the state
of the object at p set by atomic_store_release(p, v) in thread A,
then all memory operations in stuff() happen before any memory
operations in things().
No guarantees if only one thread participates -- the store-release
and load-acquire _must_ be paired.
- atomic_load_consume(p) is like atomic_load_acquire(p), but it only
guarantees ordering for data-dependent memory references. Like
atomic_load_acquire, it must be paired with atomic_store_release.
However, on most CPUs, it is as _cheap_ as atomic_load_relaxed.
The atomic load/store operations are defined _only_ on objects no
larger than the architecture can load or store in a single atomic
instruction -- so, for example, on 32-bit platforms they cannot be
used on 64-bit quantities; attempts to do so will lead to compile-time
errors. They are also defined _only_ on
aligned pointers -- using them on unaligned pointers may lead to
run-time crashes, even on architectures without strict alignment
requirements.
* Why the names atomic_{load,store}_<ordering>?
- Atomic. Although `atomic' may suggest `expensive' to some people
(and I'm guilty of making that connection in the past), what's
really expensive is atomic _read/modify/write_ operations and
_memory ordering guarantees_.
Merely preventing tearing and fusing is often cheap -- normal CPU
load/store instructions are usually cheap and atomic, and these
operations help to ensure that (a) we catch mistakes with aggregate
objects like 64-bit words on a 32-bit machine, and (b) the compiler
doesn't do any tricks behind our back to violate those guarantees.
- Load/store. We could say read/write but we see little value in
deviating from the modern C11 API.
- Memory ordering. C11 defines atomic_load and atomic_store with
_sequential consistency_, the most expensive kind of ordering -- in
C11, there is a total order on every sequentially consistent memory
operation that every thread shares. So the names atomic_load and
atomic_store would conflict with that.
It's not obvious from the names READ_ONCE/WRITE_ONCE that any
ordering guarantees are needed. And for things like lossy counters,
ordering is not needed. But in Linux, some applications (like RCU)
_do_ rely on ordering guarantees from READ_ONCE -- and those _must_
be paired with ordering guarantees on the writer side in order to
work.
We could have adopted the rather cumbersome atomic_load_explicit and
atomic_store_explicit from C11, but I figured it would be better if
we just name the five useful versions with shorter names. We see
little value in deviating from the nomenclature in C11, since the
terminology `relaxed', `acquire', and `release' in the literature is
ubiquitous today (personally, I might prefer `unordered' over
`relaxed', but not enough to warrant divergence from the literature
and standard), and the semantics don't exactly match our existing
membar_ops(3) anyway.
Thus, the names are atomic_{load,store}_* and annotated with the C11
memory ordering so you have to be clear about it -- but not quite as
cumbersome as the C11 `atomic_load_explicit(p, memory_order_acquire)'.
General rules:
- For any atomic_load_acquire or atomic_load_consume, make sure you
can identify the atomic_store_release that it corresponds with, and
vice versa. Leave a code comment on each part pointing out its
counterpart.
- Translate Linux READ_ONCE into atomic_load_consume, unless you
_must_ operate on large or unaligned objects.
=> Optimization: If downstream memory operations do not depend on
the value, then you can use atomic_load_relaxed.
- Translate Linux WRITE_ONCE into atomic_store_relaxed, unless you
_must_ operate on large or unaligned objects.
* How do they relate to existing atomic_ops(3) and membar_ops(3) API?
We're still working on details, but for now, you can treat
atomic_r/m/w(p, ...); // from atomic_ops(3), except the *_ni
membar_enter();
as a load-acquire, and
membar_exit();
atomic_r/m/w(q, ...); // from atomic_ops(3), except the *_ni
as a store-release. On architectures with __HAVE_ATOMIC_AS_MEMBAR,
such as x86, the membar_enter/exit is not necessary and every
atomic_r/m/w implies store-release _and_ load-acquire.
(Caveat parallel programmer: membar_enter is _not_ the same as C11
atomic_thread_fence(memory_order_acquire), and atomic_load_relaxed
followed by membar_enter() is _not_ a load-acquire, which is why this
is not the end of the story.)