tech-kern archive
Re: __{read,write}_once
> Date: Sun, 24 Nov 2019 19:25:52 +0000
> From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
>
> This thread is not converging on consensus, so we're discussing the
> semantics and naming of these operations as core and will come back
> with a decision by the end of the week.
We (core) carefully read the thread, and discussed this and the
related Linux READ_ONCE/WRITE_ONCE macros as well as the C11 atomic
API.
For maxv: Please add conditional definitions in <sys/atomic.h>
according to what KCSAN needs, and use atomic_load/store_relaxed
for counters and other integer objects in the rest of your patch.
(I didn't see any pointer loads there.) For uvm's lossy counters,
please use atomic_store_relaxed(p, 1 + atomic_load_relaxed(p)) and
not an __add_once macro -- since these should really be per-CPU
counters, we don't want to endorse this pattern by making it
pretty.
* Summary
We added a few macros to <sys/atomic.h> for this purpose:
atomic_load_<ordering>(p) and atomic_store_<ordering>(p, v). The
orderings are relaxed, acquire, consume, and release, and are intended
to match C11 semantics. See the new atomic_loadstore(9) man page for
reference.
Currently they are defined in terms of volatile loads and stores, but
we should eventually use the C11 atomic API instead in order to
provide the intended atomicity guarantees under all compilers without
having to rely on the folklore interpretations of volatile.
* Details
There are four main properties involved in the operations under
discussion:
1. No tearing. A 32-bit write can't be split into two separate 16-bit
writes, for instance.
* In _some_ cases, namely aligned pointers to sufficiently small
objects, Linux READ_ONCE/WRITE_ONCE guarantee no tearing.
* C11 atomic_load/store guarantees no tearing -- although on large
objects it may involve locks, requiring the C11 type qualifier
_Atomic and changing the ABI.
This was the primary motivation for maxv's original question.
2. No fusing. Consecutive writes can't be combined into one, for
instance, and a read that follows a write can't be elided to simply
return the value that was just written.
* Linux's READ_ONCE/WRITE_ONCE and C11's atomic_load/store
guarantee no fusing.
3. Data-dependent memory ordering. If you read a pointer, and then
dereference the pointer (maybe plus some offset), the reads happen
in that order.
* Linux's READ_ONCE guarantees this by issuing the analogue of
membar_datadep_consumer on DEC Alpha, and nothing on other CPUs.
* C11's atomic_load guarantees this with seq_cst, acquire, or
consume memory ordering.
4. Cost. There's no need to incur cost of read/modify/write atomic
operations, and for many purposes, no need to incur cost of
memory-ordering barriers.
To express these, we've decided to add a few macros that are similar
to Linux's READ_ONCE/WRITE_ONCE and C11's atomic_load/store_explicit
but are less error-prone and less cumbersome:
#include <sys/atomic.h>
- atomic_load_relaxed(p) is like *p, but guarantees no tearing and no
fusing. No ordering relative to memory operations on other objects
is guaranteed.
- atomic_store_relaxed(p, v) is like *p = v, but guarantees no tearing
and no fusing. No ordering relative to memory operations on other
objects is guaranteed.
- atomic_store_release(p, v) and atomic_load_acquire(p) are,
respectively, like *p = v and *p, but guarantee no tearing and no
fusing. They _also_ guarantee for logic like
	Thread A			Thread B
	--------			--------
	stuff();
	atomic_store_release(p, v);
					u = atomic_load_acquire(p);
					things();
that _if_ the atomic_load_acquire(p) in thread B witnesses the state
of the object at p set by atomic_store_release(p, v) in thread A,
then all memory operations in stuff() happen before any memory
operations in things().
No guarantees if only one thread participates -- the store-release
and load-acquire _must_ be paired.
- atomic_load_consume(p) is like atomic_load_acquire(p), but it only
guarantees ordering for data-dependent memory references. Like
atomic_load_acquire, it must be paired with atomic_store_release.
However, on most CPUs, it is as _cheap_ as atomic_load_relaxed.
The atomic load/store operations are defined _only_ on objects no
larger than the architecture can load or store in a single atomic
instruction -- so, for example, on 32-bit platforms they cannot be
used on 64-bit quantities; attempts to do so will lead to compile-time
errors. They are also defined _only_ on
aligned pointers -- using them on unaligned pointers may lead to
run-time crashes, even on architectures without strict alignment
requirements.
* Why the names atomic_{load,store}_<ordering>?
- Atomic. Although `atomic' may suggest `expensive' to some people
(and I'm guilty of making that connection in the past), what's
really expensive is atomic _read/modify/write_ operations and
_memory ordering guarantees_.
Merely preventing tearing and fusing is often cheap -- normal CPU
load/store instructions are usually cheap and atomic, and these
operations help to ensure that (a) we catch mistakes with aggregate
objects like 64-bit words on a 32-bit machine, and (b) the compiler
doesn't do any tricks behind our back to violate those guarantees.
- Load/store. We could say read/write but we see little value in
deviating from the modern C11 API.
- Memory ordering. C11 defines atomic_load and atomic_store with
_sequential consistency_, the most expensive kind of ordering -- in
C11, there is a total order on every sequentially consistent memory
operation that every thread shares. So the names atomic_load and
atomic_store would conflict with that.
It's not obvious from the names READ_ONCE/WRITE_ONCE that any
ordering guarantees are needed. And for things like lossy counters,
ordering is not needed. But in Linux, some applications (like RCU)
_do_ rely on ordering guarantees from READ_ONCE -- and those _must_
be paired with ordering guarantees on the writer side in order to
work.
We could have adopted the rather cumbersome atomic_load_explicit and
atomic_store_explicit from C11, but I figured it would be better if
we just name the five useful versions with shorter names. We see
little value in deviating from the nomenclature in C11, since the
terminology `relaxed', `acquire', and `release' in the literature is
ubiquitous today (personally, I might prefer `unordered' over
`relaxed', but not enough to warrant divergence from the literature
and standard), and the semantics don't exactly match our existing
membar_ops(3) anyway.
Thus, the names are atomic_{load,store}_* and annotated with the C11
memory ordering so you have to be clear about it -- but not quite as
cumbersome as the C11 `atomic_load_explicit(p, memory_order_acquire)'.
General rules:
- For any atomic_load_acquire or atomic_load_consume, make sure you
can identify the atomic_store_release that it corresponds with, and
vice versa. Leave a code comment on each part pointing out its
counterpart.
- Translate Linux READ_ONCE into atomic_load_consume, unless you
_must_ operate on large or unaligned objects.
=> Optimization: If downstream memory operations do not depend on
the value, then you can use atomic_load_relaxed.
- Translate Linux WRITE_ONCE into atomic_store_relaxed, unless you
_must_ operate on large or unaligned objects.
* How do they relate to existing atomic_ops(3) and membar_ops(3) API?
We're still working on details, but for now, you can treat
atomic_r/m/w(p, ...); // from atomic_ops(3), except the *_ni
membar_enter();
as a load-acquire, and
membar_exit();
atomic_r/m/w(q, ...); // from atomic_ops(3), except the *_ni
as a store-release. On architectures with __HAVE_ATOMIC_AS_MEMBAR,
such as x86, the membar_enter/exit is not necessary and every
atomic_r/m/w implies store-release _and_ load-acquire.
(Caveat parallel programmer: membar_enter is _not_ the same as C11
atomic_thread_fence(memory_order_acquire), and atomic_load_relaxed
followed by membar_enter() is _not_ a load-acquire, which is why this
is not the end of the story.)