tech-kern archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: __{read,write}_once

Maxime Villard <> wrote:
> Alright, thanks.
> Do you think there is a performance degradation with using explicitly
> atomic operations (with the "lock" prefix on x86), compared to just using
> an aligned volatile which may not be exactly atomic in that sense (even
> if it happens to be on x86)?

Yes, interlocked atomic operations have considerable performance overhead
and should be treated as relatively expensive.  Whether it is significant,
of course, depends where and how they are used in the code.

You can find various measurements of the atomic operations on the Internet,
but they can differ quite a lot amongst various CPU models.

Also, just to be clear here: the atomic loads/stores (both the C11 routines
and READ_ONCE/WRITE_ONCE) are *not* interlocked operations.

> Typically in sys/uvm/uvm_fault.c there are several lockless stat
> increments like the following:
> 	/* Increment the counters.*/
> 	uvmexp.fltanget++;
> In your (not just rmind@, but tech-kern@) opinion, is it better to switch
> to: atomic_inc_uint(&uvmexp.fltanget);
> or to
> 	__add_once(uvmexp.fltanget, 1);
> as I did in my patch? Considering that the latter may possibly not be
> atomic on certain arches, which could possibly result in garbage when the
> value is read.

In this case it is assumed that inaccurate statistics is better than the
expense of atomic operations.  That is the reason why they are non-atomic
increments, although to prevent from garbage (not only the losses of adds),
they should use atomic loads/stores (or just volatile for now) on a word.

Generally, the right solution to this problem is to convert these counters
to be per-CPU (or per-thread in userspace applications) and aggregate them
when needed (e.g. on fetching the stats).  See percpu(9) API.  This way you
get both the performance and accuracy/correctness.

> I've committed it right away, because this one seemed clear enough, feel
> free to comment or improve if you have a better idea:

It would be better to find a solution which doesn't remove the optimisation
for the fast path.  At least on 64-bit architectures, a relax atomic load
(as 'volatile' for now) on xc->xc_donep could stay.


Home | Main Index | Thread Index | Old Index