Port-amd64 archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: Why does membar_consumer() do anything on x86_64?

On 22 Jul 2010, at 02:39 , Manuel Bouyer wrote:

> On Wed, Jul 21, 2010 at 05:48:43PM -0700, Dennis Ferguson wrote:
>>> Actually, that's not what I observed while working on Xen rings.
>>> loads *can* be reordered (because of speculative loads). I noticed this
>>> on various, post-ppro CPUs.
>> If you don't use the result of the speculative
>> load it doesn't matter what order it was read in, but if you do use the
>> result the processor can still ensure that the speculative load occurred
>> in program order.
> Actually, I had to use memory barrier in Xen code to get correct
> result. And this was on CPUs up to core 2.

I don't doubt at all that you fixed a bug by inserting a memory barrier.
What I would say, however, is that if you fixed this by inserting a
call to membar_consumer() then you still have a bug since the reordering
of loads against other loads is not something that x86 processors do,
but that's the only thing that membar_consumer() promises to protect

Of course it is still quite possible that a call to membar_consumer()
fixed your problem.  The thing is that the current implementation
of membar_consumer() for the x86 does more than cause subsequent loads
to be ordered with respect to previous loads (which is all the man page
promises, and which x86 processors do on their own anyway), it also
prevents load/store and store/load misordering, and the x86 does do
the latter.  If what you had was really a store/load problem then the
call to membar_consumer() on the x86 will fix it, but you'll still
have a bug since membar_consumer()'s man page description doesn't
promise to fix it.

This is precisely what I'm complaining about.  membar_consumer() does
more than its man page says, and is relatively expensive to call.  In
code on the x86 which only needs to protect against load/load
misordering the current implementation of membar_consumer() is a waste.
And as your example may suggest it is also a portability problem since code
which works fine on the x86 with a call to membar_consumer() inserted
in MI code may not work if other architectures implement membar_consumer()
to only do what the man page says it should do.  membar_consumer() on the x86
shouldn't protect against anything other than load/load misordering
(i.e. be a nop) because otherwise it costs more on the x86 and potentially
hides bugs that other ports may have to find.

>>       In this context:
>>       -  Loads do not pass previous loads (loads are not re-ordered). Stores
>>          do not pass previous stores (stores are not re-ordered)
>>          In the examples below all memory values are initialized to zero.
>>          Processor 0         Processor 1
>>          Store A ? 1        Load B
>>          Store B ? 1        Load A
>>          Load A cannot read 0 when Load B reads 1.
> You stopped at the first example. Later there is, for example:

The reason I stopped here is that this is the only example relevant
to membar_consumer() (only orders loads against loads) and membar_producer()
(only orders stores against stores).

Look, in machine independent NetBSD code the above example *must* be
implemented like this

            Processor 0            Processor 1
            a = 1;                 my_b = b;
            membar_producer();     membar_consumer()
            b = 1;                 my_a = a;

to ensure that (my_a == 0 && my_b == 1) is never true.  According to the
plain language above, however, on the x86 (my_a == 0 && my_b == 1)
will never be true even if membar_producer() and membar_consumer() do
nothing (well, they must do something to keep gcc from reordering those
operations, but needn't generate any CPU instructions).  My assertion is not
only that membar_producer() and membar_consumer() may do nothing in this case,
but that they *should* do nothing in this case.  This is the only case
which the man page descriptions of membar_consumer() and membar_producer()
promise to fix, so having those functions do more than necessary on the
x86 makes x86 code more expensive than necessary, and may also hide bugs
on the x86 related to other types of misordering which other ports,
whose implementations of those functions adhere more closely to the man page
descriptions, will have a problem with.

> - Non-overlapping Loads may pass stores.
>                   Processor 0                      Processor 1
>                  Store A  1                       Store B  1
>                  Load B                           Load A
> All combinations of Load A and Load B values are allowed. Where sequential
> consistency is needed (for example in Dekker's algorithm for mutual 
> exclusion),
> an MFENCE instruction should be used between the store and the subsequent
> load, or a locked access, such as LOCK XCHG, should be used for the store.

But you've picked an example that membar_consumer() explicitly does not
promise to fix!  member_consumer() only promises to order loads against
previous loads, it says nothing about ordering loads against previous
stores; for that you want membar_enter() (or membar_sync()), which must
indeed implement a memory barrier on the x86.

The fact that the current implementation of membar_consumer() on the x86
will fix the above anyway is the problem.  Not only does putting a memory
barrier in membar_consumer() on the x86 make it unnecessarily expensive,
it also causes membar_consumer() to fix the above problem on the x86 which
can lead you to believe you've actually fixed it when what you've instead
done is left a bug behind for other ports to find (or to "fix" by making
their implementation of membar_consumer() do more than it should too).

If you think this is incorrect, please produce an load/load sequence which
you think requires membar_consumer() to have a memory barrier, i.e. which
isn't guaranteed to work right without it by the first example above.  I
believe there aren't any, the first example covers all valid uses of

>> Both AMD and Intel gave this revelation about how their processors
>> actually worked in 2007, and fixed up their manuals sometime after that.
>> If the example above is always true without memory barriers then
>> for SMP programming membar_producer() and membar_consumer() can be
>> nops.  Linux has run with their equivalents of the above doing nothing
>> for 2.5 years now, and I've been running my NetBSD amd64 and i386
>> development machines with kernels with membar_producer() and
>> membar_consumer() nop'd for a couple of weeks now with no ill effects
>> that I've seen so far.
> The Xen issues with missing barrier were very hard to reproduce,
> and showed up only on havily loaded systems, and only once in a while.

I certainly believe that, and I wouldn't be surprised if I eventually
find bugs in my systems as a result of that change as well.

But that's not the point.  The point is that if you "fixed" your problem
with a call to membar_consumer() then, if your problem wasn't compiler
reordering of the operations (which doesn't require a memory barrier to
fix), you have certainly not really fixed it since you are relying on
behaviour of the x86 membar_consumer() which its man page says it need not
have.  To really fix the problem you should have used a different function.
For me the over-constrained behaviour of membar_consumer() is a problem since
it makes some code expensive on the x86 when it should not be.
membar_consumer() on the x86 should do only what its man page says
it should do, and code which fails to work when it does should replace
it with a call to a function appropriate for the reordering problem
it is actually having (which won't be a load/load problem since
Intel and AMD x86 CPUs don't have that problem: "loads are not reordered"
and "stores are not reordered", followed immediately by a 2-processor
example of this, just can't be interpreted any other way).

Dennis Ferguson

Home | Main Index | Thread Index | Old Index