Subject: x86 instructions reordering
To: None <>
From: Manuel Bouyer <>
List: port-i386
Date: 03/24/2005 16:22:33
Can newer x86 CPUs (a hyperthreaded P4 in my case) reorder instructions
or memory writes? If so, how can we impose barriers? I didn't find
anything obvious in the x86 SMP code besides the atomic bit operations (which
don't work in my case).

Basically I have these two pieces of code in Xen (one in NetBSD, one in Linux),
a sender and a receiver, communicating through a piece of shared memory.
The receiver:                          |     The sender:
handle_event()                         |     send()
{                                      |     {
                                       |             a = shared_memory->a;
again:                                 |             do_something;
        a = shared_memory->a;          |             wmb();
        __insn_barrier();              |             shared_memory->a = a + 1;
        b = shared_memory->b;          |             mb();
        while (b < a) {                |             if (shared_memory->b == a)
                /* do something */     |                     send_event();
                did_something = 1;     |     }
                b++;                   |
        }                              |
        __insn_barrier();              |
        shared_memory->b = b;          |
        __insn_barrier();              |
        if (did_something)             |
                goto again;            |
}                                      |

The sender is a piece of Linux code; mb() and wmb() are both
__asm__ __volatile__ ("lock; addl $0,0(%%esp)": : :"memory")
which is the same as our x86_lfence(). I tried replacing __insn_barrier()
with x86_lfence(), but the assembly produced by gcc didn't change.
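
For context, here is a minimal sketch of the difference between a
compiler-only barrier and a hardware fence, written with C11 <stdatomic.h>
rather than the kernel macros. It assumes __insn_barrier() is a compiler-only
barrier (an empty asm with a "memory" clobber). The names compiler_barrier(),
full_barrier() and publish_then_check() are hypothetical and are not NetBSD
or Linux APIs.

```c
#include <stdatomic.h>

/* Compiler-only barrier: stops the compiler from moving memory accesses
 * across this point, but is not meant to order anything between CPUs. */
static inline void compiler_barrier(void)
{
        atomic_signal_fence(memory_order_seq_cst);
}

/* Full fence: also orders this CPU's earlier stores before its later
 * loads as seen by other CPUs. On x86, compilers typically emit mfence
 * or a lock-prefixed instruction for it. */
static inline void full_barrier(void)
{
        atomic_thread_fence(memory_order_seq_cst);
}

/* Store to one location, then read another: the pattern where only a
 * full fence keeps the load from being satisfied ahead of the store. */
static int publish_then_check(atomic_int *flag, atomic_int *other)
{
        atomic_store_explicit(flag, 1, memory_order_relaxed);
        full_barrier();
        return atomic_load_explicit(other, memory_order_relaxed);
}
```

Swapping full_barrier() for compiler_barrier() in publish_then_check() would
still keep gcc from reordering the two accesses, but would leave the CPU free
to do so.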

So basically, the sender sends an event only if the receiver isn't already busy.
But sometimes the receiver stops and never gets an event. The only way I
can see this happening is if the reads and writes to memory
don't happen in the intended order. This problem only occurs if the
reader and writer are running on different CPUs of the HT P4. I couldn't
reproduce it when I force both virtual machines to run on the same CPU, while
it locks up quickly if each virtual machine runs on a different virtual CPU.
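
As a sketch only, the handshake above can be written with C11 atomics so that
the ordering requirements are explicit. It assumes the suspect is the
receiver's store to b followed by its re-read of a: x86 allows a load to be
satisfied before an older store to a different address becomes visible to the
other CPU, and a compiler-only barrier does not prevent that. struct shared,
events_sent and items_handled are hypothetical stand-ins, and did_something
is reset on each pass, which the code above does not show.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct shared {                 /* hypothetical layout of the shared page */
        atomic_uint a;          /* producer index, written by the sender */
        atomic_uint b;          /* consumer index, written by the receiver */
};

static unsigned events_sent;    /* illustrative counters, not in the original */
static unsigned items_handled;

static void send_event(void) { events_sent++; }

/* Sender: publish one item, notify only if the receiver had caught up. */
static void send(struct shared *sm)
{
        unsigned a = atomic_load_explicit(&sm->a, memory_order_relaxed);
        /* ... fill in the request here ... */
        atomic_store_explicit(&sm->a, a + 1, memory_order_release);
        atomic_thread_fence(memory_order_seq_cst);  /* store a, THEN load b */
        if (atomic_load_explicit(&sm->b, memory_order_relaxed) == a)
                send_event();
}

/* Receiver: drain everything, publish b, then look at a once more. */
static void handle_event(struct shared *sm)
{
        bool did_something;
        do {
                did_something = false;
                unsigned a = atomic_load_explicit(&sm->a, memory_order_acquire);
                unsigned b = atomic_load_explicit(&sm->b, memory_order_relaxed);
                while (b < a) {
                        items_handled++;        /* do something */
                        did_something = true;
                        b++;
                }
                atomic_store_explicit(&sm->b, b, memory_order_relaxed);
                /* Full fence, not a compiler barrier: store b, THEN load a. */
                atomic_thread_fence(memory_order_seq_cst);
        } while (did_something);
}
```

With a full fence on both sides, either the receiver's last read of a sees the
sender's new value, or the sender's read of b sees the receiver's final value
and sends the event. If either fence is only a compiler barrier, both sides
can read stale values and the wakeup is lost.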

Any ideas?

