Port-amd64 archive

Why does membar_consumer() do anything on x86_64?


Unless I'm truly confused, here's what membar_consumer() and membar_producer()
do, respectively, on an x86_64 processor:

            addq    $0, -8(%rsp)

            /* A store is enough */
            movq    $0, -8(%rsp)

I'm trying to figure out why membar_consumer() does that, since the useless
read-modify-write is measurably quite expensive.  I'm also curious why
membar_producer() is implemented as the useless write.

The man page for membar_ops says this about membar_consumer():

      All loads preceding the memory barrier will complete before any
      loads after the memory barrier complete.

The Intel Software Developer's Manual, in section 8.2, says this about
the ordering of loads:

      Reads are not reordered with other reads.

and the AMD manual says the same thing, so it seems that even if membar_consumer()
did nothing at all it would still be the case that "all loads preceding the 
barrier will complete before any loads after the memory barrier complete" since
the processors guarantee that all loads are done in program order.  So it isn't
clear to me what membar_consumer() accomplishes by doing the fairly expensive
operation it actually does.
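
To make the argument concrete, here is a minimal sketch of what a "do nothing"
consumer barrier might look like on x86_64, assuming GCC/Clang inline assembly.
The names x86_consumer_barrier() and read_guarded() are hypothetical and are not
part of membar_ops. The barrier emits no instruction; it only prevents the
compiler from reordering or caching memory accesses across it, leaving the
load ordering itself to the processor's guarantee quoted above.

```c
#include <stdint.h>

/*
 * Hypothetical x86_64-only consumer barrier: generates no machine
 * instruction.  The "memory" clobber tells the compiler not to move
 * or cache memory accesses across this point (GCC/Clang syntax).
 */
static inline void
x86_consumer_barrier(void)
{
	__asm__ __volatile__("" ::: "memory");
}

/*
 * Read a flag, then the data it guards.  The barrier keeps the
 * compiler from hoisting the data load above the flag load; the
 * hardware is assumed to keep the two loads in program order.
 */
static inline uint64_t
read_guarded(const volatile int *ready, const uint64_t *data)
{
	if (!*ready)
		return 0;
	x86_consumer_barrier();
	return *data;
}
```

Whether a compiler-only barrier is acceptable depends on every load involved
being an ordinary load from ordinary write-back memory; this sketch assumes so.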

As for membar_producer(), the man page says its purpose is to accomplish this:

     All stores preceding the memory barrier will reach global visibility
     before any stores after the memory barrier reach global visibility.

but the Intel manual says this about write ordering:

     Writes to memory are not reordered with other writes, with
     the following exceptions:

     - writes executed with the CLFLUSH instruction;
     - streaming stores (writes) executed with the non-temporal move 
     - string operations (see Section

It seems to me that membar_producer() could also do nothing at all if it
weren't for those exceptions (and if we care about the exceptions), but I
can't find anything in the Intel or AMD manual to suggest that doing a
single store somehow fixes those exceptions.  What does the store do?

The reason I'm asking is that I'm implementing a route lookup data structure
(e.g. for a kernel routing table) which allows routes to be added and deleted
while forwarding lookups are concurrently being performed, and I'm still
undecided about whether any processor architecture needs membar_consumer()
calls in the lookup function (I suspect not, but I still need to think about
it).  What I do know for sure, however, is that membar_consumer() calls
are not needed for Intel/AMD processors, since the processors' guarantees about
read and write ordering are more than sufficient by themselves, yet if I need
to add them for other architectures their presence more than doubles the
time it takes to do a route lookup on Intel machines (i.e. it spends more
time in membar_consumer() than it does actually doing the route lookup).
Since I know I don't need membar_consumer() to do anything, and since its
definition doesn't seem to require it to do anything on Intel processors,
I'm wondering why it does what it does.
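
For illustration only, the publish/lookup pattern in question can be sketched
with C11 atomics as a portable stand-in for the membar_ops calls. This is not
the actual route lookup structure: struct route, route_slot, route_publish()
and route_lookup() are hypothetical names, and a single pointer slot stands in
for the real data structure. On x86_64, GCC and Clang typically compile both
the release store and the acquire load below to plain MOV instructions, which
is the behaviour the ordering guarantees above would lead one to expect.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical route entry; field names are illustrative only. */
struct route {
	uint32_t prefix;
	uint32_t nexthop;
};

/* A single slot standing in for a real lookup structure. */
static _Atomic(struct route *) route_slot;

/* Writer: fill in the entry completely, then publish the pointer. */
static void
route_publish(struct route *r, uint32_t prefix, uint32_t nexthop)
{
	r->prefix = prefix;
	r->nexthop = nexthop;
	/* Release: the stores above become visible before the pointer. */
	atomic_store_explicit(&route_slot, r, memory_order_release);
}

/* Reader: load the pointer, then read through it. */
static int
route_lookup(uint32_t prefix, uint32_t *nexthop)
{
	/* Acquire: the loads below are ordered after the pointer load. */
	struct route *r =
	    atomic_load_explicit(&route_slot, memory_order_acquire);

	if (r == NULL || r->prefix != prefix)
		return 0;
	*nexthop = r->nexthop;
	return 1;
}
```

A reader that sees the new pointer is then guaranteed to see the fully
initialized entry behind it, with the cost of the ordering left to whatever
the target architecture actually requires.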

Dennis Ferguson
