tech-kern: Fwd: Pentium Pro architecture (synchronization/speculative loads) bits

Subject: Fwd: Pentium Pro architecture (synchronization/speculative loads) bits
To: None <core@NetBSD.ORG>
From: Greg Earle <earle@isolar.Tujunga.CA.US>
List: tech-kern
Date: 05/06/1997 01:24:24

Hi folks,
	Not sure who would be most interested in this, but I saw this fly by
on the Plan 9 mailing list.  I suspect that most of the info that could
possibly be relevant to us would be of interest to anyone doing compiler work
(do we do any customization/knob-twiddling to gcc ourselves for our platforms?)
but at any rate, it might be worth filing away for future use nonetheless.

	- Greg

P.S. You can find these bits in DejaNews using a search string of 
     "Haertel & ~g (comp.os.plan9)"

------- Forwarded Message

From: presotto@plan9.bell-labs.com
Message-Id: <199704221935.PAA02595@cse.psu.edu>
To: 9fans@cse.psu.edu
Date: Tue, 22 Apr 1997 15:35:29 -0400
Subject: How the Pentium Pro really works
Reply-To: 9fans@cse.psu.edu

Here are 3 messages from Mike Haertel describing how the
Pentium Pro works vis-a-vis synchronization.  As he
forcefully points out, the problem isn't speculative
loads, it's just queued stores.  As expected, surrounding
shared accesses with spin locks is sufficient.  Only
iffy operations like our current version of sleep/wakeup
have to be more carefully handled.

An interesting point is that the same model exists
on the Pentium.  However, the shorter pipelines and
buffers in the Pentium are less likely to exacerbate
the problem.  We were just lucky.

===================================

To: research.bell-labs.com!presotto
Subject: Pentium Pro and coherence
Date: Tue, 22 Apr 1997 01:15:56 -0700
From: Mike Haertel <mike@ducky.net>

In article <199704211614.MAA02731@cse.psu.edu>, you wrote:
> The Pro people have remained silent
> on the subject (we've sent email).

Hi, I am an architect at Intel.  Who did you send email to?
I'm surprised you got no response.  In any case, perhaps I
can clarify things a little.

> Of course, I could be totally wrong about the speculative reads and
> it may be the interlock instruction on the writer and not the
> reader that causes the processors to become coherent.

The caches are always coherent using an MESI protocol.  The real
problem is that not all written data in the system is in the cache(s).

The Pentium Pro's memory ordering model is called "processor ordering"
and is a formalization of the 486's semantics.  The 486 had
a write-through cache with a write queue to memory which was
not snooped by loads on other processors.

Loosely speaking, this means the ordering of events originating
from any one processor in the system, as observed by other processors,
is always the same.  However, different observers are allowed
to disagree on the interleaving of events from two or more processors.

The PPro does speculative and out-of-order loads.  However,
it has a mechanism called the "memory order buffer" to ensure
that the above memory ordering model is not violated.  Load
and store instructions do not get retired until the processor
can prove there are no memory ordering violations in the actual
order of execution that was used.  Stores do not get sent to
memory until they are ready to be retired.  If the processor
detects a memory ordering violation, it discards all unretired
operations (including the offending memory operation) and
restarts execution at the oldest unretired instruction.

i.e. when a violation is detected the MOB whacks the machine ... :-)

For example, consider the following sequence:

	P1:     load (1000) -> reg      P2:     store 10 -> (1000)
		load (1000) -> reg              store 20 -> (1000)

Suppose on P1, the 2nd load speculatively executes first (for
whatever reason), and picks up 10 (the result of the first
store on P2).  Later, P2 executes the 2nd store (causing the
cached copy of location 1000 on P1 to be invalidated), and
finally P1 executes the 1st load.  At this point, P1 discovers
that a younger load has already read from the same location,
and that the location was subsequently invalidated by P2.  P1
says "a-ha!  that violates the memory ordering model!", clobbers
the speculative state of the machine from the offending
instruction (the 1st load) onward, and resumes execution
starting at the offending load.
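
The replay rule in this example can be sketched as a toy model in
Python.  This is only an illustration of the behaviour described above,
not the actual MOB logic: the function name `run`, the event tuples and
the `check_ordering` switch are invented here, and location 1000 is
assumed to start at 0.

```python
def run(events, addrs, check_ordering=True):
    """Toy replay of loads on P1 that may execute out of program order.

    addrs[i] is the address read by P1's i-th load in program order.
    events is the order in which things actually happen:
      ("remote_store", addr, value)  a store by P2 becomes visible
      ("load", i)                    P1 executes its i-th load
    Returns (values in program order, number of squashes).
    """
    memory = {addr: 0 for addr in addrs}   # assumed initial contents
    results = {}          # load index -> value it picked up
    invalidated = set()   # executed loads whose location was written since
    squashes = 0
    for ev in events:
        if ev[0] == "remote_store":
            _, addr, value = ev
            memory[addr] = value
            invalidated |= {i for i in results if addrs[i] == addr}
            continue
        _, idx = ev
        # A younger load already read this location, and the location was
        # invalidated afterwards: that is the ordering violation.
        violation = any(i > idx and addrs[i] == addrs[idx] for i in invalidated)
        if check_ordering and violation:
            squashes += 1
            # Discard this load and every younger executed load, then
            # re-execute them in program order.
            redo = sorted({idx} | {i for i in results if i > idx})
            for i in redo:
                results[i] = memory[addrs[i]]
            invalidated -= set(redo)
        else:
            results[idx] = memory[addrs[idx]]
    return [results[i] for i in range(len(addrs))], squashes


# The sequence from the example: the 2nd load runs early and sees 10,
# P2 then stores 20, and only then does the 1st load execute.
EVENTS = [("remote_store", 1000, 10), ("load", 1),
          ("remote_store", 1000, 20), ("load", 0)]
```

With the check enabled, `run(EVENTS, [1000, 1000])` gives `([20, 20], 1)`:
one squash, and both loads end up with 20.  With `check_ordering=False`
the result is `([20, 10], 0)`, i.e. P1 would appear to see P2's two
stores in the wrong order.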

Serializing instructions like CPUID force the machine to wait
until all queued stores have been written out.  (Actually,
serializing instructions force the machine to wait until they
are retired, but they cannot retire until all older stores
have retired, which has an effect equivalent to draining a
store queue.)  Note that serializing instructions do not
serialize the other processors, only the local processor.

You should be able to reproduce your bug by manually working
through the possible processor-ordering-consistent interleavings
of events from multiple processors.  Note that you should
think of a processor as also observing itself.

Finally, since the caches are actually fully coherent, you
should be able to do correct locking without too many serializing
instructions, perhaps without any.

Future Intel processors will implement the same memory
ordering model.

===================================

To: research.bell-labs.com!presotto
Subject: Re: Pentium Pro and coherence 
Date: Tue, 22 Apr 1997 09:24:42 -0700
From: Mike Haertel <mike@ducky.net>

> 0,0 blows us away.  If I understand correctly, putting a
> synchronizing instruction between the writes and subsequent read
> 
>	P1:                     P2:
>	x = 0                   y = 0
>	x = 1                   y = 1
>	cpuid                   cpuid
>	read y                  read x
> 
> will cause the processor the instruction was executed on
> to wait until all processors have gotten out their
> queued stores and then blow away any inconsistencies
> caused by speculative loads.

The cpuid waits only until the *local* processor has gotten
out its queued stores.  It doesn't wait for any of the other
processors.  However, in this example (where all processors
do cpuid before any processor does a load) I think you're OK.

The cpuid forces the local processor to wait until its queued
writes have been globally observed.  What this means is that
you are effectively serializing access to "the bus" (really,
the combination of the bus and the coherent caches--writes to
M-state cache lines on the local processor count as "globally
observed").  Some processor (say P2) is last to execute cpuid.
This means that P1 has already executed cpuid, therefore P1's
"x=1" has been globally observed, so P2's load is guaranteed
to see x=1.

Finally, I'd like to emphasize: The inconsistencies are NOT
caused by speculative loads, they are caused by queued writes
on other processors.

> What we need is that if the following sequence is executed
>
>	P1:                     P2:
>	x = 0                   y = 0
>	x = 1                   y = 1
>	read y                  read x
>
> then the values read will be one of
>
>	1	0
>	0	1
>	1	1
>
> 0,0 blows us away.

You could get 0,0 even on the 486 or Pentium.  The difference
is that the PPro has such deep pipelines and buffers that it
is more likely to expose such bugs.
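
The outcomes discussed in this message can be checked by brute force.
The following is a minimal Python sketch, not Intel's definition of
processor ordering.  It assumes that each processor has a FIFO store
queue which only that processor can read, that x and y start at 0, and
that cpuid can be modelled as a "fence" step which cannot execute until
the local queue is empty.  The names `outcomes` and `litmus` and the
tuple-encoded instructions are invented for illustration.

```python
def outcomes(programs, init):
    """Return every final register state reachable when stores sit in a
    per-processor FIFO queue before they reach shared memory."""
    n = len(programs)
    start = ((0,) * n, ((),) * n, tuple(sorted(init.items())), ())
    seen, stack, finals = set(), [start], set()
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        pcs, bufs, mem, regs = state
        if all(pcs[i] == len(programs[i]) for i in range(n)):
            finals.add(regs)
            continue
        for i in range(n):
            if bufs[i]:
                # The oldest queued store of processor i reaches memory.
                var, val = bufs[i][0]
                new_mem = dict(mem)
                new_mem[var] = val
                new_bufs = bufs[:i] + (bufs[i][1:],) + bufs[i + 1:]
                stack.append((pcs, new_bufs,
                              tuple(sorted(new_mem.items())), regs))
            if pcs[i] == len(programs[i]):
                continue
            op = programs[i][pcs[i]]
            new_pcs = pcs[:i] + (pcs[i] + 1,) + pcs[i + 1:]
            if op[0] == "store":
                # Stores go into the local queue, not straight to memory.
                _, var, val = op
                new_bufs = bufs[:i] + (bufs[i] + ((var, val),),) + bufs[i + 1:]
                stack.append((new_pcs, new_bufs, mem, regs))
            elif op[0] == "load":
                # A load sees its own newest queued store, else memory.
                _, var, reg = op
                queued = [v for (name, v) in bufs[i] if name == var]
                val = queued[-1] if queued else dict(mem)[var]
                new_regs = tuple(sorted(dict(regs, **{reg: val}).items()))
                stack.append((new_pcs, bufs, mem, new_regs))
            elif op[0] == "fence" and not bufs[i]:
                # Models cpuid: may only proceed once the queue is empty.
                stack.append((new_pcs, bufs, mem, regs))
    return finals


def litmus(fenced):
    """Run the two-processor example; returns sorted (r1, r2) pairs,
    where r1 is y as read by P1 and r2 is x as read by P2."""
    fence = [("fence",)] if fenced else []
    p1 = [("store", "x", 0), ("store", "x", 1)] + fence + [("load", "y", "r1")]
    p2 = [("store", "y", 0), ("store", "y", 1)] + fence + [("load", "x", "r2")]
    finals = outcomes([p1, p2], {"x": 0, "y": 0})
    return sorted((dict(f)["r1"], dict(f)["r2"]) for f in finals)
```

Under these assumptions `litmus(fenced=False)` includes (0, 0), while
`litmus(fenced=True)` leaves only (0, 1), (1, 0) and (1, 1).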

===================================

To: research.bell-labs.com!presotto
Date: Tue, 22 Apr 1997 11:01:30 -0700
From: Mike Haertel <mike@ducky.net>

> Do you mind if I repost your mail to the 9fans list?

Sure, go ahead.

One other addendum I'd like to make: in your original post
to 9fans, you mentioned some paranoia about similar problems
possibly existing in other parts of the kernel.

One bit of reassurance: any data structure protected by a spin
lock is safe.  Here's why:

	P1				P2
	[already holding lock]		wait for lock->busy == 0
	store data->x			grab lock
	store data->y			use data->x and ->y
	lock->busy = 0

Because of processor ordering, when P2 observes lock->busy == 0,
it also has observed all prior stores by P1.  Hence P2 never gets
an inconsistent view of P1's updates.

This would not be the case if the Pentium Pro allowed speculative loads
to violate processor ordering semantics.

This is also probably not the case on other processors with
weaker memory ordering semantics.  Digital's Alpha may be one
such processor, I'm not sure.  On those processors, when releasing
a spin lock you need a "lock release" synchronization instruction
rather than a simple store.
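
The spin lock argument can be checked with a small enumeration.  This is
a hedged Python sketch under stated assumptions, not a model of any real
CPU: x and y are assumed to start at 0 and lock->busy at 1, P2's "wait
for lock, grab lock" is reduced to a single load of busy, and P2's loads
execute in program order.  With `fifo=True` P1's stores become visible in
program order (processor ordering); with `fifo=False` they may become
visible in any order, standing in for a hypothetical weaker model.  The
names `observations` and `stale_views` are invented for illustration.

```python
from itertools import permutations

INIT = {"x": 0, "y": 0, "busy": 1}            # assumed initial values
STORES = [("x", 1), ("y", 1), ("busy", 0)]    # P1, in program order
LOADS = ["busy", "x", "y"]                    # P2, in program order


def observations(fifo):
    """Every (busy, x, y) triple P2 can read while P1's three stores
    are becoming visible one at a time."""
    orders = [tuple(STORES)] if fifo else list(permutations(STORES))
    seen = set()
    for order in orders:
        count = len(order)
        # a <= b <= c: how many of P1's stores are visible at the moment
        # each of P2's three loads executes.
        for a in range(count + 1):
            for b in range(a, count + 1):
                for c in range(b, count + 1):
                    values = []
                    for var, visible in zip(LOADS, (a, b, c)):
                        mem = dict(INIT)
                        mem.update(order[:visible])
                        values.append(mem[var])
                    seen.add(tuple(values))
    return seen


def stale_views(fifo):
    """Cases where P2 saw the lock free but missed an update to x or y."""
    return sorted(v for v in observations(fifo)
                  if v[0] == 0 and v[1:] != (1, 1))
```

Here `stale_views(fifo=True)` is empty, matching the argument above,
while `stale_views(fifo=False)` returns (0, 0, 0), (0, 0, 1) and
(0, 1, 0): the lock looks free but some of the data is stale, which is
why such a model would need an explicit release barrier.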

------- End of Forwarded Message