port-i386: Re: Intel Atlantis motherboard cache woes

Subject: Re: Intel Atlantis motherboard cache woes
To: Kevin M. Lahey <kml@nas.nasa.gov>
From: Jason Thorpe <thorpej@nas.nasa.gov>
List: port-i386
Date: 06/22/1996 11:17:22
[ Kevin came by my office earlier and asked me about this, and I had
  only speculations.  But, I'm not an i386 guru either, so I thought
  he should post to the list.  I will, however, share my speculations
  here with the group.  So, the disclaimer here is: "I don't know much
  about the x86 cache architecture _at all_, so don't laugh too loudly
  if I look like an idiot."  :-)  --thorpej ]

On Fri, 21 Jun 1996 17:09:38 -0700 
 "Kevin M. Lahey" <kml@nas.nasa.gov> wrote:

 > When I run with the 512KB pipeline burst cache COAST module, 
 > I get pretty horrible cache access results.  In fact, I get the
 > same results whether or not I have the COAST module installed.
 > I've tried several different COAST modules with no success.

When the COAST is installed, does the BIOS recognize the cache?  If the 
results are the same, it sounds like it never gets enabled.

 > When I run with a 256KB pipeline burst cache COAST module,
 > the kernel panics with a page fault in supervisor mode,
 > usually after core dumping on the compiles that start up
 > lmbench.  It is a little more robust when I run it at 133 MHz
 > rather than 166 MHz, but it still seems to panic eventually.

Do you see messages such as "data modified on free list"?  For 
quick reference, see:

	http://www.netbsd.org/cgi-bin/query-full-pr?1416

...I filed that last April after having some problems with caches on a 
Pentium system.

 > Any clues?  Am I missing something obvious?

Well, here's my one speculation...this assumes that the cache is a 
write-back type (which Kevin and I honestly don't know, since neither the 
BIOS nor manual seemed willing to tell us...)

From my experiences hacking on the SPARC port, I can't help but wonder if 
address translation is using stale copies of the page tables.

This sort of thing happened to me on my SS10 (when working with Paul and 
Aaron on some of the latter stages of getting the sun4m stuff ready for the 
tree).  The only ``solution'' (it's a hack, really) we could find was to 
cache-inhibit the page, segment, and region table pages (since the magic 
to tell the MMU to check the cache first didn't seem to be working on my 
somewhat quirky SS10).

Now, another bit of SPARC experience, from getting DMA working on the 
sun4 `si' driver, tells me that one needs to be careful to flush the 
write-back cache (like the one on the sun4/200) before doing DMA, though 
the sun4 cache is virtually tagged, so handling it is going to be 
somewhat different (since the DMA hardware actually uses the address 
translation facilities of the sun MMU).
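
The flush-before-DMA point can be sketched as a toy model in Python.  This 
is not driver code: `ram', `cache', cpu_write(), flush_range() and 
dma_read() are all made-up names, and the model ignores the virtual 
tagging and MMU translation that make the real sun4 case more involved.  
It only shows why a device that reads memory directly can see old data 
until the dirty lines are written back.

```python
# Toy model only: every name below is hypothetical, not a kernel interface.

ram = bytearray(16)     # what the DMA device sees
cache = {}              # dirty write-back bytes: offset -> value, not yet in ram


def cpu_write(offset, data):
    """CPU stores land in the write-back cache, not in RAM."""
    for i, byte in enumerate(data):
        cache[offset + i] = byte


def flush_range(offset, length):
    """Write dirty bytes in [offset, offset + length) back to RAM."""
    for addr in range(offset, offset + length):
        if addr in cache:
            ram[addr] = cache.pop(addr)


def dma_read(offset, length):
    """The device reads RAM directly and never looks in the cache."""
    return bytes(ram[offset:offset + length])


cpu_write(4, b"data")
before_flush = dma_read(4, 4)   # device still sees the old, zeroed buffer
flush_range(4, 4)
after_flush = dma_read(4, 4)    # device now sees what the CPU wrote
```

In this model the transfer started before flush_range() picks up zeros, 
and the one started afterwards picks up the bytes the CPU actually wrote.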

So, I guess my point is that we could be seeing a scenario like this:

	- process forks
	- new mappings for child get set up (we think we updated the
	  page tables properly, but the updates are in the w/b cache,
	  and haven't made it out to memory yet)
	- access occurs at the address where we think we have a valid
	  mapping
	- MMU attempts to translate that address, sees no valid mapping
	  because the page table updates haven't reached memory yet
	- *poof*
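
The scenario above can be written down as a toy model in Python.  It is 
only a sketch of the speculation, under the assumption that the table 
walker reads memory without consulting the write-back cache; whether the 
x86 hardware really behaves that way is exactly what I don't know.  The 
class, the function, and the addresses are all made up for illustration.

```python
# Toy model of the speculated failure; all names and addresses are made up.

class WriteBackCache:
    """Holds CPU stores until they are explicitly flushed to memory."""

    def __init__(self, memory):
        self.memory = memory      # backing store: dict of address -> value
        self.dirty = {}           # stores that have not reached memory yet

    def store(self, addr, value):
        self.dirty[addr] = value  # write-back: memory is NOT updated here

    def load(self, addr):
        # The CPU sees its own pending stores first.
        return self.dirty.get(addr, self.memory.get(addr))

    def flush(self):
        self.memory.update(self.dirty)
        self.dirty.clear()


def mmu_translate(memory, pte_addr):
    """Hypothetical table walker that reads memory directly (assumption)."""
    pte = memory.get(pte_addr)
    if pte is None:
        raise LookupError("page fault: no valid mapping")
    return pte


memory = {}
cache = WriteBackCache(memory)
PTE_ADDR = 0x1000                 # made-up location of the child's PTE

cache.store(PTE_ADDR, 0x42000)    # "fork": kernel sets up the new mapping

try:
    mmu_translate(memory, PTE_ADDR)
    faulted = False
except LookupError:
    faulted = True                # the walker saw stale memory: *poof*

cache.flush()                     # push the update out to memory
frame = mmu_translate(memory, PTE_ADDR)
```

In the model, the first translation faults even though the CPU's own view 
(cache.load) already shows the new entry, and the translation after the 
flush succeeds.  Cache-inhibiting the table pages, as we did on the SS10, 
would correspond to having store() write straight through to memory.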

Like I said earlier, I could really be showing my ignorance of how the 
cache and memory management hardware interact (if at all :-) on the x86, 
so take this with a grain of salt.  I'm just trying to provide food for 
thought...

Say, Kevin... If you have DDB in that kernel, type "trace" at the db> 
prompt, and jot it down... that could be helpful.

Ciao.

 -- save the ancient forests - http://www.bayarea.net/~thorpej/forest/ -- 
Jason R. Thorpe                                       thorpej@nas.nasa.gov
NASA Ames Research Center                               Home: 408.866.1912
NAS: M/S 258-6                                          Work: 415.604.0935
Moffett Field, CA 94035                                Pager: 415.428.6939