port-cobalt: Re: "pmap_unwire: wiring ... didn't change!"

Subject: Re: "pmap_unwire: wiring ... didn't change!"
To: Markus W Kilbinger <kilbi@rad.rwth-aachen.de>
From: Chuck Silvers <chuq@chuq.com>
List: port-cobalt
Date: 02/14/2005 18:46:49
On Mon, Feb 14, 2005 at 09:37:40PM +0100, Markus W Kilbinger wrote:
> >>>>> "Chuck" == Chuck Silvers <chuq@chuq.com> writes:
> 
>     >> My mentioned tests, where I can reproduce the data corruption
>     >> certainly, involve disk access; _reading_ large data amounts
>     >> from disk is enough to get a corruption.
> 
>     Chuck> so you get different corruption when you read the same file
>     Chuck> at different times? that's useful to know.
> 
> Yes: A (large) file that was intact a first time seems to be corrupt
> when reading it later a second time (and vice versa).

ok, so it's clear the corruption can happen when reading from disk.
can you tell if the file is ever corrupted while being written to disk?
you'll need to use a pattern generator to create some files (rather than
reading them from disk).  ideally you'd check the data by reading it on
some other platform that doesn't have this bug, but I'd think that if
all of the data can be read back correctly on the qube at least part of
the time, then it's likely that the corruption is only on the read side.
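
a minimal sketch of such a pattern generator, for illustration only.  the
block size, the word layout and the function names (block_bytes,
write_pattern, verify_pattern) are assumptions made for this example, not
an existing tool.  each block's contents depend only on its index, so the
checker can rebuild the expected data without a reference copy and report
which blocks came back wrong.

```python
import os
import struct

BLOCK = 512  # assumed block size; any multiple of 8 works


def block_bytes(seed, index):
    """Return one BLOCK-sized block that depends only on (seed, index)."""
    # Each 8-byte word holds the (seeded) block index and the word offset,
    # so a stale or misplaced block is easy to recognise in a hex dump.
    words = BLOCK // 8
    return b"".join(
        struct.pack("<II", (index ^ seed) & 0xFFFFFFFF, w) for w in range(words)
    )


def write_pattern(path, nblocks, seed=0):
    """Write nblocks generated blocks to path and push them to disk."""
    with open(path, "wb") as f:
        for i in range(nblocks):
            f.write(block_bytes(seed, i))
        f.flush()
        os.fsync(f.fileno())


def verify_pattern(path, nblocks, seed=0):
    """Return the indices of blocks that do not match the expected pattern."""
    bad = []
    with open(path, "rb") as f:
        for i in range(nblocks):
            if f.read(BLOCK) != block_bytes(seed, i):
                bad.append(i)
    return bad


if __name__ == "__main__":
    # Hypothetical file name and size, chosen only for the example.
    write_pattern("pattern.bin", 1024, seed=1)
    print("bad blocks:", verify_pattern("pattern.bin", 1024, seed=1))
```

note that data read back right after writing will usually come from the
buffer cache rather than the disk, so for a real test the file should be
larger than RAM, or the filesystem should be unmounted and remounted
before verifying.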


> For me it seems I can diminish these corruptions by putting some other
> load on the qube2. On the other hand, pure disk access (e.g. untarring
> new base.tgz :-/) quite reliably produces some corrupt files
> (libc.so... :-(). To work around the latter I run 'nice pax -zvrpe ...'
> over an ssh connection, so that pax's '-v' verbose output produces some
> additional load which prevents most file corruptions (not all: some,
> especially larger files, might still get corrupted).

that makes sense.  the other activity is likely to reuse the cache lines
containing the stale data that we are failing to invalidate, reducing the
chance that you'll see the problem.
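
a toy model of that effect, for illustration only.  this is not the r5k
cache code or bus_dma; the line size, the tiny direct-mapped layout and the
names ToyCache and dma_write are all assumptions made for the example.  it
shows a device write that bypasses the cache, a stale read when the
invalidate is missing, and how an unrelated access that maps to the same
cache line happens to hide the bug.

```python
LINE = 32    # assumed cache line size in bytes (illustrative only)
NLINES = 4   # tiny direct-mapped cache so conflicts are easy to trigger


class ToyCache:
    def __init__(self, memory):
        self.memory = memory  # bytearray standing in for RAM
        self.lines = {}       # slot -> (tag, cached line contents)

    def read(self, addr):
        """Read one byte through the cache, filling the line on a miss."""
        line_no = addr // LINE
        slot, tag = line_no % NLINES, line_no // NLINES
        hit = self.lines.get(slot)
        if hit is None or hit[0] != tag:
            base = line_no * LINE
            hit = (tag, bytes(self.memory[base:base + LINE]))
            self.lines[slot] = hit  # replaces whatever was in this slot
        return hit[1][addr % LINE]

    def invalidate(self, addr, length):
        """Drop every cached line that overlaps [addr, addr + length)."""
        for line_no in range(addr // LINE, (addr + length - 1) // LINE + 1):
            slot, tag = line_no % NLINES, line_no // NLINES
            if slot in self.lines and self.lines[slot][0] == tag:
                del self.lines[slot]


def dma_write(memory, addr, data):
    """Device writes straight to memory, bypassing the cache."""
    memory[addr:addr + len(data)] = data


if __name__ == "__main__":
    mem = bytearray(1024)
    cache = ToyCache(mem)
    cache.read(0)                        # line 0 is now cached (all zeros)
    dma_write(mem, 0, b"\xaa" * LINE)    # new data arrives, no invalidate
    print(hex(cache.read(0)))            # stale value from the cache
    cache.read(NLINES * LINE)            # other activity reuses the same slot
    print(hex(cache.read(0)))            # miss, so fresh data from memory
```

calling cache.invalidate(0, LINE) after the dma_write gives the correct
data without relying on other activity to push the stale line out.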


>     >> Hmm, if the problem occurs on quite different hardware, just
>     >> having the same mips CPU type, (common) r5k cache handling
>     >> seems really to be the most probable cause of the corruption
>     >> (correct?). Or is bus_dma still a candidate?
> 
>     Chuck> could be either, we don't know yet. the various versions of
>     Chuck> the bus_dma code for all the MIPS3 platforms are pretty
>     Chuck> similar.
> 
> ...despite the fact that the same bus_dma code works for/on your R4k4?

well, it's not exactly the same, and it could be an interaction that
only shows up with the r5k cache.  it does seem likelier that it's the
r5k cache code though.


>     Chuck> FYI, I'm probably not going to have time to pursue this
>     Chuck> cache problem soon, so hopefully one of the other MIPS guys
>     Chuck> can run with it.
> 
> Maybe I should send-pr now (more precisely)?

please, that would be great.

-Chuck