Subject: Re: "pmap_unwire: wiring ... didn't change!"
To: Chuck Silvers <chuq@chuq.com>
From: Markus W Kilbinger <kilbi@rad.rwth-aachen.de>
List: port-cobalt
Date: 02/13/2005 21:16:29
>>>>> "Chuck" == Chuck Silvers <chuq@chuq.com> writes:

    >> I've applied your patch and I can confirm vanishing of the
    >> "pmap_unwire: ..." messages (so far, 2 hours now).

    Chuck> cool.

(... 9.5 hours now ;-))

    >> But I still see (my?) data corruption problem:

    Chuck> that sounds like a CPU cache problem to me too, probably in
    Chuck> bus_dma or the cache-flushing code itself. if it's
    Chuck> happening during writes to disk rather than reads from disk
    Chuck> then it's probably in the cache write-back part rather than
    Chuck> the cache invalidate part. I didn't see anything in a brief
    Chuck> look at the code, though.

The tests I mentioned, in which I can reliably reproduce the data
corruption, involve disk access; just _reading_ large amounts of data
from disk is enough to trigger corruption.
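
A read-only check of this kind could be sketched roughly as follows.
This is only an illustration, not the test actually run on the qube2:
the function names, the chunk size and the number of passes are all
made up for the example. It reads the same file several times and
compares checksums, so it never writes to disk. Two caveats: the file
should be larger than RAM, otherwise later passes may be served from
the buffer cache instead of the disk, and a mismatch only shows that
two reads disagree, not which one is correct.

```python
import hashlib
import sys


def file_digest(path, chunk_size=1 << 20):
    """Read the whole file in chunks and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()


def count_mismatches(path, passes=5):
    """Re-read the file `passes` times and count the passes whose
    digest differs from the digest of the first read."""
    reference = file_digest(path)
    return sum(1 for _ in range(passes) if file_digest(path) != reference)


# Hypothetical usage: python reread.py /path/to/large/file [passes]
if __name__ == "__main__" and len(sys.argv) > 1:
    path = sys.argv[1]
    passes = int(sys.argv[2]) if len(sys.argv) > 2 else 5
    bad = count_mismatches(path, passes)
    print("%d of %d re-reads differed from the first read" % (bad, passes))
```

If every pass agrees on a file much larger than RAM, plain reads are
probably not enough to trigger the problem on that machine; if passes
disagree, the corruption shows up without any write to disk.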

I once tested my qube2's RAM with pkgsrc/sysutils/memtester, and no
errors were reported.

I have not noticed any data corruption when using my qube2 to route
data, but I did no targeted stress testing of that.

    Chuck> I don't see this problem on my R4400 pmax 5000/260, so it's
    Chuck> likely specific to either the RM5200/R5000 or systems with
    Chuck> no L2 cache. do RM5200/R5000 systems with L2 cache (eg.
    Chuck> sgimips) see this? do we support any other MIPS variants
    Chuck> with no L2 cache? maybe some of the hpcmips doodads?

    Izumi> My R5000 O2 with enabled L2 cache also has the same problem.

That's good news for me, in a way! I was still suspecting a hardware
problem with my own qube2...

    Chuck> from the RM5200 manual, it looks like it's possible to use
    Chuck> the cache in write-through mode instead of write-back mode,
    Chuck> it might be worthwhile to try doing that as an experiment
    Chuck> to narrow down the problem.

    Izumi> I guess there is something wrong around r5k cache code
    Izumi> but I can't find any particular problem in cache_r5k.c
    Izumi> when I looked at (but I could be wrong).

Hmm, if the problem occurs on quite different hardware that only
shares the same MIPS CPU type, the (common) r5k cache handling really
seems to be the most probable cause of the corruption (correct?). Or is
bus_dma still a candidate?

How can we narrow it down?

Sorry, I'm not at all familiar with this kind of kernel programming,
but I can of course do any kind of testing!

Markus.