TSTILE, pmap, and bad analog numbers

To: port-macppc%NetBSD.org@localhost
Subject: TSTILE, pmap, and bad analog numbers
From: Donald Lee <MacPPC2%c.icompute.com@localhost>
Date: Mon, 8 Jul 2013 10:45:53 -0500

I want to summarize my progress in the struggle with macppc 6.0.x.

I capitulated to the instability of macppc and moved the bulk of my web
server load to an i386 machine (also running NetBSD 6.0.2).  This makes
my config more complex, but the i386 machine has not crashed on me yet,
and the PPC machine (production server) keeps waking me up in the middle
of the night when it falls over.

I don't *know* that offloading the web traffic will stop the crashes,
but of all the crashes I've seen, the only common thread is heavy load
on the web server at the time of the crash.

I constructed a test case on this basis, that fails with the GENERIC
kernel, but I cannot seem to get a DEBUG/DIAGNOSTIC/LOCKDEBUG kernel
to (usefully) fail.

In sum, I have 3 nasty bugs.

The first is rare, and not a big concern:
Apr  7 20:26:24 mercy /netbsd: panic: pmap_pte_spill: victim p-pte (0x1ffc3e0) has no pvo entry!
Apr  7 20:26:24 mercy /netbsd: cpu0: Begin traceback...
Apr  7 20:26:24 mercy /netbsd: 0x10664e10: at panic+0x4c
Apr  7 20:26:24 mercy /netbsd: 0x10664e50: at pmap_pte_spill+0x4c0
Apr  7 20:26:24 mercy /netbsd: 0x10664e90: at trap+0x560
Apr  7 20:26:24 mercy /netbsd: 0x10664f20: user DSI read trap @ 0xfdd201f4 by 0xfdfe6f5c: srr1=0x200d032
Apr  7 20:26:24 mercy /netbsd:            r1=0xffffb040 cr=0x28002044 xer=0x20000000 ctr=0 dsisr=0x40000000
Apr  7 20:26:24 mercy /netbsd: cpu0: End traceback...
Apr  7 20:26:24 mercy /netbsd: dumpsys: TBD

I've only seen this a few times.


The second is my nemesis.  It is the dreaded "TSTILE" problem.  It doesn't
crash.  It just grinds to a halt.  Pings to the machine keep working.
This is the one that seems to be caused by apache load.  It appears NOT to
be in the ethernet driver, or even in the networking code at all, since my
test case that reproduces it runs wget fetching things from the apache on
the same machine through localhost.  (Yes, the network code gets exercised,
but not the external drivers.)  If I could get this one fixed, I could sleep.
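
For anyone who wants to try something similar, here is a rough sketch of
that kind of loopback load test.  It is NOT the actual test case (that one
uses wget); the function name hammer(), the thread count, and the request
count are all made up for illustration, and the URL is whatever your local
apache serves.

```python
import threading
import urllib.request
from collections import Counter


def hammer(url, workers=8, requests_per_worker=50, timeout=10.0):
    """Fetch `url` repeatedly from several threads; return outcome counts."""
    counts = Counter()
    lock = threading.Lock()
    # Bypass any configured HTTP proxy: the point is to stay on loopback.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

    def worker():
        for _ in range(requests_per_worker):
            try:
                with opener.open(url, timeout=timeout) as resp:
                    resp.read()  # drain the body so the server does the full work
                outcome = "ok"
            except Exception as exc:  # timeouts, refused connections, HTTP errors
                outcome = type(exc).__name__
            with lock:
                counts[outcome] += 1

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts
```

Calling hammer("http://localhost/") returns a tally of "ok" fetches versus
exception names.  A pile of timeouts while pings still answer would look
like the hang described above, though it would not prove the cause.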

The third is even nastier, in my opinion.  When I run the web statistics, I
use analog to generate the pretty web pictures.  Analog screws up the
numbers.  Instead of gigabytes, I get zettabytes and crazy totals
**intermittently**.  I narrowed this down to a single day, and probably
to a single checkin.  See my email to the list of 3/24/2013:
        Confidence: Chopping between 5.2 and 6.0.1

If the user code is intermittently screwing up its calculations, that means
that the handling of CPU state across context switches is not right
somewhere.  I can easily believe that dropping a register or two would cause
lots more problems than just bad numbers in analog.  I can see it causing
both of the problems above.

The kernel does context switches, too.
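
One way to look for this kind of user-level corruption without analog in
the loop is to repeat a deterministic calculation and compare the results.
The sketch below is only an illustration of that idea, not something that
has been run against this bug; workload() and count_mismatches() are
made-up names, and the arithmetic is arbitrary.

```python
import math


def workload(n=20000):
    """Deterministic mix of integer and floating-point arithmetic."""
    acc_int = 0
    acc_float = 0.0
    for i in range(1, n + 1):
        acc_int = (acc_int * 31 + i) & 0xFFFFFFFF
        acc_float += math.sqrt(i) / (i + 1.0)
    # float.hex() gives an exact representation, so any bit flip shows up.
    return acc_int, acc_float.hex()


def count_mismatches(rounds=100, n=20000):
    """Run the workload `rounds` times; count results that differ from the first."""
    reference = workload(n)
    mismatches = 0
    for _ in range(rounds):
        if workload(n) != reference:
            mismatches += 1
    return mismatches
```

On a healthy machine count_mismatches() should return 0.  Running several
copies at once, alongside the web load, would force more context switches.
A nonzero count would be consistent with lost register state, but it could
equally point at bad memory or something else.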

Progress:
I continue to try to reproduce the TSTILE problem on my test machine
with a non-GENERIC kernel.  If I get a good enough test case, I may
be able to provide some info to help fix it.  I would MUCH rather go
back to a single server, but I can't do that if it's crashing all the time.
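
A sketch of the kind of info that might be worth collecting while the
machine is still responsive: which processes are sitting on the tstile wait
channel.  This assumes ps accepts "-ax -o pid,wchan,comm" and prints one
header row; parse_tstile() and snapshot() are made-up names.

```python
import subprocess


def parse_tstile(ps_output, wchan="tstile"):
    """Return (pid, command) pairs whose wait channel matches `wchan`."""
    stuck = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)  # pid, wchan, rest of line as command
        if len(fields) == 3 and fields[1] == wchan:
            stuck.append((int(fields[0]), fields[2].strip()))
    return stuck


def snapshot():
    """Run ps once and report processes waiting on tstile (assumed ps flags)."""
    out = subprocess.run(
        ["ps", "-ax", "-o", "pid,wchan,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_tstile(out)
```

Logging snapshot() every few seconds to a file during a test run would at
least show which processes pile up first, assuming the box stays alive long
enough to write the log.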

Thanks all for the work you do.

-dgl-

Follow-Ups:
- Re: TSTILE, pmap, and bad analog numbers
  - From: Christos Zoulas
