Re: TSTILE, pmap, and bad analog numbers

To: port-macppc%netbsd.org@localhost
Subject: Re: TSTILE, pmap, and bad analog numbers
From: christos%astron.com@localhost (Christos Zoulas)
Date: Tue, 9 Jul 2013 19:27:22 +0000 (UTC)

In article <p06240802ce008b8caed2@[71.39.101.51]>,
Donald Lee  <MacPPC2%c.icompute.com@localhost> wrote:
>I want to summarize my progress in the struggle with macppc 6.0.x.
>
>I capitulated to the instability of macppc and moved the bulk of my web
>server load to an i386 machine (also running NetBSD 6.0.2).  This makes
>my config more complex, but the i386 machine has not crashed on me yet,
>and the PPC machine (production server) keeps waking me up in the middle
>of the night when it falls over.
>
>I don't *know* that the offloading of the web traffic will stop the crashes,
>but of all the crashes I've seen, the only common thread is that when the
>crashes occur, there is heavy load on the web server at the time.
>
>I constructed a test case on this basis, that fails with the GENERIC
>kernel, but I cannot seem to get a DEBUG/DIAGNOSTIC/LOCKDEBUG kernel
>to (usefully) fail.
>
>In sum, I have 3 nasty bugs.
>
>The first is rare, and not a big concern:
>Apr  7 20:26:24 mercy /netbsd: panic: pmap_pte_spill: victim p-pte (0x1ffc3e0) has no pvo entry!
>Apr  7 20:26:24 mercy /netbsd: cpu0: Begin traceback...
>Apr  7 20:26:24 mercy /netbsd: 0x10664e10: at panic+0x4c
>Apr  7 20:26:24 mercy /netbsd: 0x10664e50: at pmap_pte_spill+0x4c0
>Apr  7 20:26:24 mercy /netbsd: 0x10664e90: at trap+0x560
>Apr  7 20:26:24 mercy /netbsd: 0x10664f20: user DSI read trap @ 0xfdd201f4 by 0xfdfe6f5c: srr1=0x200d032
>Apr  7 20:26:24 mercy /netbsd:            r1=0xffffb040 cr=0x28002044 xer=0x20000000 ctr=0 dsisr=0x40000000
>Apr  7 20:26:24 mercy /netbsd: cpu0: End traceback...
>Apr  7 20:26:24 mercy /netbsd: dumpsys: TBD
>
>I've only seen this a few times.
>
>
>The second is my nemesis.  It is the dread "TSTILE" problem.  It doesn't
>crash.  It just grinds to a halt.  Pings to the machine keep working.
>This is the one that seems to be caused by apache load.  It appears to
>be NOT in the ethernet driver, or even in the networking code at all,
>in that my test case that reproduces this runs with wget fetching things from
>the apache on the same machine through localhost.  (Yes, the network code
>gets exercised, but not the external drivers.)  If I could get this one fixed,
>I could sleep.

If you can get into ddb when the machine hangs in tstile, run ps to find
the pids of the processes that are stuck in tstile, then get a stack trace
for each of them. Perhaps this will tell us which one started all of this.
If you have a LOCKDEBUG kernel, you can also run "show lock" with the lock
address to see the history of the lock. I am afraid, though, that what you
see are symptoms of a vm/pmap bug and/or some context switch/MI issues.
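
If the ps listing is captured to a file (for example from a serial
console log), picking out the stuck pids can be scripted. The sketch below
is only an illustration: the function name stuck_pids and the sample
listing are hypothetical, and it assumes each process line starts with a
numeric pid and shows the wait channel as a separate whitespace-delimited
field. Check the column layout of your own ps output before relying on it.

```python
def stuck_pids(ps_text, wchan="tstile"):
    """Return pids, in first-seen order, of lines whose fields include wchan.

    Assumption: the first field of each process line is a numeric pid and
    the wait channel appears as its own whitespace-separated field.
    """
    pids = []
    for line in ps_text.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and anything not starting with a pid.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        if wchan in fields[1:]:
            pid = int(fields[0])
            if pid not in pids:  # one entry per process, even with many LWPs
                pids.append(pid)
    return pids


# Made-up listing for illustration only; not real output from any system.
SAMPLE = """\
PID   LID  S  NAME     WAIT
412   1    3  httpd    tstile
412   2    3  httpd    tstile
377   1    3  httpd    select
9     1    3  ioflush  tstile
"""

if __name__ == "__main__":
    for pid in stuck_pids(SAMPLE):
        print(pid)
```

Each pid it prints is a candidate for a stack trace in ddb.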

christos
