Re: TSTILE, pmap, and bad analog numbers

To: port-macppc%NetBSD.org@localhost
Subject: Re: TSTILE, pmap, and bad analog numbers
From: Donald Lee <MacPPC2%c.icompute.com@localhost>
Date: Tue, 9 Jul 2013 17:45:31 -0500

>In article <p06240802ce008b8caed2@[71.39.101.51]>,
>Donald Lee  <MacPPC2%c.icompute.com@localhost> wrote:
>>I want to summarize my progress in the struggle with macppc 6.0.x.
>>
>>I capitulated to the instability of macppc and moved the bulk of my web
>>server load to an i386 machine (also running NetBSD 6.0.2).  This makes
>>my config more complex, but the i386 machine has not crashed on me yet,
>>and the PPC machine (production server) keeps waking me up in the middle
>>of the night when it falls over.
>>
>>I don't *know* that the offloading of the web traffic will stop the crashes,
>>but of all the crashes I've seen, the only common thread is that when the
>>crashes occur, there is heavy load on the web server at the time.
>>
>>I constructed a test case on this basis, that fails with the GENERIC
>>kernel, but I cannot seem to get a DEBUG/DIAGNOSTIC/LOCKDEBUG kernel
>>to (usefully) fail.
>>
>>In sum, I have 3 nasty bugs.
>>
>>The first is rare, and not a big concern:
>>Apr  7 20:26:24 mercy /netbsd: panic: pmap_pte_spill: victim p-pte
>>(0x1ffc3e0) has no pvo entry!
>>Apr  7 20:26:24 mercy /netbsd: cpu0: Begin traceback...
>>Apr  7 20:26:24 mercy /netbsd: 0x10664e10: at panic+0x4c
>>Apr  7 20:26:24 mercy /netbsd: 0x10664e50: at pmap_pte_spill+0x4c0
>>Apr  7 20:26:24 mercy /netbsd: 0x10664e90: at trap+0x560
>>Apr  7 20:26:24 mercy /netbsd: 0x10664f20: user DSI read trap @
>>0xfdd201f4 by 0xfdfe6f5c: srr1=0x200d032
>>Apr  7 20:26:24 mercy /netbsd:            r1=0xffffb040 cr=0x28002044
>>xer=0x20000000 ctr=0 dsisr=0x40000000
>>Apr  7 20:26:24 mercy /netbsd: cpu0: End traceback...
>>Apr  7 20:26:24 mercy /netbsd: dumpsys: TBD
>>
>>I've only seen this a few times.
>>
>>
>>The second is my nemesis.  It is the dreaded "TSTILE" problem.  It doesn't
>>crash.  It just grinds to a halt.  Pings to the machine keep working.
>>This is the one that seems to be caused by apache load.  It appears to
>>be NOT in the ethernet driver, or even in the networking code at all,
>>in that my test case that reproduces this runs with wget fetching things from
>>the apache on the same machine through localhost. (Yes, the network code
>>gets exercised, but not the external drivers.)  If I could get this one fixed,
>>I could sleep.
>
>If you can get into ddb when it tstiles, you can do ps to find the pids
>of the processes that are stuck in tstile and then get a stack trace
>for each of the processes that are in tstile. Perhaps this will tell us
>which one started all of this. If you have LOCKDEBUG you can also show lock
>with the lock address and find the history of the lock. I am afraid though
>that what you see are symptoms of either a vm/pmap bug and/or some context
>switch/MI issues.
>
>christos

I can get there, and do the ps.  There are dozens (hundreds?) of processes
hung on tstile.  The vast majority are httpd.

I'll do some stack traces next time it fails.  I can get a failure reliably
with a GENERIC kernel (whether I build it or use the release) in a few hours.
I'm working on a better test case that might fail with a debug kernel.

Comparing the stack traces might reveal some clues.  I presume that most
of them will be identical, so I can focus on the ones that are "different".
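
A minimal sketch of that comparison, assuming the per-process traces are
copied by hand out of ddb into a dict keyed by pid, and that each frame
looks like the "0x10664e50: at pmap_pte_spill+0x4c0" lines in the panic
traceback quoted above.  The names signature() and group_traces() are made up
for illustration and are not part of any NetBSD tool.  Frames are compared by
function name only, so two traces that differ only in offsets land in the
same group.

```python
import re
from collections import defaultdict

# Assumed frame format, modeled on the panic traceback in this thread:
#   "0x10664e50: at pmap_pte_spill+0x4c0"
FRAME_RE = re.compile(r"at\s+([A-Za-z_][\w.]*)\+0x[0-9a-fA-F]+")


def signature(trace_text):
    """Return the tuple of function names in a trace, ignoring addresses and offsets."""
    return tuple(FRAME_RE.findall(trace_text))


def group_traces(traces):
    """Group pids by call chain.

    `traces` maps pid -> raw trace text.  The result is a list of
    (signature, [pids]) pairs, rarest signature first, so the
    "different" traces come out at the top.
    """
    groups = defaultdict(list)
    for pid, text in traces.items():
        groups[signature(text)].append(pid)
    return sorted(groups.items(), key=lambda item: len(item[1]))
```

Printing the first few entries of group_traces(traces) should show the
outliers; the large group at the end is presumably the crowd of httpd
processes all waiting on the same lock.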

I have some hope that the "bad numbers" and the tstile bugs may have
the same cause.  It's not likely, but I can hope. ;->

FWIW, my i386 6.0.2 web server is completely stable so far, and is now
serving (close to) the same load that my macppc server was serving, so this
looks like it is a macppc problem.

Thank you for your help,

-dgl-

Follow-Ups:
- Re: TSTILE, pmap, and bad analog numbers
  - From: Manuel Bouyer

References:
- TSTILE, pmap, and bad analog numbers
  - From: Donald Lee
- Re: TSTILE, pmap, and bad analog numbers
  - From: Christos Zoulas
