Re: TSTILE, pmap, and bad analog numbers
>In article <email@example.com>,
>Donald Lee <MacPPC2%c.icompute.com@localhost> wrote:
>>I want to summarize my progress in the struggle with macppc 6.0.x.
>>I capitulated to the instability of macppc and moved the bulk of my web
>>server load to an i386 machine (also running NetBSD 6.0.2). This makes
>>my config more complex, but the i386 machine has not crashed on me yet,
>>and the PPC machine (production server) keeps waking me up in the middle
>>of the night when it falls over.
>>I don't *know* that the offloading of the web traffic will stop the crashes,
>>but of all the crashes I've seen, the only common thread is that when the
>>crashes occur, there is heavy load on the web server at the time.
>>I constructed a test case on this basis, that fails with the GENERIC
>>kernel, but I cannot seem to get a DEBUG/DIAGNOSTIC/LOCKDEBUG kernel
>>to (usefully) fail.
>>In sum, I have 3 nasty bugs.
>>The first is rare, and not a big concern:
>>Apr 7 20:26:24 mercy /netbsd: panic: pmap_pte_spill: victim p-pte
>>(0x1ffc3e0) has no pvo entry!
>>Apr 7 20:26:24 mercy /netbsd: cpu0: Begin traceback...
>>Apr 7 20:26:24 mercy /netbsd: 0x10664e10: at panic+0x4c
>>Apr 7 20:26:24 mercy /netbsd: 0x10664e50: at pmap_pte_spill+0x4c0
>>Apr 7 20:26:24 mercy /netbsd: 0x10664e90: at trap+0x560
>>Apr 7 20:26:24 mercy /netbsd: 0x10664f20: user DSI read trap @
>>0xfdd201f4 by 0xfdfe6f5c: srr1=0x200d032
>>Apr 7 20:26:24 mercy /netbsd: r1=0xffffb040 cr=0x28002044
>>xer=0x20000000 ctr=0 dsisr=0x40000000
>>Apr 7 20:26:24 mercy /netbsd: cpu0: End traceback...
>>Apr 7 20:26:24 mercy /netbsd: dumpsys: TBD
>>I've only seen this a few times.
>>The second is my nemesis. It is the dread "TSTILE" problem. It doesn't
>>crash. It just grinds to a halt. Pings to the machine keep working.
>>This is the one that seems to be caused by apache load. It appears to
>>be NOT in the ethernet driver, or even in the networking code at all,
>>in that my test case that reproduces this runs with wget fetching things from
>>the apache on the same machine through localhost. (Yes, the network code
>>gets exercised, but not the external drivers.) If I could get this one fixed,
>>I could sleep.
>If you can get into ddb when it tstiles, you can do ps to find the pids
>of the processes that are stuck in tstile and then get a stack trace
>for each of the processes that are in tstile. Perhaps this will tell us
>which one started all of this. If you have LOCKDEBUG you can also show lock
>with the lock address and find the history of the lock. I am afraid though
>that what you see are symptoms of either a vm/pmap bug and/or some context
I can get there, and do the ps. There are dozens (hundreds?) of processes
hung on tstile. The vast majority are httpd.
I'll do some stack traces next time it fails. I can get a failure reliably
with a GENERIC kernel (whether I build it or use the release) in a few hours.
I'm working on a better test case that might fail with a debug kernel.
Comparing the stack traces might reveal some clues. I presume that most
of them will be identical, so I can focus on the ones that are "different".
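A rough sketch of how that comparison could be automated, assuming the
ddb output is captured to a text file (serial console log, say). The
frame lines are matched in the same "0x...: at func+0x..." form as the
panic traceback quoted above. The "=== pid N" marker is NOT something
ddb prints; it is a hypothetical separator typed in by hand before each
trace, and the function names in the sample are invented placeholders.

```python
import re
from collections import defaultdict

# Frame lines in the form seen in the quoted panic traceback,
# e.g. "0x10664e10: at pmap_pte_spill+0x4c0". Only the function name is kept.
FRAME_RE = re.compile(
    r"0x[0-9a-fA-F]+:\s+at\s+([A-Za-z_][\w.]*)(?:\+0x[0-9a-fA-F]+)?"
)
# ASSUMPTION: a hand-added marker before each trace; ddb does not emit this.
HEADER_RE = re.compile(r"^===\s*pid\s+(\d+)")


def parse_traces(text):
    """Return {pid: [function names, innermost first]} from a capture."""
    traces = {}
    pid = None
    for line in text.splitlines():
        header = HEADER_RE.match(line.strip())
        if header:
            pid = int(header.group(1))
            traces[pid] = []
            continue
        frame = FRAME_RE.search(line)
        if frame and pid is not None:
            traces[pid].append(frame.group(1))
    return traces


def group_traces(traces):
    """Group pids by identical call chain, rarest chain first."""
    groups = defaultdict(list)
    for pid, frames in traces.items():
        groups[tuple(frames)].append(pid)
    return sorted(groups.items(), key=lambda kv: (len(kv[1]), kv[0]))


# Invented sample data, for illustration only.
SAMPLE = """\
=== pid 101
0x1000: at func_a+0x10
0x1010: at func_b+0x20
=== pid 102
0x2000: at func_a+0x14
0x2010: at func_b+0x20
=== pid 103
0x3000: at func_c+0x8
0x3010: at func_b+0x20
"""

if __name__ == "__main__":
    for frames, pids in group_traces(parse_traces(SAMPLE)):
        print(len(pids), "process(es):", " <- ".join(frames), "pids:", pids)
```

Offsets and frame addresses are dropped on purpose, so two processes
blocked in the same call chain land in the same group even if they
stopped at slightly different instructions. The chains shared by the
fewest pids come out first, which are the "different" ones.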
I have some hope that the "bad numbers" and the tstile bugs may have
the same cause. It's not likely, but I can hope. ;->
FWIW, my i386 6.0.2 web server is completely stable so far, and is now
serving (close to) the same load that my macppc server was serving, so this
looks like it is a macppc problem.
Thank you for your help,