NetBSD-Bugs archive
Re: port-alpha/38335 (kernel freeze on alpha MP system)
The following reply was made to PR port-alpha/38335; it has been noted by GNATS.
From: Jarle Greipsland <jarle%uninett.no@localhost>
To: gnats-bugs%NetBSD.org@localhost, mhitch%lightning.msu.montana.edu@localhost
Cc: port-alpha-maintainer%netbsd.org@localhost,
gnats-admin%netbsd.org@localhost,
netbsd-bugs%netbsd.org@localhost
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Sun, 18 Oct 2009 17:20:33 +0200 (CEST)
"Michael L. Hitch" <mhitch%lightning.msu.montana.edu@localhost> writes:
> Yes - that patch just attempts to work around a duplicate pool cache
> entry. I have not been able to replicate this on my ES40, and don't have
> remote power capability on my CS20 which makes it harder to debug when
> things can hang hard.
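[As a toy illustration of the duplicate-entry failure mode described in the quote above — this is not NetBSD code, and all names in it are invented — a free list that ever accepts the same object twice will eventually hand out an entry that is still marked in-use, which is exactly the condition the "Whoa!" check catches:]

```python
# Toy sketch (invented names, not NetBSD's pool_cache implementation):
# if a race or bug inserts the same object on the free list twice, two
# consumers get the same "free" entry, and a later get() sees an entry
# that is still in use -- the analogue of
# "Whoa! pool_cache_get returned an in-use entry!"

class Entry:
    def __init__(self):
        self.in_use = False

class ToyPool:
    def __init__(self):
        self.free = []            # LIFO free list, like a cache group

    def put(self, obj):
        self.free.append(obj)     # no double-free check: a duplicate
                                  # insertion goes unnoticed here

    def get(self):
        obj = self.free.pop()
        if obj.in_use:            # the sanity check that fires
            raise RuntimeError("pool returned an in-use entry")
        obj.in_use = True
        return obj

if __name__ == "__main__":
    e = Entry()
    pool = ToyPool()
    pool.put(e)
    pool.put(e)                   # duplicate entry, as a race could produce
    a = pool.get()                # first get: fine, marks e in-use
    try:
        pool.get()                # second get returns the same entry
    except RuntimeError as err:
        print(err)
```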
Some more results:
I ran a few more 'build.sh -j4' jobs without changing anything,
and the kernel continued to emit spurious "Whoa! pool_cache_get
returned an in-use entry!" messages.
Then, on Oct 14, I synced my source tree to -current again (this
might have been a stupid move on my part...), and built and
booted a new GENERIC.MP kernel. About an hour into a
'build.sh -j4' job, it panicked with a new (to me) message:
panic: pool_get: pmaptlb: page empty
Stopped in pid 25249.1 (cc1plus) at netbsd:cpu_Debugger+0x4: ret zero,(ra)
db{0}> tr
cpu_Debugger() at netbsd:cpu_Debugger+0x4
panic() at netbsd:panic+0x268
pool_get() at netbsd:pool_get+0x408
pool_cache_get_slow() at netbsd:pool_cache_get_slow+0x1ac
pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x168
pmap_tlb_shootdown() at netbsd:pmap_tlb_shootdown+0xe8
pmap_kenter_pa() at netbsd:pmap_kenter_pa+0xe8
uvm_km_alloc() at netbsd:uvm_km_alloc+0x170
kmem_backend_alloc() at netbsd:kmem_backend_alloc+0x48
vmem_xalloc() at netbsd:vmem_xalloc+0x318
vmem_alloc() at netbsd:vmem_alloc+0x90
kmem_poolpage_alloc() at netbsd:kmem_poolpage_alloc+0x48
pool_grow() at netbsd:pool_grow+0x64
pool_get() at netbsd:pool_get+0x5c
pool_cache_get_slow() at netbsd:pool_cache_get_slow+0x1ac
pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x168
kmem_alloc() at netbsd:kmem_alloc+0xc8
amap_extend() at netbsd:amap_extend+0x1b8
uvm_mapent_trymerge() at netbsd:uvm_mapent_trymerge+0x3dc
amap_copy() at netbsd:amap_copy+0x410
uvm_fault_internal() at netbsd:uvm_fault_internal+0xbb0
trap() at netbsd:trap+0x548
XentMM() at netbsd:XentMM+0x20
--- memory management fault (from ipl 0) ---
--- user mode ---
db{0}> show reg
v0 0x6
t0 0x1
t1 0x1
t2 0xfffffc003ff48000
t3 0
t4 0
t5 0xfffffc0000b46a2d __func__.21238+0x91c
t6 0
t7 0
s0 0xfffffc0000c378e0 msgbufenabled
s1 0x104
s2 0xfffffc0000c350a8 db_onpanic
s3 0xfffffc0000c6b8e8 pmap_tlb_shootdown_job_cache+0x1a8
s4 0
s5 0x2
s6 0
a0 0
a1 0xfffffd01fc0003f8
a2 0
a3 0x8
a4 0x3
a5 0xfffffc0000000008
t8 0
t9 0x8
t10 0xfffffc0000c941a8 uvmexp
t11 0
ra 0xfffffc000080c5a8 panic+0x268
t12 0xfffffc00003eb590 cpu_Debugger
at 0x1
gp 0xfffffc0000c30928 __link_set_prop_linkpools_sym__link__prop_array_pool+0x8008
sp 0xfffffe0013c7f5e8
pc 0xfffffc00003eb594 cpu_Debugger+0x4
ps 0
ai 0
pv 0xfffffc00003eb590 cpu_Debugger
netbsd:cpu_Debugger+0x4: ret zero,(ra)
db{0}> ps
PID LID S CPU FLAGS STRUCT LWP * NAME WAIT
21308> 1 7 1 40004 fffffc0024916c00 cc1plus
9441 1 3 0 84 fffffc0024916000 alpha--netbsd-c+ wait
21710 1 3 0 84 fffffc0024813000 sh wait
26347 1 2 0 40004 fffffc002de34800 cc1plus
18814 1 3 0 84 fffffc0024813400 alpha--netbsd-c+ wait
24418 1 3 0 84 fffffc002daf2800 sh wait
25249> 1 7 0 4 fffffc002d29e400 cc1plus
19917 1 3 0 84 fffffc0024917c00 alpha--netbsd-c+ wait
2263 1 3 1 84 fffffc002daf3000 sh wait
16091 1 2 0 40004 fffffc002caa9800 cc1plus
19836 1 3 0 84 fffffc003edac400 alpha--netbsd-c+ wait
7537 1 3 1 84 fffffc00233e9400 sh wait
18485 1 3 0 84 fffffc0024d52000 nbmake select
8826 1 3 0 84 fffffc002daf2000 sh wait
28933 1 3 1 84 fffffc002cab8000 nbmake select
15439 1 3 1 84 fffffc0024812000 sh wait
22920 1 3 1 84 fffffc0024917800 nbmake select
16006 1 3 1 84 fffffc0024d53400 sh wait
22712 1 3 1 84 fffffc0024d52800 nbmake select
28515 1 3 1 84 fffffc002caa9c00 sh wait
24901 1 3 1 84 fffffc00233e8800 nbmake select
22708 1 3 1 84 fffffc0024917000 pickup kqueue
20418 1 3 0 84 fffffc002d29ec00 bash ttyraw
17752 1 3 0 84 fffffc002caa9400 xterm select
29471 1 3 1 84 fffffc003f678000 csh pause
8486 1 3 0 84 fffffc002cab8c00 sshd select
23567 1 3 1 84 fffffc002d29f400 sshd netio
4106 1 3 0 84 fffffc00233e8000 sh wait
3271 1 3 1 84 fffffc00233e8c00 nbmake select
3779 1 3 0 84 fffffc0024916400 sh wait
3702 1 3 1 84 fffffc00233e9000 nbmake select
3935 1 3 1 84 fffffc003ca9e400 sh wait
3603 1 3 1 84 fffffc003ca9f000 nbmake select
750 1 3 0 84 fffffc003ca9e800 sh wait
730 1 3 1 84 fffffc003ca9ec00 tail kqueue
733 1 3 0 84 fffffc003f678400 tcsh pause
507 1 3 0 84 fffffc003ca9f800 tcsh pause
40 1 3 1 84 fffffc003ca9fc00 tcsh pause
41 1 3 1 84 fffffc003edad000 xterm select
459 1 3 0 84 fffffc003f9fe800 csh pause
473 1 3 1 84 fffffc003edac800 sshd select
528 1 3 0 84 fffffc003f960000 sshd netio
501 1 3 1 84 fffffc003f679800 qmgr kqueue
350 1 3 1 84 fffffc003f961c00 getty ttyraw
484 1 3 1 84 fffffc003f679400 cron nanoslp
498 1 3 1 84 fffffc003edac000 inetd kqueue
468 1 3 1 84 fffffc003edacc00 master kqueue
300 1 3 0 84 fffffc003edad400 sshd select
301 1 3 0 84 fffffc003edad800 ntpd pause
183 1 3 0 84 fffffc003edadc00 mount_mfs mfsidl
152 1 3 1 84 fffffc003f679000 syslogd kqueue
1 1 3 0 84 fffffc003f961000 init wait
0 39 3 0 204 fffffc003f678800 physiod physiod
0 38 3 1 204 fffffc003f960400 vmem_rehash vmem_rehash
0 37 3 0 204 fffffc003f960800 aiodoned aiodoned
0 36 3 1 204 fffffc003f9fe400 ioflush syncer
0 35 3 1 204 fffffc003f9fe000 pgdaemon pgdaemon
0 34 3 0 204 fffffc003f961400 cryptoret crypto_wait
0 33 3 0 204 fffffc003f961800 atapibus0 sccomp
0 31 3 0 204 fffffc003f960c00 unpgc unpgc
0 22 3 1 204 fffffc003f9fec00 atabus1 atath
0 21 3 0 204 fffffc003f9ff000 atabus0 atath
0 20 3 1 204 fffffc003f9ff400 scsibus0 sccomp
0 19 3 1 204 fffffc003f9ff800 xcall/1 xcall
0 18 1 1 204 fffffc003f9ffc00 softser/1
0 17 1 1 204 fffffc003fcf8000 softclk/1
0 16 1 1 204 fffffc003fcf8400 softbio/1
0 15 1 1 204 fffffc003fcf8800 softnet/1
0 14 1 1 205 fffffc003fcf8c00 idle/1
0 13 3 1 204 fffffc003fcf9000 pmfsuspend pmfsuspend
0 12 3 0 204 fffffc003fcf9400 pmfevent pmfevent
0 11 3 1 204 fffffc003fcf9800 nfssilly nfssilly
0 10 3 0 204 fffffc003fcf9c00 cachegc cachegc
0 9 3 1 204 fffffc003fe28000 vrele vrele
0 8 3 1 204 fffffc003fe28400 modunload modunload
0 7 3 0 204 fffffc003fe28800 xcall/0 xcall
0 6 1 0 204 fffffc003fe28c00 softser/0
0 5 1 0 204 fffffc003fe29000 softclk/0
0 4 1 0 204 fffffc003fe29400 softbio/0
0 3 1 0 204 fffffc003fe29800 softnet/0
0 2 1 0 205 fffffc003fe29c00 idle/0
0 1 3 0 204 fffffc0000c16660 swapper schedule
db{0}> mach cpu 1
Using CPU 1
db{0}> tr
CPU 0: fatal kernel trap:
CPU 0 trap entry = 0x2 (memory management fault)
CPU 0 a0 = 0xffffffffffffffd9
CPU 0 a1 = 0x1
CPU 0 a2 = 0x0
CPU 0 pc = 0xfffffc00003ee944
CPU 0 ra = 0xfffffc00003e8104
CPU 0 pv = 0xfffffc00003ee890
CPU 0 curlwp = 0xfffffc002d29e400
CPU 0 pid = 25249, comm = cc1plus
Caught exception in ddb.
db{0}> show reg
v0 0x1
t0 0x5b93a000
t1 0x2000
t2 0
t3 0x2000
t4 0xf
t5 0x16055e050
t6 0x2
t7 0
s0 0x2
s1 0x5b93a000
s2 0x161214008
s3 0x161214000
s4 0x16055e1d0
s5 0x16055e040
s6 0x50
a0 0x5b93a000
a1 0x48
a2 0
a3 0
a4 0x5
a5 0x5
t8 0x160814400
t9 0x3
t10 0x120582e58
t11 0x120581c28
ra 0x16063d168
t12 0x16064db44
at 0x2
gp 0x1606fb790
sp 0x1
pc 0x16064db70
ps 0x8
ai 0x120581c28
pv 0x16064db44
0x16064db70: and t1,#0xff,t4
I also got a kernel core dump from this one. Please give some
instructions if you want me to dig up some data from the dump.
Later on, after booting the same kernel, I got some more
"Whoa! pool_cache_get returned an in-use entry!" messages. Also
at some point (I don't think it correlated in time with any of
the kernel messages) the system hung hard again. I pressed the
halt button, and entered DDB:
halted CPU 0
CPU 1 is not halted
halt code = 1
operator initiated halt
PC = fffffc00005f6320
P00>>>cont
continuing CPU 0
CP - RESTORE_TERM routine to be called
panic: user requested console halt
Stopped in pid 0.35 (system) at netbsd:cpu_Debugger+0x4: ret zero,(ra)
db{0}> trace
cpu_Debugger() at netbsd:cpu_Debugger+0x4
panic() at netbsd:panic+0x268
console_restart() at netbsd:console_restart+0x78
XentRestart() at netbsd:XentRestart+0x90
--- console restart (from ipl 2) ---
mutex_spin_enter() at netbsd:mutex_spin_enter+0x200
pool_cache_put_slow() at netbsd:pool_cache_put_slow+0x138
pool_cache_put_paddr() at netbsd:pool_cache_put_paddr+0x168
pmap_do_tlb_shootdown() at netbsd:pmap_do_tlb_shootdown+0x178
alpha_ipi_process() at netbsd:alpha_ipi_process+0xb8
interrupt() at netbsd:interrupt+0x88
XentInt() at netbsd:XentInt+0x1c
--- interrupt (from ipl 4) ---
mutex_exit() at netbsd:mutex_exit+0x10
pool_cache_invalidate() at netbsd:pool_cache_invalidate+0x6c
pool_reclaim() at netbsd:pool_reclaim+0x68
pool_drain_end() at netbsd:pool_drain_end+0x44
uvm_pageout() at netbsd:uvm_pageout+0x740
exception_return() at netbsd:exception_return
--- root of call graph ---
db{0}> show reg
v0 0xfffffc0000c0de28 cpu_info_primary
t0 0x1
t1 0x1
t2 0xfffffc003ff48000
t3 0
t4 0
t5 0xfffffc0000b46a2d __func__.21238+0x91c
t6 0xfffffc002e386888
t7 0
s0 0xfffffc0000c378e0 msgbufenabled
s1 0x104
s2 0xfffffc0000c350a8 db_onpanic
s3 0xfffffc0000c8ef18 panicstr
s4 0
s5 0xfffffc0000808e48 pool_cache_put_slow+0x138
s6 0
a0 0
a1 0xfffffd01fc0003f8
a2 0
a3 0x8
a4 0x3
a5 0xfffffc0000000008
t8 0xb
t9 0x8
t10 0x7fff
t11 0
ra 0xfffffc000080c5a8 panic+0x268
t12 0xfffffc00003eb590 cpu_Debugger
at 0x160328000
gp 0xfffffc0000c30928 __link_set_prop_linkpools_sym__link__prop_array_pool+0x8008
sp 0xfffffe00130e5908
pc 0xfffffc00003eb594 cpu_Debugger+0x4
ps 0x7
ai 0
pv 0xfffffc00003eb590 cpu_Debugger
netbsd:cpu_Debugger+0x4: ret zero,(ra)
db{0}> mach cpu 1
Using CPU 1
db{0}> trace
CPU 0: fatal kernel trap:
CPU 0 trap entry = 0x2 (memory management fault)
CPU 0 a0 = 0xffffffffffffffd9
CPU 0 a1 = 0x1
CPU 0 a2 = 0x0
CPU 0 pc = 0xfffffc00003ee944
CPU 0 ra = 0xfffffc00003e8104
CPU 0 pv = 0xfffffc00003ee890
CPU 0 curlwp = 0xfffffc003f9fe800
CPU 0 pid = 0, comm = system
Caught exception in ddb.
db{0}> c
cpu1: shutting down...
dumping to dev 8,1 offset 2097597
dump i/o error
sd0(esiop0:0:0:0): polling command not done
panic: scsipi_execute_xs
Stopped in pid 0.35 (system) at netbsd:cpu_Debugger+0x4: ret zero,(ra)
db{0}> reboot 4
> I'm wondering if using IPL_HIGH for the mutex changes anything:
>
> Index: sys/arch/alpha/alpha/pmap.c
> ===================================================================
> RCS file: /cvsroot/src/sys/arch/alpha/alpha/pmap.c,v
> retrieving revision 1.243
> diff -u -p -r1.243 pmap.c
> --- sys/arch/alpha/alpha/pmap.c 4 Oct 2009 17:00:31 -0000 1.243
> +++ sys/arch/alpha/alpha/pmap.c 11 Oct 2009 18:01:59 -0000
> @@ -962,7 +962,7 @@ pmap_bootstrap(paddr_t ptaddr, u_int max
> for (i = 0; i < ALPHA_MAXPROCS; i++) {
> TAILQ_INIT(&pmap_tlb_shootdown_q[i].pq_head);
> mutex_init(&pmap_tlb_shootdown_q[i].pq_lock, MUTEX_DEFAULT,
> - IPL_SCHED);
> + IPL_HIGH);
> }
> #endif
I then applied this patch. I tried several 'build.sh -j4' jobs,
but all of them seemed to abort with some host tool tripping up:
/usr/tools/bin/nbgroff: grotty: Illegal instruction (core dumped)
/usr/tools/bin/nbgroff: troff: Illegal instruction (core dumped)
Then on one occasion, the kernel started to repeatedly spew
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003f9ee00
messages to the console. The pj value was identical in all the
messages, but the ci_index value varied (0 or 1).
Do you still think I should try to increase the IPL level of the
pool_cache entry as specified in your message?
-jarle