NetBSD-Bugs archive


Re: port-alpha/38335 (kernel freeze on alpha MP system)



"Michael L. Hitch" <mhitch%lightning.msu.montana.edu@localhost> writes:
>     Yes - that patch just attempts to work around a duplicate pool cache 
>  entry.  I have not been able to replicate this on my ES40, and don't have 
>  remote power capability on my CS20 which makes it harder to debug when 
>  things can hang hard.
Some more results:

I ran a few more 'build.sh -j4' jobs without changing anything,
and the kernel continued to emit spurious "Whoa!  pool_cache_get
returned an in-use entry!"  messages.
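
(For context: as I understand it, that message comes from a sanity
check along the lines of the sketch below.  This is a minimal
userland illustration, not the actual debug patch; the pj_inuse flag
and the job_get()/job_put() helpers are invented names.)

/*
 * Userland sketch of the duplicate-entry sanity check (NOT the real
 * kernel patch): each TLB shootdown job carries an in-use flag, set
 * when the job is taken and cleared when it is returned.  Taking a
 * job whose flag is still set means the pool cache handed out a
 * duplicate entry, which is what the "Whoa!" message reports.
 */
#include <stdbool.h>
#include <stdio.h>

struct job {
	bool pj_inuse;		/* is this job currently outstanding? */
};

static struct job *
job_get(struct job *pj, int ci_index)
{
	if (pj->pj_inuse)
		printf("Whoa!  pool_cache_get returned an in-use entry! "
		    "ci_index %d pj %p\n", ci_index, (void *)pj);
	pj->pj_inuse = true;
	return pj;
}

static void
job_put(struct job *pj)
{
	pj->pj_inuse = false;
}

int
main(void)
{
	struct job job = { .pj_inuse = false };

	(void)job_get(&job, 0);		/* first get: fine */
	(void)job_get(&job, 1);		/* duplicate get: prints the message */
	job_put(&job);
	return 0;
}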

Then, on Oct 14, I synced my source tree to -current again (this
might have been a stupid move on my part...), and built and
booted a new GENERIC.MP kernel.  About an hour into a
'build.sh -j4' job, it panicked with a new (to me) message:

panic: pool_get: pmaptlb: page empty

Stopped in pid 25249.1 (cc1plus) at     netbsd:cpu_Debugger+0x4:        ret     zero,(ra)

db{0}> tr
cpu_Debugger() at netbsd:cpu_Debugger+0x4
panic() at netbsd:panic+0x268
pool_get() at netbsd:pool_get+0x408
pool_cache_get_slow() at netbsd:pool_cache_get_slow+0x1ac
pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x168
pmap_tlb_shootdown() at netbsd:pmap_tlb_shootdown+0xe8
pmap_kenter_pa() at netbsd:pmap_kenter_pa+0xe8
uvm_km_alloc() at netbsd:uvm_km_alloc+0x170
kmem_backend_alloc() at netbsd:kmem_backend_alloc+0x48
vmem_xalloc() at netbsd:vmem_xalloc+0x318
vmem_alloc() at netbsd:vmem_alloc+0x90
kmem_poolpage_alloc() at netbsd:kmem_poolpage_alloc+0x48
pool_grow() at netbsd:pool_grow+0x64
pool_get() at netbsd:pool_get+0x5c
pool_cache_get_slow() at netbsd:pool_cache_get_slow+0x1ac
pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x168
kmem_alloc() at netbsd:kmem_alloc+0xc8
amap_extend() at netbsd:amap_extend+0x1b8
uvm_mapent_trymerge() at netbsd:uvm_mapent_trymerge+0x3dc
amap_copy() at netbsd:amap_copy+0x410
uvm_fault_internal() at netbsd:uvm_fault_internal+0xbb0
trap() at netbsd:trap+0x548
XentMM() at netbsd:XentMM+0x20
--- memory management fault (from ipl 0) ---
--- user mode ---
db{0}> show reg
v0          0x6
t0          0x1
t1          0x1
t2          0xfffffc003ff48000
t3          0
t4          0
t5          0xfffffc0000b46a2d  __func__.21238+0x91c
t6          0
t7          0
s0          0xfffffc0000c378e0  msgbufenabled
s1          0x104
s2          0xfffffc0000c350a8  db_onpanic
s3          0xfffffc0000c6b8e8  pmap_tlb_shootdown_job_cache+0x1a8
s4          0
s5          0x2
s6          0
a0          0
a1          0xfffffd01fc0003f8
a2          0
a3          0x8
a4          0x3
a5          0xfffffc0000000008
t8          0
t9          0x8
t10         0xfffffc0000c941a8  uvmexp
t11         0
ra          0xfffffc000080c5a8  panic+0x268
t12         0xfffffc00003eb590  cpu_Debugger
at          0x1
gp          0xfffffc0000c30928  __link_set_prop_linkpools_sym__link__prop_array_pool+0x8008
sp          0xfffffe0013c7f5e8
pc          0xfffffc00003eb594  cpu_Debugger+0x4
ps          0
ai          0
pv          0xfffffc00003eb590  cpu_Debugger
netbsd:cpu_Debugger+0x4:        ret     zero,(ra)
db{0}> ps
PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
21308>   1 7   1     40004   fffffc0024916c00            cc1plus
9441     1 3   0        84   fffffc0024916000   alpha--netbsd-c+ wait
21710    1 3   0        84   fffffc0024813000                 sh wait
26347    1 2   0     40004   fffffc002de34800            cc1plus
18814    1 3   0        84   fffffc0024813400   alpha--netbsd-c+ wait
24418    1 3   0        84   fffffc002daf2800                 sh wait
25249>   1 7   0         4   fffffc002d29e400            cc1plus
19917    1 3   0        84   fffffc0024917c00   alpha--netbsd-c+ wait
2263     1 3   1        84   fffffc002daf3000                 sh wait
16091    1 2   0     40004   fffffc002caa9800            cc1plus
19836    1 3   0        84   fffffc003edac400   alpha--netbsd-c+ wait
7537     1 3   1        84   fffffc00233e9400                 sh wait
18485    1 3   0        84   fffffc0024d52000             nbmake select
8826     1 3   0        84   fffffc002daf2000                 sh wait
28933    1 3   1        84   fffffc002cab8000             nbmake select
15439    1 3   1        84   fffffc0024812000                 sh wait
22920    1 3   1        84   fffffc0024917800             nbmake select
16006    1 3   1        84   fffffc0024d53400                 sh wait
22712    1 3   1        84   fffffc0024d52800             nbmake select
28515    1 3   1        84   fffffc002caa9c00                 sh wait
24901    1 3   1        84   fffffc00233e8800             nbmake select
22708    1 3   1        84   fffffc0024917000             pickup kqueue
20418    1 3   0        84   fffffc002d29ec00               bash ttyraw
17752    1 3   0        84   fffffc002caa9400              xterm select
29471    1 3   1        84   fffffc003f678000                csh pause
8486     1 3   0        84   fffffc002cab8c00               sshd select
23567    1 3   1        84   fffffc002d29f400               sshd netio
4106     1 3   0        84   fffffc00233e8000                 sh wait
3271     1 3   1        84   fffffc00233e8c00             nbmake select
3779     1 3   0        84   fffffc0024916400                 sh wait
3702     1 3   1        84   fffffc00233e9000             nbmake select
3935     1 3   1        84   fffffc003ca9e400                 sh wait
3603     1 3   1        84   fffffc003ca9f000             nbmake select
750      1 3   0        84   fffffc003ca9e800                 sh wait
730      1 3   1        84   fffffc003ca9ec00               tail kqueue
733      1 3   0        84   fffffc003f678400               tcsh pause
507      1 3   0        84   fffffc003ca9f800               tcsh pause
40       1 3   1        84   fffffc003ca9fc00               tcsh pause
41       1 3   1        84   fffffc003edad000              xterm select
459      1 3   0        84   fffffc003f9fe800                csh pause
473      1 3   1        84   fffffc003edac800               sshd select
528      1 3   0        84   fffffc003f960000               sshd netio
501      1 3   1        84   fffffc003f679800               qmgr kqueue
350      1 3   1        84   fffffc003f961c00              getty ttyraw
484      1 3   1        84   fffffc003f679400               cron nanoslp
498      1 3   1        84   fffffc003edac000              inetd kqueue
468      1 3   1        84   fffffc003edacc00             master kqueue
300      1 3   0        84   fffffc003edad400               sshd select
301      1 3   0        84   fffffc003edad800               ntpd pause
183      1 3   0        84   fffffc003edadc00          mount_mfs mfsidl
152      1 3   1        84   fffffc003f679000            syslogd kqueue
1        1 3   0        84   fffffc003f961000               init wait
0       39 3   0       204   fffffc003f678800            physiod physiod
0       38 3   1       204   fffffc003f960400        vmem_rehash vmem_rehash
0       37 3   0       204   fffffc003f960800           aiodoned aiodoned
0       36 3   1       204   fffffc003f9fe400            ioflush syncer
0       35 3   1       204   fffffc003f9fe000           pgdaemon pgdaemon
0       34 3   0       204   fffffc003f961400          cryptoret crypto_wait
0       33 3   0       204   fffffc003f961800          atapibus0 sccomp
0       31 3   0       204   fffffc003f960c00              unpgc unpgc
0       22 3   1       204   fffffc003f9fec00            atabus1 atath
0       21 3   0       204   fffffc003f9ff000            atabus0 atath
0       20 3   1       204   fffffc003f9ff400           scsibus0 sccomp
0       19 3   1       204   fffffc003f9ff800            xcall/1 xcall
0       18 1   1       204   fffffc003f9ffc00          softser/1
0       17 1   1       204   fffffc003fcf8000          softclk/1
0       16 1   1       204   fffffc003fcf8400          softbio/1
0       15 1   1       204   fffffc003fcf8800          softnet/1
0       14 1   1       205   fffffc003fcf8c00             idle/1
0       13 3   1       204   fffffc003fcf9000         pmfsuspend pmfsuspend
0       12 3   0       204   fffffc003fcf9400           pmfevent pmfevent
0       11 3   1       204   fffffc003fcf9800           nfssilly nfssilly
0       10 3   0       204   fffffc003fcf9c00            cachegc cachegc
0        9 3   1       204   fffffc003fe28000              vrele vrele
0        8 3   1       204   fffffc003fe28400          modunload modunload
0        7 3   0       204   fffffc003fe28800            xcall/0 xcall
0        6 1   0       204   fffffc003fe28c00          softser/0
0        5 1   0       204   fffffc003fe29000          softclk/0
0        4 1   0       204   fffffc003fe29400          softbio/0
0        3 1   0       204   fffffc003fe29800          softnet/0
0        2 1   0       205   fffffc003fe29c00             idle/0
0        1 3   0       204   fffffc0000c16660            swapper schedule
db{0}> mach cpu 1
Using CPU 1
db{0}> tr

CPU 0: fatal kernel trap:

CPU 0    trap entry = 0x2 (memory management fault)
CPU 0    a0         = 0xffffffffffffffd9
CPU 0    a1         = 0x1
CPU 0    a2         = 0x0
CPU 0    pc         = 0xfffffc00003ee944
CPU 0    ra         = 0xfffffc00003e8104
CPU 0    pv         = 0xfffffc00003ee890
CPU 0    curlwp     = 0xfffffc002d29e400
CPU 0        pid = 25249, comm = cc1plus

Caught exception in ddb.
db{0}> show reg
v0          0x1
t0          0x5b93a000
t1          0x2000
t2          0
t3          0x2000
t4          0xf
t5          0x16055e050
t6          0x2
t7          0
s0          0x2
s1          0x5b93a000
s2          0x161214008
s3          0x161214000
s4          0x16055e1d0
s5          0x16055e040
s6          0x50
a0          0x5b93a000
a1          0x48
a2          0
a3          0
a4          0x5
a5          0x5
t8          0x160814400
t9          0x3
t10         0x120582e58
t11         0x120581c28
ra          0x16063d168
t12         0x16064db44
at          0x2
gp          0x1606fb790
sp          0x1
pc          0x16064db70
ps          0x8
ai          0x120581c28
pv          0x16064db44
0x16064db70:    and     t1,#0xff,t4

I also got a kernel core dump from this one.  Please send
instructions if you want me to dig some data out of the dump.
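
In the meantime, here is roughly how I would pull a variable out of
the dump with libkvm, in case that is useful.  This is only a sketch:
the /netbsd and /var/crash paths are guesses for my setup, and
panicstr (which also shows up in the register dump above) is just an
example symbol to start from.

/*
 * Sketch only: pull the panic string out of a saved crash dump with
 * libkvm.  The kernel and dump paths are guesses, and panicstr is
 * just an example symbol; build with "cc -o readdump readdump.c -lkvm".
 */
#include <sys/types.h>
#include <fcntl.h>
#include <kvm.h>
#include <limits.h>
#include <nlist.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	char errbuf[_POSIX2_LINE_MAX];
	struct nlist nl[] = { { "_panicstr" }, { NULL } };
	const char *panicstr;
	char buf[256];
	kvm_t *kd;

	kd = kvm_openfiles("/netbsd", "/var/crash/netbsd.0.core", NULL,
	    O_RDONLY, errbuf);
	if (kd == NULL) {
		fprintf(stderr, "kvm_openfiles: %s\n", errbuf);
		return EXIT_FAILURE;
	}
	if (kvm_nlist(kd, nl) != 0 || nl[0].n_value == 0) {
		fprintf(stderr, "panicstr symbol not found\n");
		return EXIT_FAILURE;
	}
	/* panicstr is a pointer; follow it to the message text. */
	if (kvm_read(kd, nl[0].n_value, &panicstr, sizeof(panicstr)) ==
	    sizeof(panicstr) && panicstr != NULL &&
	    kvm_read(kd, (u_long)panicstr, buf, sizeof(buf) - 1) > 0) {
		buf[sizeof(buf) - 1] = '\0';
		printf("panic: %s\n", buf);
	}
	kvm_close(kd);
	return EXIT_SUCCESS;
}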

Later on, after booting the same kernel, I got some more
"Whoa!  pool_cache_get returned an in-use entry!" messages.  At
some point (apparently not correlated in time with any of the
kernel messages) the system also hung hard again.  I pressed the
halt button and entered DDB:
halted CPU 0
CPU 1 is not halted

halt code = 1
operator initiated halt
PC = fffffc00005f6320
P00>>>cont

continuing CPU 0
CP - RESTORE_TERM routine to be called
panic: user requested console halt
Stopped in pid 0.35 (system) at netbsd:cpu_Debugger+0x4:        ret     zero,(ra)
db{0}> trace
cpu_Debugger() at netbsd:cpu_Debugger+0x4
panic() at netbsd:panic+0x268
console_restart() at netbsd:console_restart+0x78
XentRestart() at netbsd:XentRestart+0x90
--- console restart (from ipl 2) ---
mutex_spin_enter() at netbsd:mutex_spin_enter+0x200
pool_cache_put_slow() at netbsd:pool_cache_put_slow+0x138
pool_cache_put_paddr() at netbsd:pool_cache_put_paddr+0x168
pmap_do_tlb_shootdown() at netbsd:pmap_do_tlb_shootdown+0x178
alpha_ipi_process() at netbsd:alpha_ipi_process+0xb8
interrupt() at netbsd:interrupt+0x88
XentInt() at netbsd:XentInt+0x1c
--- interrupt (from ipl 4) ---
mutex_exit() at netbsd:mutex_exit+0x10
pool_cache_invalidate() at netbsd:pool_cache_invalidate+0x6c
pool_reclaim() at netbsd:pool_reclaim+0x68
pool_drain_end() at netbsd:pool_drain_end+0x44
uvm_pageout() at netbsd:uvm_pageout+0x740
exception_return() at netbsd:exception_return
--- root of call graph ---
db{0}> show reg
v0          0xfffffc0000c0de28  cpu_info_primary
t0          0x1
t1          0x1
t2          0xfffffc003ff48000
t3          0
t4          0
t5          0xfffffc0000b46a2d  __func__.21238+0x91c
t6          0xfffffc002e386888
t7          0
s0          0xfffffc0000c378e0  msgbufenabled
s1          0x104
s2          0xfffffc0000c350a8  db_onpanic
s3          0xfffffc0000c8ef18  panicstr
s4          0
s5          0xfffffc0000808e48  pool_cache_put_slow+0x138
s6          0
a0          0
a1          0xfffffd01fc0003f8
a2          0
a3          0x8
a4          0x3
a5          0xfffffc0000000008
t8          0xb
t9          0x8
t10         0x7fff
t11         0
ra          0xfffffc000080c5a8  panic+0x268
t12         0xfffffc00003eb590  cpu_Debugger
at          0x160328000
gp          0xfffffc0000c30928  __link_set_prop_linkpools_sym__link__prop_array_pool+0x8008
sp          0xfffffe00130e5908
pc          0xfffffc00003eb594  cpu_Debugger+0x4
ps          0x7
ai          0
pv          0xfffffc00003eb590  cpu_Debugger
netbsd:cpu_Debugger+0x4:        ret     zero,(ra)
db{0}> mach cpu 1
Using CPU 1
db{0}> trace

CPU 0: fatal kernel trap:

CPU 0    trap entry = 0x2 (memory management fault)
CPU 0    a0         = 0xffffffffffffffd9
CPU 0    a1         = 0x1
CPU 0    a2         = 0x0
CPU 0    pc         = 0xfffffc00003ee944
CPU 0    ra         = 0xfffffc00003e8104
CPU 0    pv         = 0xfffffc00003ee890
CPU 0    curlwp     = 0xfffffc003f9fe800
CPU 0        pid = 0, comm = system

Caught exception in ddb.
db{0}> c     
cpu1: shutting down...

dumping to dev 8,1 offset 2097597
dump i/o error


sd0(esiop0:0:0:0): polling command not done
panic: scsipi_execute_xs
Stopped in pid 0.35 (system) at netbsd:cpu_Debugger+0x4:        ret     zero,(ra)
db{0}> reboot 4

>     I'm wondering if using IPL_HIGH for the mutex changes anything:
>  
>  Index: sys/arch/alpha/alpha/pmap.c
>  ===================================================================
>  RCS file: /cvsroot/src/sys/arch/alpha/alpha/pmap.c,v
>  retrieving revision 1.243
>  diff -u -p -r1.243 pmap.c
>  --- sys/arch/alpha/alpha/pmap.c 4 Oct 2009 17:00:31 -0000       1.243
>  +++ sys/arch/alpha/alpha/pmap.c 11 Oct 2009 18:01:59 -0000
>  @@ -962,7 +962,7 @@ pmap_bootstrap(paddr_t ptaddr, u_int max
>           for (i = 0; i < ALPHA_MAXPROCS; i++) {
>                   TAILQ_INIT(&pmap_tlb_shootdown_q[i].pq_head);
>                   mutex_init(&pmap_tlb_shootdown_q[i].pq_lock, MUTEX_DEFAULT,
>  -                   IPL_SCHED);
>  +                   IPL_HIGH);
>           }
>    #endif
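
For what it's worth, my reading of this change: the third argument to
mutex_init() is the highest IPL at which the spin mutex is taken,
i.e. the level the SPL is raised to while it is held.  At IPL_SCHED,
higher-priority interrupts (such as the IPI that ends up in
pmap_do_tlb_shootdown() in the trace above) can still preempt a CPU
holding pq_lock; at IPL_HIGH nothing can.  A sketch of the two
variants (kernel context, so not buildable on its own):

/*
 * Sketch of the difference (kernel context; not compilable
 * standalone).  The IPL argument to mutex_init() sets how far the
 * SPL is raised while the spin mutex is held, and therefore which
 * interrupts are blocked out of the critical section.
 */
#include <sys/mutex.h>

static kmutex_t pq_lock;

static void
pq_lock_init_current(void)
{
	/*
	 * IPL_SCHED: interrupts above IPL_SCHED can still fire
	 * while pq_lock is held.
	 */
	mutex_init(&pq_lock, MUTEX_DEFAULT, IPL_SCHED);
}

static void
pq_lock_init_suggested(void)
{
	/* IPL_HIGH: all interrupts blocked while pq_lock is held. */
	mutex_init(&pq_lock, MUTEX_DEFAULT, IPL_HIGH);
}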

I then applied this patch.  I tried several 'build.sh -j4' jobs,
but all of them aborted with a host tool crashing:
/usr/tools/bin/nbgroff: grotty: Illegal instruction (core dumped)
/usr/tools/bin/nbgroff: troff: Illegal instruction (core dumped)

Then on one occasion, the kernel started repeatedly spewing
Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003f9ee00
messages to the console.  The pj value was identical in all the
messages, but the ci_index value varied (0 or 1).

Do you still think I should try increasing the IPL level of the
pool_cache entry, as suggested in your message?

                                        -jarle

