NetBSD-Bugs archive
Re: port-alpha/38335 (kernel freeze on alpha MP system)
The following reply was made to PR port-alpha/38335; it has been noted by GNATS.
From: Jarle Greipsland <jarle%uninett.no@localhost>
To: gnats-bugs%NetBSD.org@localhost, mhitch%lightning.msu.montana.edu@localhost
Cc: port-alpha-maintainer%netbsd.org@localhost,
gnats-admin%netbsd.org@localhost,
netbsd-bugs%netbsd.org@localhost
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Sun, 18 Oct 2009 17:20:33 +0200 (CEST)
"Michael L. Hitch" <mhitch%lightning.msu.montana.edu@localhost> writes:
> Yes - that patch just attempts to work around a duplicate pool cache
> entry. I have not been able to replicate this on my ES40, and don't have
> remote power capability on my CS20 which makes it harder to debug when
> things can hang hard.
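[As a toy illustration of the duplicate-entry failure mode described in the quote above — this is not NetBSD code, and all names in it are invented — a free list that ever accepts the same object twice will eventually hand out an entry that is still marked in-use, which is exactly the condition the "Whoa!" check catches:]

```python
# Toy sketch (invented names, not NetBSD's pool_cache implementation):
# if a race or bug inserts the same object on the free list twice, two
# consumers get the same "free" entry, and a later get() sees an entry
# that is still in use -- the analogue of
# "Whoa! pool_cache_get returned an in-use entry!"

class Entry:
    def __init__(self):
        self.in_use = False

class ToyPool:
    def __init__(self):
        self.free = []            # LIFO free list, like a cache group

    def put(self, obj):
        self.free.append(obj)     # no double-free check: a duplicate
                                  # insertion goes unnoticed here

    def get(self):
        obj = self.free.pop()
        if obj.in_use:            # the sanity check that fires
            raise RuntimeError("pool returned an in-use entry")
        obj.in_use = True
        return obj

if __name__ == "__main__":
    e = Entry()
    pool = ToyPool()
    pool.put(e)
    pool.put(e)                   # duplicate entry, as a race could produce
    a = pool.get()                # first get: fine, marks e in-use
    try:
        pool.get()                # second get returns the same entry
    except RuntimeError as err:
        print(err)
```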
Some more results:
I ran a few more 'build.sh -j4' jobs without changing anything,
and the kernel continued to emit spurious "Whoa! pool_cache_get
returned an in-use entry!" messages.
Then, on Oct 14, I synced my source tree to -current again (this
might have been a stupid move on my part...), and built and
booted a new GENERIC.MP kernel. About an hour into a
'build.sh -j4' job, it panicked with a new (to me) message:
panic: pool_get: pmaptlb: page empty
Stopped in pid 25249.1 (cc1plus) at netbsd:cpu_Debugger+0x4: ret zero,(ra)
db{0}> tr
cpu_Debugger() at netbsd:cpu_Debugger+0x4
panic() at netbsd:panic+0x268
pool_get() at netbsd:pool_get+0x408
pool_cache_get_slow() at netbsd:pool_cache_get_slow+0x1ac
pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x168
pmap_tlb_shootdown() at netbsd:pmap_tlb_shootdown+0xe8
pmap_kenter_pa() at netbsd:pmap_kenter_pa+0xe8
uvm_km_alloc() at netbsd:uvm_km_alloc+0x170
kmem_backend_alloc() at netbsd:kmem_backend_alloc+0x48
vmem_xalloc() at netbsd:vmem_xalloc+0x318
vmem_alloc() at netbsd:vmem_alloc+0x90
kmem_poolpage_alloc() at netbsd:kmem_poolpage_alloc+0x48
pool_grow() at netbsd:pool_grow+0x64
pool_get() at netbsd:pool_get+0x5c
pool_cache_get_slow() at netbsd:pool_cache_get_slow+0x1ac
pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x168
kmem_alloc() at netbsd:kmem_alloc+0xc8
amap_extend() at netbsd:amap_extend+0x1b8
uvm_mapent_trymerge() at netbsd:uvm_mapent_trymerge+0x3dc
amap_copy() at netbsd:amap_copy+0x410
uvm_fault_internal() at netbsd:uvm_fault_internal+0xbb0
trap() at netbsd:trap+0x548
XentMM() at netbsd:XentMM+0x20
--- memory management fault (from ipl 0) ---
--- user mode ---
db{0}> show reg
v0 0x6
t0 0x1
t1 0x1
t2 0xfffffc003ff48000
t3 0
t4 0
t5 0xfffffc0000b46a2d __func__.21238+0x91c
t6 0
t7 0
s0 0xfffffc0000c378e0 msgbufenabled
s1 0x104
s2 0xfffffc0000c350a8 db_onpanic
s3 0xfffffc0000c6b8e8 pmap_tlb_shootdown_job_cache+0x1a8
s4 0
s5 0x2
s6 0
a0 0
a1 0xfffffd01fc0003f8
a2 0
a3 0x8
a4 0x3
a5 0xfffffc0000000008
t8 0
t9 0x8
t10 0xfffffc0000c941a8 uvmexp
t11 0
ra 0xfffffc000080c5a8 panic+0x268
t12 0xfffffc00003eb590 cpu_Debugger
at 0x1
gp 0xfffffc0000c30928 __link_set_prop_linkpools_sym__link__prop_array_pool+0x8008
sp 0xfffffe0013c7f5e8
pc 0xfffffc00003eb594 cpu_Debugger+0x4
ps 0
ai 0
pv 0xfffffc00003eb590 cpu_Debugger
netbsd:cpu_Debugger+0x4: ret zero,(ra)
db{0}> ps
PID LID S CPU FLAGS STRUCT LWP * NAME WAIT
21308> 1 7 1 40004 fffffc0024916c00 cc1plus
9441 1 3 0 84 fffffc0024916000 alpha--netbsd-c+ wait
21710 1 3 0 84 fffffc0024813000 sh wait
26347 1 2 0 40004 fffffc002de34800 cc1plus
18814 1 3 0 84 fffffc0024813400 alpha--netbsd-c+ wait
24418 1 3 0 84 fffffc002daf2800 sh wait
25249> 1 7 0 4 fffffc002d29e400 cc1plus
19917 1 3 0 84 fffffc0024917c00 alpha--netbsd-c+ wait
2263 1 3 1 84 fffffc002daf3000 sh wait
16091 1 2 0 40004 fffffc002caa9800 cc1plus
19836 1 3 0 84 fffffc003edac400 alpha--netbsd-c+ wait
7537 1 3 1 84 fffffc00233e9400 sh wait
18485 1 3 0 84 fffffc0024d52000 nbmake select
8826 1 3 0 84 fffffc002daf2000 sh wait
28933 1 3 1 84 fffffc002cab8000 nbmake select
15439 1 3 1 84 fffffc0024812000 sh wait
22920 1 3 1 84 fffffc0024917800 nbmake select
16006 1 3 1 84 fffffc0024d53400 sh wait
22712 1 3 1 84 fffffc0024d52800 nbmake select
28515 1 3 1 84 fffffc002caa9c00 sh wait
24901 1 3 1 84 fffffc00233e8800 nbmake select
22708 1 3 1 84 fffffc0024917000 pickup kqueue
20418 1 3 0 84 fffffc002d29ec00 bash ttyraw
17752 1 3 0 84 fffffc002caa9400 xterm select
29471 1 3 1 84 fffffc003f678000 csh pause
8486 1 3 0 84 fffffc002cab8c00 sshd select
23567 1 3 1 84 fffffc002d29f400 sshd netio
4106 1 3 0 84 fffffc00233e8000 sh wait
3271 1 3 1 84 fffffc00233e8c00 nbmake select
3779 1 3 0 84 fffffc0024916400 sh wait
3702 1 3 1 84 fffffc00233e9000 nbmake select
3935 1 3 1 84 fffffc003ca9e400 sh wait
3603 1 3 1 84 fffffc003ca9f000 nbmake select
750 1 3 0 84 fffffc003ca9e800 sh wait
730 1 3 1 84 fffffc003ca9ec00 tail kqueue
733 1 3 0 84 fffffc003f678400 tcsh pause
507 1 3 0 84 fffffc003ca9f800 tcsh pause
40 1 3 1 84 fffffc003ca9fc00 tcsh pause
41 1 3 1 84 fffffc003edad000 xterm select
459 1 3 0 84 fffffc003f9fe800 csh pause
473 1 3 1 84 fffffc003edac800 sshd select
528 1 3 0 84 fffffc003f960000 sshd netio
501 1 3 1 84 fffffc003f679800 qmgr kqueue
350 1 3 1 84 fffffc003f961c00 getty ttyraw
484 1 3 1 84 fffffc003f679400 cron nanoslp
498 1 3 1 84 fffffc003edac000 inetd kqueue
468 1 3 1 84 fffffc003edacc00 master kqueue
300 1 3 0 84 fffffc003edad400 sshd select
301 1 3 0 84 fffffc003edad800 ntpd pause
183 1 3 0 84 fffffc003edadc00 mount_mfs mfsidl
152 1 3 1 84 fffffc003f679000 syslogd kqueue
1 1 3 0 84 fffffc003f961000 init wait
0 39 3 0 204 fffffc003f678800 physiod physiod
0 38 3 1 204 fffffc003f960400 vmem_rehash vmem_rehash
0 37 3 0 204 fffffc003f960800 aiodoned aiodoned
0 36 3 1 204 fffffc003f9fe400 ioflush syncer
0 35 3 1 204 fffffc003f9fe000 pgdaemon pgdaemon
0 34 3 0 204 fffffc003f961400 cryptoret crypto_wait
0 33 3 0 204 fffffc003f961800 atapibus0 sccomp
0 31 3 0 204 fffffc003f960c00 unpgc unpgc
0 22 3 1 204 fffffc003f9fec00 atabus1 atath
0 21 3 0 204 fffffc003f9ff000 atabus0 atath
0 20 3 1 204 fffffc003f9ff400 scsibus0 sccomp
0 19 3 1 204 fffffc003f9ff800 xcall/1 xcall
0 18 1 1 204 fffffc003f9ffc00 softser/1
0 17 1 1 204 fffffc003fcf8000 softclk/1
0 16 1 1 204 fffffc003fcf8400 softbio/1
0 15 1 1 204 fffffc003fcf8800 softnet/1
0 14 1 1 205 fffffc003fcf8c00 idle/1
0 13 3 1 204 fffffc003fcf9000 pmfsuspend pmfsuspend
0 12 3 0 204 fffffc003fcf9400 pmfevent pmfevent
0 11 3 1 204 fffffc003fcf9800 nfssilly nfssilly
0 10 3 0 204 fffffc003fcf9c00 cachegc cachegc
0 9 3 1 204 fffffc003fe28000 vrele vrele
0 8 3 1 204 fffffc003fe28400 modunload modunload
0 7 3 0 204 fffffc003fe28800 xcall/0 xcall
0 6 1 0 204 fffffc003fe28c00 softser/0
0 5 1 0 204 fffffc003fe29000 softclk/0
0 4 1 0 204 fffffc003fe29400 softbio/0
0 3 1 0 204 fffffc003fe29800 softnet/0
0 2 1 0 205 fffffc003fe29c00 idle/0
0 1 3 0 204 fffffc0000c16660 swapper schedule
db{0}> mach cpu 1
Using CPU 1
db{0}> tr
CPU 0: fatal kernel trap:
CPU 0 trap entry = 0x2 (memory management fault)
CPU 0 a0 = 0xffffffffffffffd9
CPU 0 a1 = 0x1
CPU 0 a2 = 0x0
CPU 0 pc = 0xfffffc00003ee944
CPU 0 ra = 0xfffffc00003e8104
CPU 0 pv = 0xfffffc00003ee890
CPU 0 curlwp = 0xfffffc002d29e400
CPU 0 pid = 25249, comm = cc1plus
Caught exception in ddb.
db{0}> show reg
v0 0x1
t0 0x5b93a000
t1 0x2000
t2 0
t3 0x2000
t4 0xf
t5 0x16055e050
t6 0x2
t7 0
s0 0x2
s1 0x5b93a000
s2 0x161214008
s3 0x161214000
s4 0x16055e1d0
s5 0x16055e040
s6 0x50
a0 0x5b93a000
a1 0x48
a2 0
a3 0
a4 0x5
a5 0x5
t8 0x160814400
t9 0x3
t10 0x120582e58
t11 0x120581c28
ra 0x16063d168
t12 0x16064db44
at 0x2
gp 0x1606fb790
sp 0x1
pc 0x16064db70
ps 0x8
ai 0x120581c28
pv 0x16064db44
0x16064db70: and t1,#0xff,t4
I also got a kernel core dump from this one. Please give some
instructions if you want me to dig up some data from the dump.
Later on, after booting the same kernel, I got some more
"Whoa! pool_cache_get returned an in-use entry!" messages. Also
at some point (I don't think it correlated in time with any of
the kernel messages) the system hung hard again. I pressed the
halt button, and entered DDB:
halted CPU 0
CPU 1 is not halted
halt code = 1
operator initiated halt
PC = fffffc00005f6320
P00>>>cont
continuing CPU 0
CP - RESTORE_TERM routine to be called
panic: user requested console halt
Stopped in pid 0.35 (system) at netbsd:cpu_Debugger+0x4: ret zero,(ra)
db{0}> trace
cpu_Debugger() at netbsd:cpu_Debugger+0x4
panic() at netbsd:panic+0x268
console_restart() at netbsd:console_restart+0x78
XentRestart() at netbsd:XentRestart+0x90
--- console restart (from ipl 2) ---
mutex_spin_enter() at netbsd:mutex_spin_enter+0x200
pool_cache_put_slow() at netbsd:pool_cache_put_slow+0x138
pool_cache_put_paddr() at netbsd:pool_cache_put_paddr+0x168
pmap_do_tlb_shootdown() at netbsd:pmap_do_tlb_shootdown+0x178
alpha_ipi_process() at netbsd:alpha_ipi_process+0xb8
interrupt() at netbsd:interrupt+0x88
XentInt() at netbsd:XentInt+0x1c
--- interrupt (from ipl 4) ---
mutex_exit() at netbsd:mutex_exit+0x10
pool_cache_invalidate() at netbsd:pool_cache_invalidate+0x6c
pool_reclaim() at netbsd:pool_reclaim+0x68
pool_drain_end() at netbsd:pool_drain_end+0x44
uvm_pageout() at netbsd:uvm_pageout+0x740
exception_return() at netbsd:exception_return
--- root of call graph ---
db{0}> show reg
v0 0xfffffc0000c0de28 cpu_info_primary
t0 0x1
t1 0x1
t2 0xfffffc003ff48000
t3 0
t4 0
t5 0xfffffc0000b46a2d __func__.21238+0x91c
t6 0xfffffc002e386888
t7 0
s0 0xfffffc0000c378e0 msgbufenabled
s1 0x104
s2 0xfffffc0000c350a8 db_onpanic
s3 0xfffffc0000c8ef18 panicstr
s4 0
s5 0xfffffc0000808e48 pool_cache_put_slow+0x138
s6 0
a0 0
a1 0xfffffd01fc0003f8
a2 0
a3 0x8
a4 0x3
a5 0xfffffc0000000008
t8 0xb
t9 0x8
t10 0x7fff
t11 0
ra 0xfffffc000080c5a8 panic+0x268
t12 0xfffffc00003eb590 cpu_Debugger
at 0x160328000
gp 0xfffffc0000c30928 __link_set_prop_linkpools_sym__link__prop_array_pool+0x8008
sp 0xfffffe00130e5908
pc 0xfffffc00003eb594 cpu_Debugger+0x4
ps 0x7
ai 0
pv 0xfffffc00003eb590 cpu_Debugger
netbsd:cpu_Debugger+0x4: ret zero,(ra)
db{0}> mach cpu 1
Using CPU 1
db{0}> trace
CPU 0: fatal kernel trap:
CPU 0 trap entry = 0x2 (memory management fault)
CPU 0 a0 = 0xffffffffffffffd9
CPU 0 a1 = 0x1
CPU 0 a2 = 0x0
CPU 0 pc = 0xfffffc00003ee944
CPU 0 ra = 0xfffffc00003e8104
CPU 0 pv = 0xfffffc00003ee890
CPU 0 curlwp = 0xfffffc003f9fe800
CPU 0 pid = 0, comm = system
Caught exception in ddb.
db{0}> c
cpu1: shutting down...
dumping to dev 8,1 offset 2097597
dump i/o error
sd0(esiop0:0:0:0): polling command not done
panic: scsipi_execute_xs
Stopped in pid 0.35 (system) at netbsd:cpu_Debugger+0x4: ret zero,(ra)
db{0}> reboot 4
> I'm wondering if using IPL_HIGH for the mutex changes anything:
>
> Index: sys/arch/alpha/alpha/pmap.c
> ===================================================================
> RCS file: /cvsroot/src/sys/arch/alpha/alpha/pmap.c,v
> retrieving revision 1.243
> diff -u -p -r1.243 pmap.c
> --- sys/arch/alpha/alpha/pmap.c 4 Oct 2009 17:00:31 -0000 1.243
> +++ sys/arch/alpha/alpha/pmap.c 11 Oct 2009 18:01:59 -0000
> @@ -962,7 +962,7 @@ pmap_bootstrap(paddr_t ptaddr, u_int max
> for (i = 0; i < ALPHA_MAXPROCS; i++) {
> TAILQ_INIT(&pmap_tlb_shootdown_q[i].pq_head);
> mutex_init(&pmap_tlb_shootdown_q[i].pq_lock, MUTEX_DEFAULT,
> - IPL_SCHED);
> + IPL_HIGH);
> }
> #endif
I then applied this patch. I tried several 'build.sh -j4' jobs,
but all of them seemed to abort with some host tool tripping up:
/usr/tools/bin/nbgroff: grotty: Illegal instruction (core dumped)
/usr/tools/bin/nbgroff: troff: Illegal instruction (core dumped)
Then on one occasion, the kernel started to repeatedly spew
Whoa! pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003f9ee00
messages to the console. The pj value was identical in all the
messages, but the ci_index value varied (0 or 1).
Do you still think I should try to increase the IPL level of the
pool_cache entry as specified in your message?
-jarle