NetBSD-Bugs archive


Re: port-alpha/38335 (kernel freeze on alpha MP system)

The following reply was made to PR port-alpha/38335; it has been noted by GNATS.

From: "Michael L. Hitch" <>
To: Jarle Greipsland <>
Subject: Re: port-alpha/38335 (kernel freeze on alpha MP system)
Date: Sun, 18 Oct 2009 11:37:20 -0600 (MDT)

 On Sun, 18 Oct 2009, Jarle Greipsland wrote:
 > I ran a few more ' -j4' jobs without changing anything,
 > and the kernel continued to emit spurious "Whoa!  pool_cache_get
 > returned an in-use entry!"  messages.
 > Then, on Oct 14, I synced my source tree to -current again (this
 > might have been a stupid move on my part...), and built and
 > booted a new GENERIC.MP kernel.  After about an hour into a
 > ' -j4' job, it panicked with a new (to me) message:
 > panic: pool_get: pmaptlb: page empty
    This likely indicates pool_cache corruption, which would be consistent 
 with the problem that my patch worked around.
 > db{0}> mach cpu 1
 > Using CPU 1
 > db{0}> tr
 > CPU 0: fatal kernel trap:
    There's a bug somewhere in handling the registers of the cpus other than 
 the one that's running DDB.  I can't remember if this ever worked 
 properly, although I'm sure I've at least tried it in the past.
 > I also got a kernel core dump from this one.  Please give some
 > instructions if you want me to dig up some data from the dump.
    I don't think there's a way to get much information out of a core dump 
 since 5.0.  The new gdb doesn't yet have the code to access a core dump.
 One more thing to try fixing some day.
 > Later on, after booting the same kernel, I got some more
 > "Whoa!  pool_cache_get returned an in-use entry!" messages.  Also
 > at some point (I don't think it correlated in time with any of
 > the kernel messages) the system hung hard again.  I pressed the
 > halt button, and entered DDB:
 > halted CPU 0
 > CPU 1 is not halted
 > halt code = 1
 > operator initiated halt
 > PC = fffffc00005f6320
 > P00>>>cont
 > continuing CPU 0
 > CP - RESTORE_TERM routine to be called
 > panic: user requested console halt
 > Stopped in pid 0.35 (system) at netbsd:cpu_Debugger+0x4:        ret     zero,(ra)
 > db{0}> trace
 > cpu_Debugger() at netbsd:cpu_Debugger+0x4
 > panic() at netbsd:panic+0x268
 > console_restart() at netbsd:console_restart+0x78
 > XentRestart() at netbsd:XentRestart+0x90
 > --- console restart (from ipl 2) ---
 > mutex_spin_enter() at netbsd:mutex_spin_enter+0x200
 > pool_cache_put_slow() at netbsd:pool_cache_put_slow+0x138
 > pool_cache_put_paddr() at netbsd:pool_cache_put_paddr+0x168
 > pmap_do_tlb_shootdown() at netbsd:pmap_do_tlb_shootdown+0x178
 > alpha_ipi_process() at netbsd:alpha_ipi_process+0xb8
 > interrupt() at netbsd:interrupt+0x88
 > XentInt() at netbsd:XentInt+0x1c
 > --- interrupt (from ipl 4) ---
 > mutex_exit() at netbsd:mutex_exit+0x10
 > pool_cache_invalidate() at netbsd:pool_cache_invalidate+0x6c
 > pool_reclaim() at netbsd:pool_reclaim+0x68
 > pool_drain_end() at netbsd:pool_drain_end+0x44
 > uvm_pageout() at netbsd:uvm_pageout+0x740
 > exception_return() at netbsd:exception_return
    This also appears to be a pool cache corruption problem.  In this 
 particular case, CPU0 has received an IPI to shoot down its TLB entries, 
 is trying to release a pool cache entry used for this call, and is hung 
 trying to acquire a mutex related to the pool cache.  The other CPU 
 presumably holds that lock (and may be trying to acquire a lock held by 
 CPU0, leading to the classic deadlock).
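 As a rough illustration of the "classic deadlock" mentioned above, here is 
 a small user-space model (not kernel code).  The struct, the lock numbers 
 and the function name are all made up for this sketch; it only checks 
 whether two CPUs are each spinning on the lock the other one holds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: locks are small integers, NO_LOCK means "none". */
enum { NO_LOCK = -1 };

struct cpu_state {
	int holds;     /* lock this CPU currently owns */
	int spins_on;  /* lock this CPU is spinning to acquire */
};

/*
 * Return true if each CPU is spinning on the lock that the other CPU
 * holds, i.e. neither can ever make progress.
 */
bool
is_ab_ba_deadlock(const struct cpu_state *a, const struct cpu_state *b)
{
	assert(a != NULL && b != NULL);

	if (a->spins_on == NO_LOCK || b->spins_on == NO_LOCK)
		return false;	/* at least one CPU is not waiting */

	return a->spins_on == b->holds && b->spins_on == a->holds;
}
```

 In this model, a CPU0 that holds lock 1 and spins on lock 2, paired with a 
 CPU1 that holds lock 2 and spins on lock 1, is reported as deadlocked.  If 
 CPU1 is not waiting on anything it will eventually release its lock, so 
 the check returns false.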
 >>     I'm wondering if using IPL_HIGH for the mutex changes anything:
 >>  Index: sys/arch/alpha/alpha/pmap.c
 >>  ===================================================================
 >>  RCS file: /cvsroot/src/sys/arch/alpha/alpha/pmap.c,v
 >>  retrieving revision 1.243
 >>  diff -u -p -r1.243 pmap.c
 >>  --- sys/arch/alpha/alpha/pmap.c 4 Oct 2009 17:00:31 -0000       1.243
 >>  +++ sys/arch/alpha/alpha/pmap.c 11 Oct 2009 18:01:59 -0000
 >>  @@ -962,7 +962,7 @@ pmap_bootstrap(paddr_t ptaddr, u_int max
 >>           for (i = 0; i < ALPHA_MAXPROCS; i++) {
 >>                   TAILQ_INIT(&pmap_tlb_shootdown_q[i].pq_head);
 >>                   mutex_init(&pmap_tlb_shootdown_q[i].pq_lock, 
 >>  -                   IPL_SCHED);
 >>  +                   IPL_HIGH);
 >>           }
 >>    #endif
 > I then applied this patch.  I tried several ' -j4' jobs,
 > but all of them seemed to abort with some host tool tripping up:
 > /usr/tools/bin/nbgroff: grotty: Illegal instruction (core dumped)
 > /usr/tools/bin/nbgroff: troff: Illegal instruction (core dumped)
    This is the problem I get now on my ES45.  I haven't been able to figure 
 out anything from the process core dump, other than that memory seems 
 corrupted or incorrect.  I'm not sure if it might be related to the tlb 
 shootdown code not working properly, or perhaps a missing memory barrier 
 call.
 > Then on one occasion, the kernel started to repeatedly spew
 > Whoa!  pool_cache_get returned an in-use entry! ci_index 0 pj 0xfffffc003f9ee00
 > messages to the console.  The pj value were identical in all the
 > messages, but the ci_index value varied (0 or 1).
 > Do you still think I should try and increase the IPL level of the
 > pool_cache entry as specified in your message?
    Try the higher IPL on the pmap_tlb_shootdown_job_cache.  I'm not real 
 clear on how that IPL is used, but I'm guessing that it might be the IPL 
 used by any locking done by the pool cache routines, and may be needed to 
 prevent the IPI interrupt from interrupting a pool cache operation.  [That 
 might have caused the deadlock you observed above.]  Try IPL_CLOCK first, 
 and then IPL_HIGH if that still has problems relating to the pool cache.
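 To make the reasoning about the IPL concrete, here is a minimal user-space 
 sketch (not kernel code) of the guess above.  It assumes that an interrupt 
 is delivered only when its level is above the CPU's current IPL, and that 
 holding a spin lock keeps the CPU at no less than the lock's IPL.  The 
 function names and the numeric levels are hypothetical; the real values 
 are machine-dependent.

```c
#include <assert.h>
#include <stdbool.h>

/* An interrupt is delivered only if its level is above the current IPL. */
bool
interrupt_delivered(int cur_ipl, int intr_ipl)
{
	return intr_ipl > cur_ipl;
}

/* While a spin lock is held, the CPU runs at least at the lock's IPL. */
int
ipl_while_holding(int base_ipl, int lock_ipl)
{
	return lock_ipl > base_ipl ? lock_ipl : base_ipl;
}

/*
 * If the interrupt can preempt the lock holder on the same CPU and the
 * handler then wants the same lock, the handler spins forever.
 */
bool
handler_self_deadlocks(int base_ipl, int lock_ipl, int intr_ipl)
{
	assert(base_ipl >= 0 && lock_ipl >= 0 && intr_ipl >= 0);
	return interrupt_delivered(ipl_while_holding(base_ipl, lock_ipl),
	    intr_ipl);
}
```

 With made-up levels where the IPI arrives at 5, a lock initialized at 
 level 0 leaves the critical section open to the IPI, so the handler can 
 get stuck on the lock.  A lock at level 5 or 6 keeps the IPI masked until 
 the lock is released.  Whether this matches what the pool cache actually 
 does with its IPL argument is exactly the open question in this message.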
 Michael L. Hitch             
 Computer Consultant
 Information Technology Center
 Montana State University       Bozeman, MT     USA
