tech-kern: Re: Getting "TLB IPI rendezvous failed..."

Subject: Re: Getting "TLB IPI rendezvous failed..."
To: Manuel Bouyer <bouyer@antioche.lip6.fr>
From: Stephan Uphoff <ups@tree.com>
List: tech-kern
Date: 01/21/2005 11:03:50

I have a few ideas.
Hopefully I will be able to send you a test patch over the weekend.
Can you send me your dmesg?

Thanks
Stephan

On Fri, 2005-01-21 at 05:53, Manuel Bouyer wrote:
> On Thu, Jan 20, 2005 at 01:40:48PM +0100, Manuel Bouyer wrote:
> > Here it is. Still pipe related, but what the second CPU was doing at the
> > same time is interesting:
> > 
> > CPU 1 (the one that panicked):
> > panic()
> > pmap_tlb_shootnow()
> > pmap_kremove()
> > pipe_direct_write()
> > pipe_write()
> > ...
> > 
> > CPU 0:
> > _kernel_lock()
> > intr_biglock_wrapper()
> > Xintr_ioapic_edge15()
> > Xspllower()
> > _kernel_lock()
> > x86_softintlock()
> > Xsoftclock()
> > 
> > I just noticed that I didn't have LOCKDEBUG enabled in this kernel :(
> > I'll install a new one for the next panic.
> 
> LOCKDEBUG didn't bring anything more.
> The new panic I got tonight:
> CPU 1:
> panic
> pmap_tlb_shootnow
> pmap_do_remove
> pmap_remove
> ubc_alloc
> ffs_write
> CPU 0:
> _kernel_lock
> intr_biglock_wrapper
> Xintr_ioapic_level10
> Xspllower
> simple_lock_held
> _kernel_lock
> x86_softintlock
> Xsoftclock
> 
> A few things to notice:
> - it seems it's always CPU 1 which panics, and CPU 0 which holds the lock
> - even though pipes don't appear in this trace, it's still related to
>   amanda backups, which make heavy use of pipes
> - again it had about 500M free RAM when it panicked
> - CPU 0 seems to always come from a soft clock interrupt
> - the recent changes to protect IPIs with splclock() cause the traces to
>   be different. With 2.0, CPU 0 was stuck with a tsleep()/mi_switch()
>   in the path.
> 
> Anything else I can try to help debug this?