Subject: Re: Getting "TLB IPI rendezvous failed..."
To: Stephan Uphoff <ups@tree.com>
From: Manuel Bouyer <bouyer@antioche.lip6.fr>
List: tech-kern
Date: 01/21/2005 11:53:08
On Thu, Jan 20, 2005 at 01:40:48PM +0100, Manuel Bouyer wrote:
> Here it is. Still pipe related, but what the second CPU was doing at the
> same time is interesting:
> 
> CPU 1 (the one that panicked):
> panic()
> pmap_tlb_shootnow()
> pmap_kremove()
> pipe_direct_write()
> pipe_write()
> ...
> 
> CPU 0:
> _kernel_lock()
> intr_biglock_wrapper()
> Xintr_ioapic_edge15()
> Xspllower()
> _kernel_lock()
> x86_softintlock()
> Xsoftclock()
> 
> I just noticed that I didn't have lockdebug enabled in this kernel :(
> I'll install a new one for the next panic.

LOCKDEBUG didn't reveal anything more.
The new panic I got tonight:
CPU 1:
panic
pmap_tlb_shootnow
pmap_do_remove
pmap_remove
ubc_alloc
ffs_write
CPU 0:
_kernel_lock
intr_biglock_wrapper
Xintr_ioapic_level10
Xspllower
simple_lock_held
_kernel_lock
x86_softintlock
Xsoftclock

A few things to notice:
- it seems it's always CPU 1 which panics, and CPU 0 which holds the lock
- even though pipe didn't appear in this trace, it's still related to
  amanda backups, which make heavy use of pipes
- again it had about 500M of free RAM when it panicked
- CPU 0 seems to always come from a soft clock interrupt
- the recent changes to protect IPIs with splclock() cause the traces to
  be different. With 2.0, CPU 0 was stuck with a tsleep()/mi_switch()
  in the path.

Is there anything else I can try to help debug this?

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 years of experience will always make the difference