Subject: Re: Getting "TLB IPI rendezvous failed..."
To: Manuel Bouyer <bouyer@antioche.eu.org>
From: Stephan Uphoff <ups@tree.com>
List: tech-kern
Date: 01/15/2005 11:36:59
On Sat, 2005-01-15 at 10:05, Manuel Bouyer wrote:
> On Thu, Jan 13, 2005 at 01:16:26AM +0100, Frank van der Linden wrote:
> > On Tue, Jan 11, 2005 at 11:44:33PM -0500, Stephan Uphoff wrote:
> > > You can also just add the splclock()/splx in x86_ipi as there is no
> > > need to protect the atomic bitmaps.
> > 
> > Ayup. Many thanks for the suggestions, I committed that change.
> > 
> > Can the people who had these problems (Fred, Havard?) see if this makes
> > any change? I tested if the changes work on one of my SMP systems, but
> > I could never reproduce the bug itself on those in the first place.
> 
> I backported these changes to a netbsd-2-0-RELEASE kernel. It didn't help for
> http://www.netbsd.org/cgi-bin/query-pr-single.pl?number=28541
> It panicked again while the amanda client was running.
> 
> If you think that a current kernel has additional fixes that may be relevant,

Hmm, I see a change in the i386 spl logic:

> Updating ci_ilevel and testing ci_ipending must be done with all
> interrupts off, or priority inversion can occur, which can lead to IPI
> deadlocks.
> Leaves interrupts off for a bit longer, sadly, but with no noticeable
> effects on the systems I tested on.
>
> From YAMAMOTO Takashi.

That change did not make it into netbsd-2-0-RELEASE.
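For illustration only, here is a minimal user-space C model of the rule in the quoted commit message: lowering the priority level (a stand-in for ci_ilevel) and scanning the pending mask (a stand-in for ci_ipending) happen as one step with interrupts disabled. struct cpu_model, model_splraise(), model_splx() and model_intr() are hypothetical names, and the CPU interrupt flag is just a bool. This is a sketch of the idea, not the NetBSD i386 implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified per-CPU spl state; NOT NetBSD's struct cpu_info. */
struct cpu_model {
    int ilevel;          /* current priority level (ci_ilevel analogue) */
    uint32_t ipending;   /* bit n set => a level-n interrupt was deferred */
    bool intr_enabled;   /* models the CPU interrupt flag (cli/sti) */
    int handled;         /* number of interrupt handlers "run" so far */
};

static void disable_intr(struct cpu_model *ci) { ci->intr_enabled = false; }
static void enable_intr(struct cpu_model *ci) { ci->intr_enabled = true; }

/* Highest pending level strictly above 'level', or -1 if none. */
static int highest_pending_above(uint32_t pending, int level)
{
    for (int l = 31; l > level; l--)
        if (pending & (1u << l))
            return l;
    return -1;
}

/* An interrupt arrives: run it now if unmasked, otherwise mark it pending. */
static void model_intr(struct cpu_model *ci, int level)
{
    assert(level >= 0 && level < 32);
    if (level > ci->ilevel)
        ci->handled++;
    else
        ci->ipending |= 1u << level;
}

/* Raising the level never unmasks anything, so no pending scan is needed. */
static void model_splraise(struct cpu_model *ci, int nlevel)
{
    if (nlevel > ci->ilevel)
        ci->ilevel = nlevel;
}

/*
 * Lower the level to 'nlevel' and run any deferred interrupts that are now
 * unmasked. The level update and the pending scan are done together with
 * interrupts disabled, so no interrupt can slip in between the two steps.
 */
static void model_splx(struct cpu_model *ci, int nlevel)
{
    int l;

    disable_intr(ci);
    ci->ilevel = nlevel;
    while ((l = highest_pending_above(ci->ipending, ci->ilevel)) >= 0) {
        ci->ipending &= ~(1u << l);
        ci->handled++;   /* stand-in for calling the deferred handler */
    }
    enable_intr(ci);
}
```

In use: raise the level to 6, deliver a level-4 interrupt (it gets deferred into the pending mask), then call model_splx(&ci, 0); the deferred interrupt is serviced inside the interrupts-off window and the mask ends up empty.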

> I can try a current kernel.

This would be helpful.

> Also, I have a dual-CPU sparc10 with a similar workload (several mrtg
> processes, apc UPS on serial port, amanda client) which never shows this
> problem, so it may be an i386-specific issue.