Current-Users archive


Re: 10.99.7 panic: defibrillate



> Date: Mon, 14 Aug 2023 18:16:49 +0200
> From: Thomas Klausner <wiz%NetBSD.org@localhost>
> 
> On Mon, Aug 14, 2023 at 12:41:06PM +0200, Thomas Klausner wrote:
> > I had followed your suggestion and bumped the heartbeat limit from 15
> > to 300, but today it panicked again.
> > 
> > panic: cpu8: found cpu9 heart stopped beating and unresponsive
> > 
> > I have a core dump in case you want any particular details.
> > 
> > I've now set it to 0.
> 
> and had a hard hang less than half a day later.
> 
> This hasn't been happening in 10.99.5 (at least not with that
> frequency), which had uptimes of weeks, so either the heartbeat code
> introduced additional problems (even if disabled this way) or
> something else got worse, or I am really really unlucky right now.

Welp.

I don't think simply having the heartbeat(9) code around would cause a
hang -- it's new code, which is higher-risk, but the design of the
code is very low-risk (all loops are bounded; interrupt handler and
soft interrupt handler are short and easy to audit for bounded
latency; each CPU only writes to its own per-CPU state).  I think it's
more likely something else changed.

Looks like it's time to bisect the commits since your last good
build, checking whether each candidate can make it a whole day
without panicking.

874 commits since I bumped 10.99.5 (which was incidentally when I
introduced heartbeat(9)), so...it should only take a week or two if
the problem takes half a day to reproduce!
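
For a rough check of that estimate, here is a minimal sketch of the
arithmetic. It assumes an ideal binary search over the 874 commits and
one test build at a time. The function names and the per-test
durations are illustrative assumptions: half a day if a bad build
fails that quickly, a whole day to call a build good.

```python
def bisect_steps(n_commits: int) -> int:
    """Worst-case number of builds to test in an ideal binary search.

    Equivalent to ceil(log2(n_commits)) for n_commits >= 1, computed
    with integer arithmetic to avoid floating-point rounding.
    """
    return (n_commits - 1).bit_length()


def bisect_days(n_commits: int, days_per_test: float) -> float:
    """Total wall-clock days if every step takes days_per_test."""
    return bisect_steps(n_commits) * days_per_test


if __name__ == "__main__":
    commits = 874  # commits since the 10.99.5 bump, per the message
    print("steps:", bisect_steps(commits))
    # Every tested build fails after about half a day (assumption).
    print("days, all bad:", bisect_days(commits, 0.5))
    # Every tested build must survive a whole day to count as good.
    print("days, all good:", bisect_days(commits, 1.0))
```

Under these assumptions the search takes about 10 builds, or roughly
5 to 10 days, in line with "a week or two". An actual bisection may
take a step more or fewer, and a build that merely got lucky for a
day would mislead the search.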

