Current-Users archive


Re: 10.99.7 panic: defibrillate



> Date: Sun, 13 Aug 2023 06:16:51 -0400
> From: Greg Troxel <gdt%lexort.com@localhost>
> 
> Would it be useful for heartbeat to have a just-log-don't-panic option?

Worth considering, but...

> It feels like we are in a state where we know there is a problem somewhere,
> and we don't know if it is in heartbeat, the kernel, or hardware.

...in this case it is already clear that under heavy disk I/O,
something is either holding onto a spin lock or starving softints and
threads at priority below softbio for much too long.

Holding a spin lock -- or otherwise running at raised IPL -- for 5sec
is already enough to violate the contract of the timecounter at
hz=100, which can lead to monotonic time going backwards, which breaks
all kinds of things but maybe only in subtle ways that are extremely
hard to diagnose retrospectively.
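
To make the timekeeping hazard concrete, here is a toy user-space model
of a wrapping hardware counter.  It is not the kern_tc.c code, and the
names (toy_counter, toy_read, toy_measured_us) are made up for
illustration.  The only point it demonstrates is that if nothing
samples the counter for longer than its wrap period, the masked
difference silently under-counts the elapsed time.  How that turns into
time going backwards in the real timecounter code is not modelled here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of a wrapping hardware counter (hypothetical, illustration
 * only).  The counter has `bits` significant bits and ticks at `freq_hz`.
 */
struct toy_counter {
	uint64_t freq_hz;	/* counter frequency in Hz */
	unsigned bits;		/* implemented counter width, 1..32 */
};

/* Mask selecting the implemented bits of the counter. */
static uint32_t
toy_mask(const struct toy_counter *tc)
{
	assert(tc->bits >= 1 && tc->bits <= 32);
	return (tc->bits == 32) ? UINT32_MAX :
	    ((UINT32_C(1) << tc->bits) - 1);
}

/* Raw value the counter would show `elapsed_us` microseconds after 0. */
static uint32_t
toy_read(const struct toy_counter *tc, uint64_t elapsed_us)
{
	uint64_t ticks = elapsed_us * tc->freq_hz / 1000000;

	return (uint32_t)(ticks & toy_mask(tc));
}

/*
 * Elapsed microseconds as inferred from two raw reads, using only the
 * masked difference.  Any whole wrap periods between the two reads are
 * invisible to this computation.
 */
static uint64_t
toy_measured_us(const struct toy_counter *tc, uint64_t t0_us, uint64_t t1_us)
{
	uint32_t delta;

	delta = (uint32_t)(toy_read(tc, t1_us) - toy_read(tc, t0_us)) &
	    toy_mask(tc);
	return (uint64_t)delta * 1000000 / tc->freq_hz;
}
```

With an assumed 24-bit counter at 3579545 Hz (chosen only because it
wraps in a bit under 5 seconds), a 10 ms gap between reads is measured
as roughly 10 ms, but a 5 s gap is measured as well under 1 s.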

I thought all the uvm aiodone business was supposed to be deferred to
workqueue context (which would not hold up heartbeats), but it looks
like we have a path (softbio) biodone -> biodone2 -> uvm_aio_aiodone
-> uvm_pagermapout -> vm_map_lock -> cv_wait, which is forbidden in
softint context (and should really trip a KASSERT).  This might not be
the problem, but it's evidence that the code path is on shaky ground.
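
For readers unfamiliar with the deferral pattern, the following is a
minimal user-space sketch of the idea, not NetBSD code.  All names
(toy_queue, toy_complete, toy_worker, toy_may_sleep) are hypothetical.
The simulated soft interrupt only enqueues a completion; the simulated
worker thread drains the queue and is the only place allowed to call
something that may sleep, with an assertion standing in for the kind of
check a KASSERT would provide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_ctx { TOY_THREAD, TOY_SOFTINT };

#define TOY_QLEN 8

struct toy_queue {
	int items[TOY_QLEN];	/* pending completion ids (ring buffer) */
	size_t head;		/* index of the oldest pending item */
	size_t count;		/* number of pending items */
	int slept;		/* how many times a may-sleep op ran */
};

/* Stand-in for an operation that may block; only legal in thread context. */
static void
toy_may_sleep(struct toy_queue *q, enum toy_ctx ctx)
{
	assert(ctx == TOY_THREAD);	/* toy analogue of a KASSERT */
	q->slept++;
}

/* Simulated soft interrupt side: never sleeps, only enqueues the work. */
static bool
toy_complete(struct toy_queue *q, int id)
{
	if (q->count == TOY_QLEN)
		return false;		/* queue full, caller must cope */
	q->items[(q->head + q->count) % TOY_QLEN] = id;
	q->count++;
	return true;
}

/* Simulated worker thread side: drains the queue and may sleep. */
static size_t
toy_worker(struct toy_queue *q)
{
	size_t done = 0;

	while (q->count > 0) {
		int id = q->items[q->head];

		(void)id;		/* a real worker would act on it */
		q->head = (q->head + 1) % TOY_QLEN;
		q->count--;
		toy_may_sleep(q, TOY_THREAD);
		done++;
	}
	return done;
}
```

Calling toy_may_sleep() with TOY_SOFTINT would abort immediately, which
is the behaviour one would want from an assertion on the real path.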

> I would not want to run a watchdog that reboots the system unless the FP
> rate is well under once per year, and really under 0.2/year.  Having
> this logged instead of panicking would make it more comfortable to turn
> on.  Probably it should default to not panicking, if this turns into
> enough reports that it seems to have significantly non-zero probability.

So far the only reports I've seen have been true alarms about
something being broken.  Most of the problems that this will catch
would otherwise manifest as `NetBSD stopped responding and I wasn't
able to get a core dump' (leading to useless undiagnosable PRs), not
as `huh, I saw this weird detailed log message', so the diagnostic
value of the heartbeat panic in those circumstances is very high.
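
For the sake of discussion, the log-versus-panic choice being debated
could be pictured as a policy flag consulted when a heartbeat is found
to be overdue.  This is only a sketch under assumed names (hb_policy,
hb_check, HB_LOG, HB_PANIC); it is not the interface of the actual
heartbeat code, and the timeout value in the usage note is arbitrary.

```c
#include <stdbool.h>
#include <stdint.h>

/* What the checker decides to do (hypothetical names). */
enum hb_action { HB_OK, HB_LOG, HB_PANIC };

struct hb_policy {
	uint64_t max_period_ms;	/* longest tolerated gap between beats */
	bool panic_on_miss;	/* true: panic; false: just log */
};

/*
 * Decide what to do given the time of the last observed heartbeat and
 * the current time, both in milliseconds on the same monotonic clock.
 */
static enum hb_action
hb_check(const struct hb_policy *p, uint64_t last_beat_ms, uint64_t now_ms)
{
	/* A clock that appears to have gone backwards is not a missed beat. */
	if (now_ms < last_beat_ms)
		return HB_OK;
	if (now_ms - last_beat_ms <= p->max_period_ms)
		return HB_OK;
	return p->panic_on_miss ? HB_PANIC : HB_LOG;
}
```

With max_period_ms set to 15000, a gap of 1 s yields HB_OK, while a gap
of 20 s yields HB_PANIC or HB_LOG depending on panic_on_miss.  The
trade-off discussed above is that HB_LOG keeps the machine up but loses
the crash dump taken at the moment of the stall.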

Note that a hardware watchdog timer is a little bit different: it will
usually just reset the machine, giving no opportunity for diagnostics
like a crash dump.

> (Presumably atf runs on real hw survive HEARTBEAT though, so whatever is
> happening seems low probability to start with.)

Right.  My guess is that this may be related to problems that we've
been trying to diagnose regarding extreme delays at shutdown after
heavy disk I/O, which we need more information to figure out.
Possibly related to the yamt-pagecache merge, possibly related to the
zfs pagedaemon thrashing.

