Re: 10.99.7 panic: defibrillate

To: Thomas Klausner <wiz%NetBSD.org@localhost>
Subject: Re: 10.99.7 panic: defibrillate
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Sat, 12 Aug 2023 16:03:59 +0000

> Date: Sat, 12 Aug 2023 17:28:27 +0200
> From: Thomas Klausner <wiz%NetBSD.org@localhost>
> 
> I just got a new panic in 10.99.7 after running a pbulk for less than
> a day (after updating from 10.99.5, which was stable for weeks).
> ...
> vpanic() at netbsd:vpanic+0x173
> panic() at netbsd:panic+0x3c
> defibrillate() at netbsd:defibrillate+0xe3
> hardclock() at netbsd:hardclock+0x8b
> Xresume_lapic_ltimer() at netbsd:Xresume_lapic_ltimer+0x1e
> --- interrupt ---
> pmap_tlb_shootnow() at netbsd:pmap_tlb_shootnow+0x1f7
> ...

This panic means that one CPU has detected that another CPU has failed
to run either the hardclock interrupt handler or the SOFTINT_CLOCK
softints in over 15 seconds, and triggered an interprocessor interrupt
in an attempt to panic rather than stay stuck where it appears to be
stuck -- here, pmap_tlb_shootnow.

Normally the hardclock interrupt handler runs every 10ms (or 1/hz sec;
default hz=100), and softints run reasonably promptly, so failing to
do this for 15 sec is extremely unusual and likely indicates a CPU is
wedged and unable to make progress.  For example, something may be
stuck in an infinite loop with a spin lock held or spl raised, which
blocks interrupts.

(The HEARTBEAT option, this system where CPUs check one another for
progress, is new as of last month.  The problems it uncovers would
likely have manifested as a silent, unresponsive hang before.)

1. Did you notice anything sluggish before the crash?

2. Can you start another bulk build and run the following dtrace
   script for a while and share the final output?

dtrace -x cleanrate=50hz -n '
        fbt::pmap_tlb_shootnow:entry,
        fbt::uvm_pagermapout:entry {
                self->starttime[probefunc] = timestamp
        }
        fbt::pmap_tlb_shootnow:return,
        fbt::uvm_pagermapout:return /self->starttime[probefunc]/ {
                @[probefunc] = quantize(timestamp -
                    self->starttime[probefunc]);
                self->starttime[probefunc] = 0
        }
        tick-60s {
                printa(@)
        }
'

You may need to modload dtrace_fbt and dtrace_profile first.  The
tick-60s probe will print the current state of data collection once a
minute, showing a histogram of the time spent in the functions
pmap_tlb_shootnow and uvm_pagermapout.

If it says something like

dtrace: 429 dynamic variable drops with non-empty dirty list

then just hit ^C and save the last output.

> Sorry, no crash dump available.

3. Do you just not have a dump device, or are crash dumps broken
   altogether?  Can you test with sysctl debug.crashme?  (sysctl -w
   debug.crashme_enable=1, sysctl -w debug.crashme.panic=1)
