Current-Users archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: 10.99.7 panic: defibrillate



On Sat, Aug 12, 2023 at 04:03:59PM +0000, Taylor R Campbell wrote:
> This panic means that one CPU has detected that another CPU has failed
> to run either the hardclock interrupt handler or the SOFTINT_CLOCK
> softints in over 15 seconds, and triggered an interprocessor interrupt
> in an attempt to panic rather than stay stuck where it appears to be
> stuck -- here, pmap_tlb_shootnow.
> 
> Normally the hardclock interrupt handler runs every 10ms (or 1/hz sec;
> default hz=100), and softints run reasonably promptly, so failing to
> do this for 15 sec is extremely unusual and likely indicates a CPU is
> wedged and unable to make progress.  For example, something may be
> stuck in an infinite loop with a spin lock held or spl raised, which
> blocks interrupts.
> 
> (The HEARTBEAT option, this system where CPUs check one another for
> progress, is new as of last month.  The problems it uncovers would
> likely have manifested as silent unresponsive hang before.)
> 
> 1. Did you notice anything sluggish before the crash?

I was active on the machine in a remote terminal and I didn't notice
anything in particular. The machine has 32 (virtual) cores so I
probably wouldn't notice one stuck - except that I would notice if
pbulk never finished because one was stuck.

I know that this machine is sometimes sluggish when all three parallel
pbulk clients want to interact with the disk (e.g. libreoffice and
rust unpacking at the same time, or something similar).

> 2. Can you start another bulk build and run the following dtrace
>    script for a while and share the final output?
> 
> dtrace -x cleanrate=50hz -n '
>         fbt::pmap_tlb_shootnow:entry,
>         fbt::uvm_pagermapout:entry {
>                 self->starttime[probefunc] = timestamp
>         }
>         fbt::pmap_tlb_shootnow:return,
>         fbt::uvm_pagermapout:return /self->starttime[probefunc]/ {
>                 @[probefunc] = quantize(timestamp -
>                     self->starttime[probefunc]);
>                 self->starttime[probefunc] = 0
>         }
>         tick-60s {
>                 printa(@)
>         }
> '
> 
> You may need to modload dtrace_fbt and dtrace_profile first.  The
> tick-60s probe will print the current state of data collection once a
> minute, showing a histogram of the time spent in the functions
> pmap_tlb_shootnow and uvm_pagermapout.
> 
> If it says something like
> 
> dtrace: 429 dynamic variable drops with non-empty dirty list
> 
> then just hit ^C and save the last output.

I left it running for a bit and this is what it said:

# ./dtrace-script-20230812.sh
dtrace: description '
        fbt::pmap_tlb_shootnow:entry,
        fbt::uvm_pagermapout:entry ' matched 5 probes
dtrace: aggregation size lowered to 2m
dtrace: 312026 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 22746 dynamic variable drops with non-empty rinsing list
: Operation timed out
dtrace: 255016 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 341982 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 273456 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 11589 dynamic variable drops with non-empty rinsing list
: Operation timed out
dtrace: 303313 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 312509 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 345693 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 16682 dynamic variable drops with non-empty rinsing list
: Operation timed out
dtrace: 333801 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 327437 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 412853 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 539988 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 471254 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 501274 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 475914 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 501722 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 400591 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 370924 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 395296 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 276777 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 255151 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 300495 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 263274 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 390073 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 316773 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 282110 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 371249 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 40912 dynamic variable drops with non-empty rinsing list
: Operation timed out
dtrace: 289925 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 267312 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 293607 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 259538 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 290060 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 313933 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 10325 dynamic variable drops with non-empty rinsing list
: Operation timed out
dtrace: 274132 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 297795 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 217786 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 272377 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 290762 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 427197 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 470297 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 434802 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 462612 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 475668 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 412965 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 461865 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 376524 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 26804 dynamic variable drops with non-empty rinsing list
: Operation timed out
dtrace: 425786 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 396083 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 396933 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 399609 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 379957 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 469777 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 406434 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 416873 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 432938 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 463341 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 449017 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 501371 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 442401 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 459855 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 422771 dynamic variable drops with non-empty dirty list
: Operation timed out
CPU     ID                    FUNCTION:NAME
  6  58273                        :tick-60s
  uvm_pagermapout
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |                                         1
            4096 |@@@@@@@@@@@@                             2841
            8192 |@@@@@@@@@@@@@@@@@@@@@@@@                 5659
           16384 |@@@                                      750
           32768 |                                         48
           65536 |                                         9
          131072 |                                         0

  pmap_tlb_shootnow
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@@@@@@@@@@@@@                   3770128
            2048 |@@@@@@@@@@@@@@@@                         2846514
            4096 |@@                                       316604
            8192 |                                         16533
           16384 |                                         2766
           32768 |                                         621
           65536 |                                         261
          131072 |                                         26
          262144 |                                         1
          524288 |                                         14
         1048576 |                                         3
         2097152 |                                         0


dtrace: 466696 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 452047 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 415726 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 433259 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 710584 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 600722 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 696242 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 751837 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 438584 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 420030 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 418731 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 380829 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 16487 dynamic variable drops with non-empty rinsing list
: Operation timed out
dtrace: 370819 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 404504 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 450560 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 446036 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 376850 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 376100 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 384559 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 352970 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 11701 dynamic variable drops with non-empty rinsing list
: Operation timed out
dtrace: 361743 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 389721 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 322743 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 377573 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 386278 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 371971 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 378625 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 378767 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 368735 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 369729 dynamic variable drops with non-empty dirty list
: Operation timed out
dtrace: 432366 dynamic variable drops with non-empty dirty list
: Operation timed out
^C
dtrace: 518343 dynamic variable drops with non-empty dirty list
: Operation timed out

> > Sorry, no crash dump available.
> 
> 3. Do you just not have a dump device, or are crash dumps broken
>    altogether?  Can you test with sysctl debug.crashme?  (sysctl -w
>    debug.crashme_enable=1, sysctl -w debug.crashme.panic=1)

I tried 'reboot 0x104' (or 0x108, I'm not sure any longer) and it
started the dump but didn't start counting down, so I reset the
machine.
 Thomas


Home | Main Index | Thread Index | Old Index