kern/59886: crash dumps are terrible

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: kern/59886: crash dumps are terrible
From: campbell+netbsd%mumble.net@localhost
Date: Sat, 3 Jan 2026 23:00:01 +0000 (UTC)

>Number:         59886
>Category:       kern
>Synopsis:       crash dumps are terrible
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Jan 03 23:00:00 +0000 2026
>Originator:     Taylor R Campbell
>Release:        current, 11, 10, 9, ...
>Organization:
The Core Don't, Inc.
>Environment:
>Description:

	NetBSD crash dumps are terrible.  Here is a litany of issues
	that should all be fixed:

	1. savecore(8) depends on the _running_ kernel's configuration
	   to find out what the dumpdev is.

	   It should be able to save a core from any specified dumpdev
	   without asking the running kernel so you can:
	   (a) boot a broken kernel,
	   (b) crash,
	   (c) reboot into a working kernel,
	   (d) savecore from the broken kernel,
	   even if the working and broken kernels have different
	   configurations, default dumpdevs, and so on.

	2. The kernel core dump format is undocumented and apparently
	   unreliable, because it often fails in mysterious ways like:

[running /etc/rc.d/savecore]
Checking for core dump...
savecore: msgbuf magic incorrect (706050403020100 != 63061)
savecore: reboot after panic: kernel diagnostic assertion "uvmexp.swpgonly > 0" failed: file "/zfs/source/src/sys/uvm/uvm_anon.c", line 175
savecore: system went down at Sat Jan  3 23:15:44 2026

savecore: writing compressed core to /var/crash/netbsd.2.core.gz
 8086 M
...
  540 K
savecore: writing compressed kernel to /var/crash/netbsd.2.gz
savecore: kvm_read ksyms: _kvm_kvatop(ffffc68022f77000)
savecore: (null): Bad address
/etc/rc.d/savecore exited with code 1

	3. The kernel doesn't compress memory as it dumps, so dumping
	   is very slow and requires an unreasonably large dumpdev to
	   work.

	4. If dumping core doesn't work, the only fallback is to hope
	   that the panic and stack trace are preserved in dmesg on
	   reboot, which often isn't the case -- especially if the
	   system hangs and it is forcibly powered _off_ before the
	   operator powers it back on.  The kernel should be able to
	   take advantage of things like UEFI storage or ACPI APEI ERST
	   storage to store diagnostic information about the crash
	   dump.

	5. Preserving dmesg on reboot also often doesn't work if the
	   previous and current kernels are different and have
	   different parameters; presumably the message buffer is not
	   adequately marked in memory.
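
	For item 2, the two numbers in the msgbuf complaint can be
	decoded by hand.  A minimal sketch, assuming both values are
	printed in hexadecimal (63061 would then match MSG_MAGIC,
	0x063061, in sys/msgbuf.h) and that the machine is
	little-endian; the helper names are made up for illustration:

```python
# Values copied from the savecore message, interpreted as hex (assumption).
OBSERVED_MAGIC = 0x0706050403020100
EXPECTED_MAGIC = 0x63061


def in_memory_bytes(value: int, width: int = 8) -> bytes:
    """Bytes of `value` as stored by a little-endian machine (assumed)."""
    return value.to_bytes(width, "little")


def is_incrementing(data: bytes) -> bool:
    """True if each byte is exactly one more than the byte before it."""
    return all(b == data[0] + i for i, b in enumerate(data))


observed = in_memory_bytes(OBSERVED_MAGIC)
print(observed.hex(" "))          # prints: 00 01 02 03 04 05 06 07
print(is_incrementing(observed))  # prints: True
```

	If that reading is right, savecore found an incrementing byte
	pattern where it expected the message buffer header, which
	would suggest it was looking at the wrong memory rather than
	at a slightly damaged msgbuf.  It does not say why.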

>How-To-Repeat:

	watch users struggle to get diagnostics out of crashes for PRs

>Fix:

	Yes, please!
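
	As an illustration of what item 3 asks for, compressing while
	dumping can be sketched in userland as below.  This is not
	kernel code and not an existing NetBSD interface:
	dump_compressed, PAGE_SIZE, and the write callback are
	hypothetical names, and zlib merely stands in for whatever
	compressor a kernel would actually use.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size, for illustration only


def dump_compressed(pages, write):
    """Compress pages one at a time, handing output to write() as it appears.

    Only one page plus the compressor's internal state is held at a time,
    so the full memory image is never buffered.  Returns (bytes_in, bytes_out).
    """
    comp = zlib.compressobj(1)  # level 1: favour speed over ratio
    bytes_in = bytes_out = 0
    for page in pages:
        bytes_in += len(page)
        chunk = comp.compress(page)  # may be empty while zlib buffers input
        if chunk:
            write(chunk)
            bytes_out += len(chunk)
    tail = comp.flush()  # emit whatever the compressor is still holding
    write(tail)
    bytes_out += len(tail)
    return bytes_in, bytes_out


# Usage: "dump" 1 MiB of zero-filled pages into an in-memory sink.
sink = bytearray()
n_in, n_out = dump_compressed((bytes(PAGE_SIZE) for _ in range(256)), sink.extend)
print(n_in, n_out)
```

	The point of the shape is that the space needed on the dumpdev
	follows the compressed size rather than the size of RAM; how
	well real kernel memory compresses is a separate question.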
