tech-net archive
Diagnosing npf crashes [was Re: Looking to address two networking issues with NetBSD 10]
> Date: Fri, 18 Oct 2024 21:51:53 +0000 (UTC)
> From: John Klos <john%klos.com@localhost>
>
> Another is that I can reliably panic or lock up aarch64 and amd64 machines
> that run npf while routing a /24 with the most trivial configuration. This
> was discussed here:
>
> https://mail-index.netbsd.org/tech-net/2023/10/12/msg008636.html
The particular bug chronicled here was fixed (PR kern/57208,
https://gnats.NetBSD.org/57208), so let's set this one aside to avoid
confusion.
The relevant symptom is the fault early in stage_mem_gc or thmap_del,
reflecting a null pointer dereference when kmem_intr_alloc fails; now
thmap(9) preallocates this memory so there is no chance of failure
here. You can dispense with all the logs that involve this, such as
<https://www.klos.com/~john/panics/1.txt>.
> While the issue initially was happening with a Raspberry Pi 4, I moved to
> an amd64 system, first with motherboard re0, but I wanted to make sure
> there were no issues related to this:
>
> https://mail-index.netbsd.org/tech-kern/2024/01/27/msg029463.html
> [...]
> With LOCKDEBUG, on amd64:
>
> https://www.klos.com/~john/panics/2.txt
> https://www.klos.com/~john/panics/3.txt
> https://www.klos.com/~john/panics/4.txt
>
> https://www.klos.com/~john/panics/5.txt
> https://www.klos.com/~john/panics/6.txt
> https://www.klos.com/~john/panics/7.txt
The common theme in all these is a giant-locked interrupt handler that
does bus_space_read_2:
    bus_space_read_2() at netbsd:bus_space_read_2+0xb
    intr_biglock_wrapper() at netbsd:intr_biglock_wrapper+0x37
It is curious that there is no stack frame between these two, such as
re_intr, which would be a plausible intermediary since it calls
bus_space_read_2. Perhaps gdb can find a more detailed stack trace,
either identifying the intermediate frame or showing the arguments to
intr_biglock_wrapper, which will tell you what interrupt handler it's
calling.
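A minimal sketch of how such a gdb session might be scripted, assuming
a crash dump was saved by savecore(8) and has already been
uncompressed, and that a kernel with debug symbols (netbsd.gdb) from
the same build is at hand. The file names and the helper function
write_gdb_script are hypothetical; `target kvm' is the gdb target
NetBSD uses for kernel core files, and how much detail `bt full' shows
depends on how the kernel was compiled.

```sh
#!/bin/sh
# Hypothetical helper: write a gdb command file that opens a kernel
# crash dump and prints a backtrace including local variables.
# $1 = path to the uncompressed core (netbsd.0.core is an assumed name)
# $2 = path of the command file to create
write_gdb_script() {
	cat > "$2" <<EOF
target kvm $1
bt full
EOF
}

write_gdb_script netbsd.0.core bt.gdb
# Then, on the NetBSD machine (not run by this script):
#   gdb -batch -x bt.gdb netbsd.gdb
```

If the backtrace shows a frame between intr_biglock_wrapper and
bus_space_read_2, that frame names the driver's interrupt handler.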
> Also LOCKDEBUG:
>
> https://www.klos.com/~john/panics/8.txt
What is the difference in configuration between this one and 2-7? It
does not obviously share the same theme as 2-7.
> After running tcpdump:
>
> https://www.klos.com/~john/panics/9.txt
Appears to be the same issue as 2-7.
> After switching to wm0:
>
> https://www.klos.com/~john/panics/10.txt
> https://www.klos.com/~john/panics/11.txt
> https://www.klos.com/~john/panics/12.txt
These appear to be spin-waiting for a mutex, whose owner must be
running on another CPU -- there's no sleepq_block in these stack
traces. The difference is probably just that re(4) is giant-locked
while wm(4) is not -- there's still some CPU that's spinning without
sleeping or releasing a lock.
The mutex appears to be mb_cache->pc_pool.pr_lock. Since it's a spin
lock, we can't find out who owns it, but it must be one of the other
CPUs in the system so there aren't too many options to try.
> https://www.klos.com/~john/panics/13.txt
This one is unclear. If it happens again, I would be curious to see
if you get the same stack trace twice by doing `continue' at the ddb
prompt and then entering ddb again.
> After this, I set npf=NO and haven't had any issues since.
>
> What can we do to address this? I've offered to make the machine available
> via serial console when it's in the frozen state, because I'm not sure
> what else I should do.
1. If you have a crash dump you could try getting a stack trace in
   gdb. If the stack trace is more detailed, that might tell you what
   interrupt handler is being called by intr_biglock_wrapper.

2. Next time this happens, run the following commands in ddb and save
   the output:

       ps
       ps/w
       show all tstiles
       show event

   And, for each CPU number N in 0 1 2 3 ..., do:

       mach cpu N
       bt

3. Try a current kernel, which might trigger a heartbeat panic with
   different diagnostics that might help narrow it down.

4. Try disabling bpfjit by doing `sysctl -w net.bpf.jit=0' before
   loading any npf config (or put it in /etc/sysctl.conf).
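
For step 2, a small sketch that prints the whole ddb command sequence
for a given number of CPUs, so it can be pasted into the serial
console in one go. The function name ddb_commands is hypothetical, and
it assumes the CPUs are numbered contiguously from 0; run it on any
machine with a POSIX shell, not in ddb itself.

```sh
#!/bin/sh
# Print the ddb commands from step 2, one per line.
# $1 = number of CPUs (assumed to be numbered 0..N-1)
ddb_commands() {
	printf '%s\n' 'ps' 'ps/w' 'show all tstiles' 'show event'
	n=0
	while [ "$n" -lt "$1" ]; do
		# Switch to CPU n, then take its backtrace.
		printf 'mach cpu %d\nbt\n' "$n"
		n=$((n + 1))
	done
}

ddb_commands 4
```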