Diagnosing npf crashes [was Re: Looking to address two networking issues with NetBSD 10]

To: John Klos <john%klos.com@localhost>
Subject: Diagnosing npf crashes [was Re: Looking to address two networking issues with NetBSD 10]
From: Taylor R Campbell <campbell+netbsd-tech-net%mumble.net@localhost>
Date: Fri, 18 Oct 2024 22:44:00 +0000

> Date: Fri, 18 Oct 2024 21:51:53 +0000 (UTC)
> From: John Klos <john%klos.com@localhost>
> 
> Another is that I can reliably panic or lock up aarch64 and amd64 machines 
> that run npf while routing a /24 with the most trivial configuration. This 
> was discussed here:
> 
> https://mail-index.netbsd.org/tech-net/2023/10/12/msg008636.html

The particular bug chronicled here was fixed (PR kern/57208,
https://gnats.NetBSD.org/57208), so let's set this one aside to avoid
confusion.

The relevant symptom is the fault early in stage_mem_gc or thmap_del,
reflecting a null pointer dereference when kmem_intr_alloc fails; now
thmap(9) preallocates this memory so there is no chance of failure
here.  You can dispense with all the logs that involve this, such as
<https://www.klos.com/~john/panics/1.txt>.

> While the issue initially was happening with a Raspberry Pi 4, I moved to 
> an amd64 system, first with motherboard re0, but I wanted to make sure 
> there were no issues related to this:
> 
> https://mail-index.netbsd.org/tech-kern/2024/01/27/msg029463.html
> [...]
> With LOCKDEBUG, on amd64:
> 
> https://www.klos.com/~john/panics/2.txt
> https://www.klos.com/~john/panics/3.txt
> https://www.klos.com/~john/panics/4.txt
> 
> https://www.klos.com/~john/panics/5.txt
> https://www.klos.com/~john/panics/6.txt
> https://www.klos.com/~john/panics/7.txt

The common theme in all these is a giant-locked interrupt handler that
does bus_space_read_2:

bus_space_read_2() at netbsd:bus_space_read_2+0xb
intr_biglock_wrapper() at netbsd:intr_biglock_wrapper+0x37

It is curious that there is no stack frame between these two, such as
re_intr, which is a plausible intermediary (for example, it calls
bus_space_read_2).  Perhaps gdb can find a more detailed stack trace,
either identifying the intermediate frame or showing the arguments to
intr_biglock_wrapper which will tell you what interrupt handler it's
calling.
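
For the gdb route, here is a minimal sketch of what I have in mind,
assuming savecore(8) has left a core file that you have already
uncompressed, and that you have a kernel image with debug symbols
matching the kernel that crashed.  The helper name kernel_bt and the
file names in the usage comment are placeholders of mine, not anything
from your setup.

```shell
#!/bin/sh
# Hypothetical helper: print a backtrace from a NetBSD kernel crash dump.
#   $1 = kernel image with debug symbols (assumed to match the crashed kernel)
#   $2 = uncompressed core file saved by savecore(8)
kernel_bt() {
    # "target kvm" tells NetBSD's gdb to read the dump through libkvm;
    # "bt" then prints the stack of the thread that was running at the crash.
    gdb -batch \
        -ex "target kvm $2" \
        -ex 'bt' \
        "$1"
}

# Example use (file names are assumptions):
#   kernel_bt netbsd.gdb netbsd.0.core
```

If the intr_biglock_wrapper frame shows up in that backtrace, running
gdb interactively and selecting the frame with `frame N' followed by
`info args' may show which handler it was about to call.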

> Also LOCKDEBUG:
> 
> https://www.klos.com/~john/panics/8.txt

What is the difference in configuration between this one and 2-7?  It
does not obviously share the theme of 2-7.

> After running tcpdump:
> 
> https://www.klos.com/~john/panics/9.txt

Appears to be the same issue as 2-7.

> After switching to wm0:
> 
> https://www.klos.com/~john/panics/10.txt
> https://www.klos.com/~john/panics/11.txt
> https://www.klos.com/~john/panics/12.txt

These appear to be spin-waiting for a mutex, whose owner must be
running on another CPU -- there's no sleepq_block in these stack
traces.  The difference is probably just that re(4) is giant-locked
while wm(4) is not -- there's still some CPU that's spinning without
sleeping or releasing a lock.

The mutex appears to be mb_cache->pc_pool.pr_lock.  Since it's a spin
lock, we can't find out who owns it, but it must be one of the other
CPUs in the system so there aren't too many options to try.

> https://www.klos.com/~john/panics/13.txt

This one is unclear.  If it happens again, I would be curious to see
if you get the same stack trace twice by doing `continue' at the ddb
prompt and then entering ddb again.

> After this, I set npf=NO and haven't had any issues since.
> 
> What can we do to address this? I've offered to make the machine available 
> via serial console when it's in the frozen state, because I'm not sure 
> what else I should do.

1. If you have a crash dump you could try getting a stack trace in
   gdb.  If the stack trace is more detailed, that might tell you what
   interrupt handler is being called by intr_biglock_wrapper.

2. Next time this happens, run the following commands in ddb and save
   the output:

   ps
   ps/w
   show all tstiles
   show event

   And, for each CPU number N in 0 1 2 3 ..., do:

   mach cpu N
   bt

3. Try a current kernel, which might trigger a heartbeat panic with
   different diagnostics that might help narrow it down.

4. Try disabling bpfjit by doing `sysctl -w net.bpf.jit=0' before
   loading any npf config (or put it in /etc/sysctl.conf).
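
To make step 2 less error-prone on a serial console, something like
the following sketch could print the whole command list up front so
it can be pasted into ddb line by line.  The function name
ddb_commands and the default of 4 CPUs are my own assumptions; pass
the real CPU count as the first argument.

```shell
#!/bin/sh
# Print the ddb commands from step 2 for a machine with a given number
# of CPUs.  This only generates text; it does not talk to ddb itself.
ddb_commands() {
    ncpu=$1
    # Commands that are run once.
    printf '%s\n' 'ps' 'ps/w' 'show all tstiles' 'show event'
    # Switch to each CPU in turn and take a backtrace there.
    n=0
    while [ "$n" -lt "$ncpu" ]; do
        printf 'mach cpu %s\nbt\n' "$n"
        n=$((n + 1))
    done
}

# Default to 4 CPUs (an assumption) when no argument is given.
ddb_commands "${1:-4}"
```

Saving the output to a file beforehand means the list is ready when
the machine next drops into ddb.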

Follow-Ups:
- Re: Diagnosing npf crashes [was Re: Looking to address two networking issues with NetBSD 10]
  - From: Taylor R Campbell

References:
- Looking to address two networking issues with NetBSD 10
  - From: John Klos
