NetBSD-Bugs archive


kern/57181: LOCKDEBUG panic with npf



>Number:         57181
>Category:       kern
>Synopsis:       LOCKDEBUG panic with npf
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Jan 11 00:55:00 +0000 2023
>Originator:     Brad Spencer
>Release:        NetBSD 10.0_BETA
>Organization:
	eldar.org
>Environment:
	
	
System: NetBSD testcurrent.nat.eldar.org 10.0_BETA NetBSD 10.0_BETA (GENERIC_LOCKDEBUG) #0: Tue Jan 10 16:55:28 EST 2023  brad%samwise.nat.eldar.org@localhost:/usr/src/sys/arch/amd64/compile/GENERIC_LOCKDEBUG amd64
Architecture: x86_64
Machine: amd64
>Description:

LOCKDEBUG is compiled into a 10.0_BETA kernel built from sources
pulled on 2023-01-10, running on a PVH DomU with a single processor
and 8GB of memory.  Neither of those details is likely very
important.

The following panic can be triggered fairly easily with a little
setup:

[ 118.6825506] Mutex error: rw_vector_enter,309: spin lock held

[ 118.6825506] lock address : ffff8f3f683a75b8
[ 118.6825506] type         : spin
[ 118.6825506] initialized  : netbsd:npf_table_create+0x99
[ 118.6825506] shared holds :                  0 exclusive:                  1
[ 118.6825506] shares wanted:                  0 exclusive:                  0
[ 118.6825506] relevant cpu :                  0 last held:                  0
[ 118.6825506] relevant lwp : 0xffff8f3f672a6300 last held: 0xffff8f3f672a6300
[ 118.6825506] last locked* : netbsd:npf_table_list+0x34
[ 118.6825506] unlocked     : netbsd:npf_table_list+0x62
[ 118.6825506] owner field  : 0x0000000000010600 wait/spin:                0/1

[ 118.6825506] panic: LOCKDEBUG: Mutex error: rw_vector_enter,309: spin lock held
[ 118.6825506] cpu0: Begin traceback...
[ 118.6825506] vpanic() at netbsd:vpanic+0x183
[ 118.6825506] panic() at netbsd:panic+0x3c
[ 118.6825506] lockdebug_abort1() at netbsd:lockdebug_abort1+0xe6
[ 118.6825506] rw_enter() at netbsd:rw_enter+0x43b
[ 118.6825506] uvm_fault_internal() at netbsd:uvm_fault_internal+0x111
[ 118.6825506] trap() at netbsd:trap+0x47d
[ 118.6825506] --- trap (number 6) ---
[ 118.6825506] copyout() at netbsd:copyout+0x33
[ 118.6825506] npf_table_list() at netbsd:npf_table_list+0x57
[ 118.6825506] npfctl_table() at netbsd:npfctl_table+0xf7
[ 118.6825506] cdev_ioctl() at netbsd:cdev_ioctl+0x99
[ 118.6825506] spec_ioctl() at netbsd:spec_ioctl+0x58
[ 118.6825506] VOP_IOCTL() at netbsd:VOP_IOCTL+0x47
[ 118.6825506] vn_ioctl() at netbsd:vn_ioctl+0xaf
[ 118.6825506] sys_ioctl() at netbsd:sys_ioctl+0x56d
[ 118.6825506] syscall() at netbsd:syscall+0x196
[ 118.6825506] --- syscall (number 54) ---
[ 118.6825506] netbsd:syscall+0x196:
[ 118.6825506] cpu0: End traceback...
[ 118.6825506] fatal breakpoint trap in supervisor mode
[ 118.6825506] trap type 1 code 0 rip 0xffffffff80235315 cs 0x8 rflags 0x202 cr2 0x724c29d9e180 ilevel 0x8 rsp 0xffffdb8240a7b5e0
[ 118.6825506] curlwp 0xffff8f3f672a6300 pid 1195.1195 lowest kstack 0xffffdb8240a772c0

>How-To-Repeat:

Given a /etc/npf.conf file that contains this:

table <blocklist> type ipset

procedure "log" {
          log: npflog0
}

group default {
      pass in all
      pass out all
}
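
For completeness, NPF has to be running with the above configuration
before the table can be loaded.  Exactly how it is started should not
matter for the bug; something along the following lines works,
assuming the configuration lives in /etc/npf.conf:

# Load /etc/npf.conf into the kernel and turn packet filtering on.
npfctl reload
npfctl start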


Given a shell script that does this:

#!/bin/sh

for a in `cat /etc/blocklist`
do
    /sbin/npfctl table blocklist add $a > /dev/null 2>&1
done

The file /etc/blocklist contains a list of IP addresses.  The exact
number probably does not matter much, but it needs to be large enough
that the script runs for a while (depending on how you run the test).
I typically use one with almost 200,000 addresses; I have also used
one with 1,000 addresses, and the panic happens with that amount as
well.  However, leaving the table empty or putting just one address
in it did not trip the panic.
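
If you do not have a blocklist handy, a synthetic one works too.
This is just one way to generate roughly 1,000 test addresses; the
10.0.0.0/8 addresses and the loop bounds are arbitrary choices of
mine, not what I actually used:

#!/bin/sh

# Write 1000 synthetic addresses (10.0.1.1 .. 10.0.4.250) to
# /etc/blocklist, one per line.
for b in 1 2 3 4
do
    c=1
    while [ $c -le 250 ]
    do
        echo "10.0.$b.$c"
        c=$((c + 1))
    done
done > /etc/blocklist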

Set up the system as described above and run the script to load the
table.  The script is needed because NPF has a problem loading a
table through npfctl when a large number of addresses is present, so
that part is a cheat to work around the problem of large tables and
npfctl.  Note that the cheat is not needed if you use 1,000
addresses.

Now, either while the load has been running for a bit or after it
has finished loading, do the following:

npfctl table blocklist list | wc

The system will panic with the LOCKDEBUG panic shown above.
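
If it helps, the two steps can be combined into one small script so
that the list runs while the load is still in progress; this is just
a sketch, with load-blocklist.sh standing in for whatever name you
gave the loader script above:

#!/bin/sh

# Start the table load in the background, give it a moment, then
# list the table while addresses are still being added.
sh ./load-blocklist.sh &
sleep 10
npfctl table blocklist list | wc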

(I will also mention that if I load the 1,000 addresses via
/etc/npf.conf, break into ddb after the system is up, and do a
"show locks", nothing unusual is shown.  So something is managing to
hold the lock initialized in npf_table_create even though the table
is already there, and only when the npfctl list operation is
performed.)

>Fix:

I am VERY hopeful that someone can see what the fix is.  I also
suspect that this is the root cause of kern/57136, but that is a
guess on my part.  Further, since NPF is supposed to be the firewall
of choice now, this should probably be looked into.


