kern/56170: NFS-related: panic: lock error: Mutex: mutex_vector_enter,543: locking against myself

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: kern/56170: NFS-related: panic: lock error: Mutex: mutex_vector_enter,543: locking against myself
From: "Greg A. Woods" <woods%planix.ca@localhost>
Date: Fri, 14 May 2021 20:45:01 +0000 (UTC)

>Number:         56170
>Category:       kern
>Synopsis:       NFS+gcc-ASAN-related: panic: lock error: Mutex: mutex_vector_enter,543: locking against myself
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri May 14 20:45:00 +0000 2021
>Originator:     Greg A. Woods
>Release:        NetBSD 9.99.81
>Organization:
Planix, Inc.; Kelowna, BC; Canada
>Environment:
System: NetBSD xentastic 9.99.81 NetBSD 9.99.81 (XEN3_DOM0) #16: Thu May 6 13:40:07 PDT 2021 woods@xentastic:/build/woods/xentastic/current-amd64-amd64-obj/build/src/sys/arch/amd64/compile/XEN3_DOM0 amd64
Architecture: x86_64
Machine: amd64
>Description:

	I've been trying out the GCC sanitizers on one of my recent
	favourite little projects, and I've found I can reliably crash
	NetBSD with one of its tests when it is compiled with
	USE_ASAN=yes, at least when it is run with $PWD on an NFS mount.

	Here is the console output from an example crash:


[ 663.0426878] Mutex error: mutex_vector_enter,543: locking against myself

[ 663.0426878] lock address : 0xffffc8800b962b00
[ 663.0426878] current cpu  :                  1
[ 663.0426878] current lwp  : 0xffffc8800b9db1c0
[ 663.0426878] owner field  : 0xffffc8800b9db1c0 wait/spin:                0/0

[ 663.0426878] panic: lock error: Mutex: mutex_vector_enter,543: locking against myself: lock 0xffffc8800b00b9db1c0
[ 663.0426878] cpu1: Begin traceback...
[ 663.0426878] vpanic() at netbsd:vpanic+0x14a
[ 663.0426878] snprintf() at netbsd:snprintf
[ 663.0426878] lockdebug_abort() at netbsd:lockdebug_abort+0xcd
[ 663.0426878] mutex_vector_enter() at netbsd:mutex_vector_enter+0x406
[ 663.0426878] sigpending1() at netbsd:sigpending1+0x24
[ 663.0527222] nfs_sigintr() at netbsd:nfs_sigintr+0x2c
[ 663.0527222] nfs_rcvlock() at netbsd:nfs_rcvlock+0xaf
[ 663.0527222] nfs_request() at netbsd:nfs_request+0x40d
[ 663.0527222] nfs_access() at netbsd:nfs_access+0x1d4
[ 663.0527222] VOP_ACCESS() at netbsd:VOP_ACCESS+0x55
[ 663.0527222] getcwd_common() at netbsd:getcwd_common+0x251
[ 663.0527222] vnode_to_path() at netbsd:vnode_to_path+0xbb
[ 663.0527222] sysctl_vmproc() at netbsd:sysctl_vmproc+0x6cd
[ 663.0527222] sysctl_dispatch() at netbsd:sysctl_dispatch+0xa5
[ 663.0527222] sys___sysctl() at netbsd:sys___sysctl+0xc5
[ 663.0527222] syscall() at netbsd:syscall+0x9c
[ 663.0527222] --- syscall (number 202) ---
[ 663.0527222] netbsd:syscall+0x9c:
[ 663.0527222] cpu1: End traceback...
[ 663.0527222] fatal breakpoint trap in supervisor mode
[ 663.0527222] trap type 1 code 0 rip 0xffffffff8023e93d cs 0xe030 rflags 0x202 cr2 0x7f7ff6892ce0 ilevel 

[ 663.0527222] curlwp 0xffffc8800b9db1c0 pid 6987.6987 lowest kstack 0xffffc880ef49a2c0
Stopped in pid 6987.6987 (yajl_test) at netbsd:breakpoint+0x5:  leave
ds          e650
es          e600
fs          e640
gs          10
rdi         0
rsi         1
rbp         ffffc880ef49e640
rbx         ffffffff80ed2f50    mutex_adaptive_lockops
rdx         2
rcx         0
rax         0
r8          ffffffff80ed2f50    mutex_adaptive_lockops
r9          1
r10         0
r11         fffffffe
r12         104
r13         ffffffff80d43960    ostype+0xa6448
r14         ffffc880ef49e688
r15         ffffffff80d3c46b    ostype+0x9ef53
rip         ffffffff8023e93d    breakpoint+0x5
cs          e030
rflags      202
rsp         ffffc880ef49e640
ss          e02b
netbsd:breakpoint+0x5:  leave
db{1}> (XEN) [2021-05-14 18:09:45.682] Watchdog timer fired for domain 0
(XEN) [2021-05-14 18:09:45.682] Hardware Dom0 shutdown: watchdog rebooting machine

	(I guess ddb.onpanic=1 and the Xen watchdog aren't very useful
	together!)


>How-To-Repeat:

	I don't yet have an isolated example test, but running the
	regression tests in my robohack/yajl project, and in particular
	the "ap_eof_str" test, with USE_ASAN=yes and with the source and
	build on an NFS mount, has reliably reproduced this crash for me.
	(I'm only guessing that NFS matters, because of the nfs_*()
	calls in the kernel stack backtrace.)

	$ cd /some/NFS/mountpoint
	$ git clone https://github.com/robohack/yajl
	$ cd yajl
	$ mkdir build
	$ MAKEOBJDIRPREFIX=$(/bin/pwd)/build make regress USE_ASAN=yes MKDOC=no

	If I understand correctly, the system call involved here is
	sysctl(2), and it has something to do with proc too, but
	I'm quite unfamiliar with ASAN runtime internals so I don't know
	what it's doing to cause this, especially since a couple of
	other tests have already run by the time this one crashes.  I do
	know that ASAN will check to make sure ASLR is not enabled, and
	that it will also mmap() something somewhere really high up,
	which fails unless you do "ulimit -v unlimited" first.

	If necessary I can try in a domU, or disable the Xen watchdog
	for the dom0 (as otherwise I only have 20 seconds before the
	reboot!), then reproduce the crash and do more DDB digging if
	someone can guide me along.  I can also change what's in
	ddb.commandonenter...

>Fix:

>Unformatted:
 		2021-03-10T23:08:13Z

Follow-Ups:
- Re: kern/56170: NFS-related: panic: lock error: Mutex: mutex_vector_enter,543: locking against myself
  - From: Christos Zoulas
