NetBSD-Bugs archive


Re: kern/56170: NFS-related: panic: lock error: Mutex: mutex_vector_enter,543: locking against myself



See https://www.netbsd.org/~christos/nfs.diff for a disgusting hack I am using to avoid this.

christos

> On May 14, 2021, at 4:45 PM, Greg A. Woods <woods%planix.ca@localhost> wrote:
> 
>> Number:         56170
>> Category:       kern
>> Synopsis:       NFS+gcc-ASAN-related: panic: lock error: Mutex: mutex_vector_enter,543: locking against myself
>> Confidential:   no
>> Severity:       serious
>> Priority:       medium
>> Responsible:    kern-bug-people
>> State:          open
>> Class:          sw-bug
>> Submitter-Id:   net
>> Arrival-Date:   Fri May 14 20:45:00 +0000 2021
>> Originator:     Greg A. Woods
>> Release:        NetBSD 9.99.81
>> Organization:
> Planix, Inc.; Kelowna, BC; Canada
>> Environment:
> System: NetBSD xentastic 9.99.81 NetBSD 9.99.81 (XEN3_DOM0) #16: Thu May 6 13:40:07 PDT 2021 woods@xentastic:/build/woods/xentastic/current-amd64-amd64-obj/build/src/sys/arch/amd64/compile/XEN3_DOM0 amd64
> Architecture: x86_64
> Machine: amd64
>> Description:
> 
> 	I've been trying out the GCC sanitizers on one of my recent
> 	favourite little projects, and I've found I can reliably crash
> 	NetBSD with one of the tests when it is compiled with
> 	USE_ASAN=yes, at least when it is run with $PWD on an NFS mount.
> 
> 	Here is the console output from an example crash:
> 
> 
> [ 663.0426878] Mutex error: mutex_vector_enter,543: locking against myself
> 
> [ 663.0426878] lock address : 0xffffc8800b962b00
> [ 663.0426878] current cpu  :                  1
> [ 663.0426878] current lwp  : 0xffffc8800b9db1c0
> [ 663.0426878] owner field  : 0xffffc8800b9db1c0 wait/spin:                0/0
> 
> [ 663.0426878] panic: lock error: Mutex: mutex_vector_enter,543: locking against myself: lock 0xffffc8800b00b9db1c0
> [ 663.0426878] cpu1: Begin traceback...
> [ 663.0426878] vpanic() at netbsd:vpanic+0x14a
> [ 663.0426878] snprintf() at netbsd:snprintf
> [ 663.0426878] lockdebug_abort() at netbsd:lockdebug_abort+0xcd
> [ 663.0426878] mutex_vector_enter() at netbsd:mutex_vector_enter+0x406
> [ 663.0426878] sigpending1() at netbsd:sigpending1+0x24
> [ 663.0527222] nfs_sigintr() at netbsd:nfs_sigintr+0x2c
> [ 663.0527222] nfs_rcvlock() at netbsd:nfs_rcvlock+0xaf
> [ 663.0527222] nfs_request() at netbsd:nfs_request+0x40d
> [ 663.0527222] nfs_access() at netbsd:nfs_access+0x1d4
> [ 663.0527222] VOP_ACCESS() at netbsd:VOP_ACCESS+0x55
> [ 663.0527222] getcwd_common() at netbsd:getcwd_common+0x251
> [ 663.0527222] vnode_to_path() at netbsd:vnode_to_path+0xbb
> [ 663.0527222] sysctl_vmproc() at netbsd:sysctl_vmproc+0x6cd
> [ 663.0527222] sysctl_dispatch() at netbsd:sysctl_dispatch+0xa5
> [ 663.0527222] sys___sysctl() at netbsd:sys___sysctl+0xc5
> [ 663.0527222] syscall() at netbsd:syscall+0x9c
> [ 663.0527222] --- syscall (number 202) ---
> [ 663.0527222] netbsd:syscall+0x9c:
> [ 663.0527222] cpu1: End traceback...
> [ 663.0527222] fatal breakpoint trap in supervisor mode
> [ 663.0527222] trap type 1 code 0 rip 0xffffffff8023e93d cs 0xe030 rflags 0x202 cr2 0x7f7ff6892ce0 ilevel
> 
> [ 663.0527222] curlwp 0xffffc8800b9db1c0 pid 6987.6987 lowest kstack 0xffffc880ef49a2c0
> Stopped in pid 6987.6987 (yajl_test) at netbsd:breakpoint+0x5:  leave
> ds          e650
> es          e600
> fs          e640
> gs          10
> rdi         0
> rsi         1
> rbp         ffffc880ef49e640
> rbx         ffffffff80ed2f50    mutex_adaptive_lockops
> rdx         2
> rcx         0
> rax         0
> r8          ffffffff80ed2f50    mutex_adaptive_lockops
> r9          1
> r10         0
> r11         fffffffe
> r12         104
> r13         ffffffff80d43960    ostype+0xa6448
> r14         ffffc880ef49e688
> r15         ffffffff80d3c46b    ostype+0x9ef53
> rip         ffffffff8023e93d    breakpoint+0x5
> cs          e030
> rflags      202
> rsp         ffffc880ef49e640
> ss          e02b
> netbsd:breakpoint+0x5:  leave
> db{1}> (XEN) [2021-05-14 18:09:45.682] Watchdog timer fired for domain 0
> (XEN) [2021-05-14 18:09:45.682] Hardware Dom0 shutdown: watchdog rebooting machine
> 
> 	(I guess ddb.onpanic=1 and the Xen watchdog aren't very useful
> 	together!)
> 
> 
>> How-To-Repeat:
> 
> 	I don't yet have an isolated test case, but running the
> 	regression tests in my robohack/yajl project, and in particular
> 	the "ap_eof_str" test, with USE_ASAN=yes and with the source and
> 	build on an NFS mount (I'm only guessing the NFS mount matters,
> 	because of the nfs_*() calls in the kernel stack backtrace),
> 	has reliably reproduced this crash for me:
> 
> 	$ cd /some/NFS/mountpoint
> 	$ git clone https://github.com/robohack/yajl
> 	$ cd yajl
> 	$ mkdir build
> 	$ MAKEOBJDIRPREFIX=$(/bin/pwd)/build make regress USE_ASAN=yes MKDOC=no
> 
> 	If I understand correctly, the system call involved here is
> 	sysctl(2), and it has something to do with proc too, but I'm
> 	quite unfamiliar with the ASAN runtime internals, so I don't
> 	know what it's doing to cause this, especially since a couple
> 	of other tests have already run by the time this one crashes.
> 	I do know that ASAN checks that ASLR is not enabled, and that
> 	it also mmap()s something somewhere really high up, which
> 	fails unless you do "ulimit -v unlimited" first.
> 
> 	If necessary I can try in a domU, or disable the Xen watchdog
> 	for the dom0 (as otherwise I only have 20 seconds before the
> 	reboot!), and try the crash again and do more DDB digging if
> 	someone can guide me along.  And/Or I can change what's in
> 	ddb.commandonenter too...
> 
>> Fix:
> 
>> Unformatted:
> 		2021-03-10T23:08:13Z
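
For readers unfamiliar with the "locking against myself" message: in the
console output above, the "current lwp" and "owner field" values are the
same (0xffffc8800b9db1c0), i.e. the thread that called mutex_vector_enter()
already owned the mutex it was trying to take. The report does not say
which lock that was or where on the sysctl path it was first acquired.
What follows is only a userland analogy using POSIX error-checking
mutexes, not the kernel code; the function name relock_against_myself()
is made up for illustration.

```c
#define _XOPEN_SOURCE 700	/* for PTHREAD_MUTEX_ERRORCHECK */
#include <errno.h>
#include <pthread.h>

/*
 * Userland analogy only: lock an error-checking mutex twice from the
 * same thread.  Returns the error code from the second lock attempt
 * (EDEADLK per POSIX), or the error from an earlier setup step.
 */
int
relock_against_myself(void)
{
	pthread_mutexattr_t attr;
	pthread_mutex_t mtx;
	int error, second;

	if ((error = pthread_mutexattr_init(&attr)) != 0)
		return error;
	error = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	if (error == 0)
		error = pthread_mutex_init(&mtx, &attr);
	pthread_mutexattr_destroy(&attr);
	if (error != 0)
		return error;

	if ((error = pthread_mutex_lock(&mtx)) != 0) {
		pthread_mutex_destroy(&mtx);
		return error;
	}

	/* Same thread, same mutex: the owner check fires instead of a hang. */
	second = pthread_mutex_lock(&mtx);

	/* The first lock is still held, so release it before cleanup. */
	pthread_mutex_unlock(&mtx);
	pthread_mutex_destroy(&mtx);
	return second;
}
```

Calling relock_against_myself() from a small main() and printing
strerror() of the result should report a deadlock-avoided error. The
kernel's lock debugging turns the equivalent condition into a panic
rather than returning an error to the caller.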



