Subject: Re: frequent panics in sysctl_doeproc
To: Tobias Nygren <tnn@NetBSD.org>
From: Stephen Degler <stephen@degler.net>
List: current-users
Date: 07/30/2007 13:31:46
Tobias Nygren wrote:
> Hi everyone,
>
> I could really use some help analysing this crash.
> My sparc64 box has been panicking a lot recently when running top(1).
> The backtrace is rather odd. My current theory is that we assume
> that some page is wired when it actually isn't. Is there any other
> possible reason for taking a fault when attempting to lock a mutex?
>
>> top
> Mutex error: lockdebug_barrier: spin lock held
>
> lock address : 0x0000000028895210 type : spin
> shared holds : 0 exclusive: 1
> shares wanted: 0 exclusive: 0
> current cpu : 0 last held: 0
> current lwp : 0x00000000288a3160 last held: 0x00000000288a3160
> last locked : 0x000000000154c7d4 unlocked : 0x00000000015903f4
> owner field : 0x00ff0a0000000000 wait/spin: 0/1
>
> panic: LOCKDEBUG
> cpu0: kdb breakpoint at 18285cc
> Stopped in pid 266.1 (top) at netbsd:cpu_Debugger+0x8: nop
> db> bt
> panic(19b0c78, 15b9b44, 19b0af0, 19b0b08, ffffffffffffffe8, 5445098)
> +0x1a0
> lockdebug_abort1(288e9c70, 1cc2bc0, 19b0af0, 19b0b08, 5445000, 1000000)
> +0x74
> lockdebug_barrier(1cad8f0, 1, 1, 0, 0, 0) +0x134
> rw_vector_enter(1c79880, 0, 226000, 288fdc58, 0, 0) +0xd8
> vm_map_lock_read(1c79878, 6, 226000, 288fdc58, 0, 0) +0xf067c
> uvmfault_lookup(288fd270, 0, 288fdd88, 288fddd8, 288fdd98, 288fdd84)
> +0x94
> uvm_fault_internal(1c79878, 1019b2000, 1, 0, 0, 1c05800) +0xbc
> data_access_fault(288fd5c0, 30, 1924ad8, 1019b2000, 1019b38a0, 800809)
> +0x5c8
> ?(1cb2300, 154c7d4, 0, 5445098, ffffffffffffffe8, 5445098) 0x1008bb4
> lwp_lock(5445188, 1019b38a0, 8, 1, 5445000, 1000000) 0x1549a74
> fill_kproc2(288951e0, 5445000, 288951e0, 0, 0, 0) +0xa84
> sysctl_doeproc(288fdc68, 4, 226000, 288fdc58, 0, 0) +0x618
> sysctl_dispatch(288fdc60, 6, 226000, 288fdc58, 0, 0) +0x270
> sys___sysctl(288a3160, 288fdd98, 288fdd88, 288fddd8, 288fdd98, 288fdd84)
> +0x1ec
> syscall_plain(288fded0, 8ca, 40b3a86c, 40b3a870, 0, badcafe) +0x194
> ?(ffffffffffffc658, 6, 226000, ffffffffffffc670, 0, 0) at 0x10093e4
> db>
>
> Here's the current kernel configuration:
>
> include "arch/sparc64/conf/GENERIC"
> options NMBCLUSTERS=4096
> options NKMEMPAGES=65536
> makeoptions DEBUG="-g"
> makeoptions COPTS="-O0"
> options DEBUG
> options DIAGNOSTIC
> options LOCKDEBUG
> options INSECURE
> pseudo-device pf
> pseudo-device pflog
> no pseudo-device veriexec
> no options FILEASSOC
>
> I've uploaded "nm -n" and "objdump -d" output here:
>
> http://www.netilium.org/~tnn/20070730/
>
> -Tobias
>
>
I have a similar issue with a kernel w/o lockdebug or multiprocessor:
I see panics in strncpy.
With the following patch to init_sysctl.c, the second assertion *is*
hit. I added the check against 0xffffffff00000001, a value I found in
ddb, because the NULL test alone was not firing.
Isn't this (int)1?
@@ -2820,9 +2824,15 @@
 		ki->p_holdcnt = l->l_holdcnt;
 		ki->p_priority = l->l_priority;
 		ki->p_usrpri = l->l_usrpri;
-		if (l->l_wmesg)
+		if (l->l_wchan) {
+			ki->p_wchan = PTRTOUINT64(l->l_wchan);
+			KASSERT(ki->p_wmesg != NULL);
+			KASSERT(l->l_wmesg != NULL &&
+			    (unsigned long)(l->l_wmesg) !=
+			    0xffffffff00000001);
 			strncpy(ki->p_wmesg, l->l_wmesg,
 			    sizeof(ki->p_wmesg));
-		ki->p_wchan = PTRTOUINT64(l->l_wchan);
+		}
+
 		lwp_unlock(l);
 	}
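
For anyone who wants to poke at the check outside the kernel, here is a
minimal userland sketch of the same idea. It is not the kernel code: the
names BOGUS_WMESG, upper_half, lower_half, wmesg_looks_valid and
copy_wmesg are made up for illustration, and the only value taken from
this thread is 0xffffffff00000001. It assumes a 64-bit build.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The suspicious pointer value reported from ddb in this thread. */
#define BOGUS_WMESG UINT64_C(0xffffffff00000001)

/* Hypothetical helpers: view a 64-bit value as two signed 32-bit halves. */
static int32_t upper_half(uint64_t v) { return (int32_t)(uint32_t)(v >> 32); }
static int32_t lower_half(uint64_t v) { return (int32_t)(uint32_t)(v & 0xffffffffu); }

/* Hypothetical predicate mirroring the second KASSERT in the patch. */
static bool wmesg_looks_valid(const char *wmesg)
{
	return wmesg != NULL && (uint64_t)(uintptr_t)wmesg != BOGUS_WMESG;
}

/*
 * Copy the wait message only when the pointer passes the check.
 * The destination is zeroed first, so it is always NUL-terminated.
 */
static void copy_wmesg(char *dst, size_t dstlen, const char *wmesg)
{
	assert(dst != NULL);
	if (dstlen == 0)
		return;
	memset(dst, 0, dstlen);
	if (wmesg_looks_valid(wmesg))
		strncpy(dst, wmesg, dstlen - 1);
}
```

For the value above, upper_half() gives -1 and lower_half() gives 1, so
one possible reading is that the pointer-sized field holds two 32-bit
ints, with 1 in the low word. That is only a guess from the bit pattern.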
skd