current-users: Re: frequent panics in sysctl

Subject: Re: frequent panics in sysctl_doeproc
To: Tobias Nygren <tnn@NetBSD.org>
From: Stephen Degler <stephen@degler.net>
List: current-users
Date: 07/30/2007 13:31:46
Tobias Nygren wrote:
> Hi everyone,
> 
> I could really use some help analysing this crash.
> My sparc64 box has been panicking a lot recently when running top(1).
> The backtrace is rather odd. My current theory is that we assume
> that some page is wired when it actually isn't. Is there any other
> possible reason for taking a fault when attempting to lock a mutex?
> 
>> top
> Mutex error: lockdebug_barrier: spin lock held
> 
> lock address : 0x0000000028895210 type     :               spin
> shared holds :                  0 exclusive:                  1
> shares wanted:                  0 exclusive:                  0
> current cpu  :                  0 last held:                  0
> current lwp  : 0x00000000288a3160 last held: 0x00000000288a3160
> last locked  : 0x000000000154c7d4 unlocked : 0x00000000015903f4
> owner field  : 0x00ff0a0000000000 wait/spin:                0/1
> 
> panic: LOCKDEBUG
> cpu0: kdb breakpoint at 18285cc
> Stopped in pid 266.1 (top) at   netbsd:cpu_Debugger+0x8:        nop
> db> bt
> panic(19b0c78, 15b9b44, 19b0af0, 19b0b08, ffffffffffffffe8, 5445098)
>   +0x1a0
> lockdebug_abort1(288e9c70, 1cc2bc0, 19b0af0, 19b0b08, 5445000, 1000000)
>   +0x74
> lockdebug_barrier(1cad8f0, 1, 1, 0, 0, 0) +0x134
> rw_vector_enter(1c79880, 0, 226000, 288fdc58, 0, 0) +0xd8
> vm_map_lock_read(1c79878, 6, 226000, 288fdc58, 0, 0) +0xf067c
> uvmfault_lookup(288fd270, 0, 288fdd88, 288fddd8, 288fdd98, 288fdd84)
>    +0x94
> uvm_fault_internal(1c79878, 1019b2000, 1, 0, 0, 1c05800) +0xbc
> data_access_fault(288fd5c0, 30, 1924ad8, 1019b2000, 1019b38a0, 800809)
>    +0x5c8
> ?(1cb2300, 154c7d4, 0, 5445098, ffffffffffffffe8, 5445098) 0x1008bb4
> lwp_lock(5445188, 1019b38a0, 8, 1, 5445000, 1000000) 0x1549a74
> fill_kproc2(288951e0, 5445000, 288951e0, 0, 0, 0) +0xa84
> sysctl_doeproc(288fdc68, 4, 226000, 288fdc58, 0, 0) +0x618
> sysctl_dispatch(288fdc60, 6, 226000, 288fdc58, 0, 0) +0x270
> sys___sysctl(288a3160, 288fdd98, 288fdd88, 288fddd8, 288fdd98, 288fdd84)
>    +0x1ec
> syscall_plain(288fded0, 8ca, 40b3a86c, 40b3a870, 0, badcafe) +0x194
> ?(ffffffffffffc658, 6, 226000, ffffffffffffc670, 0, 0) at 0x10093e4
> db>
> 
> Here's the current kernel configuration:
> 
> include "arch/sparc64/conf/GENERIC"
> options NMBCLUSTERS=4096
> options NKMEMPAGES=65536
> makeoptions    DEBUG="-g"
> makeoptions     COPTS="-O0"
> options DEBUG
> options DIAGNOSTIC
> options LOCKDEBUG
> options INSECURE
> pseudo-device pf
> pseudo-device pflog
> no pseudo-device veriexec
> no options FILEASSOC
> 
> I've uploaded "nm -n" and "objdump -d" output here:
> 
> http://www.netilium.org/~tnn/20070730/
> 
> -Tobias
> 
> 

I have a similar issue. With a kernel built without LOCKDEBUG or
MULTIPROCESSOR, I see panics in strncpy.

With the following patch to init_sysctl.c, the second assertion *is*
hit.  The value 0xffffffff00000001 is something that I found in ddb;
I added a check for it because the NULL test alone was not firing.

Isn't this (int)1?
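
To make the bit pattern easier to reason about, here is a small userland
sketch that splits a 64-bit value into its two 32-bit halves.  The names
words32 and split64 are made up for illustration and are not from the
NetBSD tree; this only looks at the number itself, not at how l_wmesg
came to hold it.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the two 32-bit halves of a 64-bit value. */
struct words32 {
	int32_t hi;	/* upper 32 bits, reinterpreted as a signed int */
	int32_t lo;	/* lower 32 bits, reinterpreted as a signed int */
};

/* Hypothetical helper: split v into signed upper and lower halves. */
static struct words32
split64(uint64_t v)
{
	struct words32 w;
	uint32_t hi = (uint32_t)(v >> 32);
	uint32_t lo = (uint32_t)(v & 0xffffffffu);

	/* Copy the bits instead of casting, to avoid relying on
	 * implementation-defined unsigned-to-signed conversion. */
	memcpy(&w.hi, &hi, sizeof(w.hi));
	memcpy(&w.lo, &lo, sizeof(w.lo));
	return w;
}
```

On a two's-complement machine, split64(0xffffffff00000001ULL) gives
hi == -1 and lo == 1.  On big-endian sparc64 the upper half sits at the
lower address, so the pointer slot would read as an (int)-1 followed by
an (int)1 if something wrote two ints there.  That is an observation
about the value only, not a diagnosis.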

@@ -2820,9 +2824,15 @@
                 ki->p_holdcnt = l->l_holdcnt;
                 ki->p_priority = l->l_priority;
                 ki->p_usrpri = l->l_usrpri;
-               if (l->l_wmesg)
+               if (l->l_wchan) {
+                       ki->p_wchan = PTRTOUINT64(l->l_wchan);
+                       KASSERT(ki->p_wmesg != NULL);
+                       KASSERT(l->l_wmesg != NULL &&
+                               (unsigned long)(l->l_wmesg) !=
+                               0xffffffff00000001);
                         strncpy(ki->p_wmesg, l->l_wmesg, sizeof(ki->p_wmesg));
-               ki->p_wchan = PTRTOUINT64(l->l_wchan);
+               }
+
                 lwp_unlock(l);
         }


skd