possible lwp_lock() issue

To: tech-kern%netbsd.org@localhost
Subject: possible lwp_lock() issue
From: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
Date: Wed, 2 Jul 2008 22:07:25 +0200

Hi,
I got several panics of the following type on an amd64 XEN3_DOM0 with an HVM
guest running (so lots of context switches). The kernel is built with
DEBUG+DIAGNOSTIC (but not LOCKDEBUG):
Mutex error: mutex_vector_exit: exiting unheld spin mutex

lock address : 0xffffa0002bca6f48
current cpu  :                  0
current lwp  : 0xffffa0002bcaa000
owner field  : 0x0000000000000700 wait/spin:                0/1

panic: lock error
fatal breakpoint trap in supervisor mode
trap type 1 code 0 rip ffffffff804abbdd cs e030 rflags 246 cr2 ffffa0002fa70e00 cpl 7 rsp ffffa0002c62aa80
Stopped in pid 0.5 (system) at  netbsd:breakpoint+0x5:  leave
breakpoint() at netbsd:breakpoint+0x5
panic() at netbsd:panic+0x255
lockdebug_abort() at netbsd:lockdebug_abort+0x42
mutex_vector_exit() at netbsd:mutex_vector_exit+0xfd
callout_softclock() at netbsd:callout_softclock+0x1ef
softint_thread() at netbsd:softint_thread+0x88
ds          0xaa90
es          0xc95c
fs          0xaa90
gs          0xca37
rdi         0
rsi         0xdeadbeef
rbp         0xffffa0002c62aa80
rbx         0xffffa0002c62aa90
rdx         0
rcx         0
rax         0x1
r8          0xffffffff80ac6200  cpu_info_primary
r9          0x1
r10         0xffffa0002c62a9a0
r11         0xffffffff804e7ca0  xenconscn_putc
r12         0x100
r13         0xffffffff8084b3e3  copyright+0x19663
r14         0xffffffff80a7fd20  mutex_spin_lockops
r15         0xffffffff803fdcc0  sleepq_timeout
rip         0xffffffff804abbdd  breakpoint+0x5
cs          0xe030
rflags      0x246
rsp         0xffffa0002c62aa80
ss          0xe02b
netbsd:breakpoint+0x5:  leave

First, I think ddb is missing a function call that got optimised out of the
backtrace, and mutex_vector_exit() was really called from sleepq_timeout().
The assembly around callout_softclock+0x1ef is:
0xffffffff8040b094 <callout_softclock+484>:     callq  0xffffffff804babd0 <mutex_spin_exit>
0xffffffff8040b099 <callout_softclock+489>:     mov    %r14,%rdi
0xffffffff8040b09c <callout_softclock+492>:     callq  *%r15
0xffffffff8040b09f <callout_softclock+495>:     mov    %r13,%rdi
0xffffffff8040b0a2 <callout_softclock+498>:     callq  0xffffffff804bab80 <mutex_spin_enter>

%r15 points to sleepq_timeout, and I think it has not been clobbered by the
time ddb is entered (I disassembled the sleepq_timeout->mutex_vector_exit path
to make sure).

I guess that in sleepq_timeout() we end up in the (l->l_wchan == NULL) case.

Now the question:
in lwp_lock(), we call lwp_lock_retry() if l->l_mutex changed while we were
taking the lock. But we do that only if LOCKDEBUG || MULTIPROCESSOR; otherwise
(i.e. in a XEN3_DOM0 kernel) we do a simple mutex_spin_enter(). Are we sure
l->l_mutex can't change in this case? The panic I'm getting seems to
show that it can ...
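
To make the suspected failure mode concrete, here is a minimal single-threaded
sketch in C. It is not the NetBSD code: toy_mutex, toy_lwp, race_hook and the
demo_* functions are hypothetical names invented for illustration, and the
"race" is simulated by a hook that changes l_mutex once, between reading the
pointer and acquiring the lock. It only assumes the behaviour described above:
one path locks whatever l_mutex pointed to, the other re-checks the pointer
after acquiring and retries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy spin mutex: only records whether it is held (hypothetical model). */
struct toy_mutex {
	bool held;
};

/* Toy LWP: l_mutex is the lock that currently covers this LWP. */
struct toy_lwp {
	struct toy_mutex *l_mutex;
};

static struct toy_mutex mutex_a, mutex_b;

/* Hypothetical hook standing in for "someone changed l->l_mutex meanwhile". */
static void (*race_hook)(struct toy_lwp *);

static void toy_enter(struct toy_mutex *m) { assert(!m->held); m->held = true; }

/* Returns -1 on the equivalent of "exiting unheld spin mutex". */
static int toy_exit(struct toy_mutex *m)
{
	if (!m->held)
		return -1;
	m->held = false;
	return 0;
}

/* Change l_mutex exactly once, then disarm the hook. */
static void swap_once(struct toy_lwp *l)
{
	l->l_mutex = &mutex_b;
	race_hook = NULL;
}

/* Simple variant: lock the mutex we read, never look at l_mutex again. */
static void lock_simple(struct toy_lwp *l)
{
	struct toy_mutex *m = l->l_mutex;
	if (race_hook != NULL)
		race_hook(l);		/* pointer changes before we own m */
	toy_enter(m);
}

/* Retry variant: after acquiring, verify l_mutex still names that mutex. */
static void lock_retry(struct toy_lwp *l)
{
	for (;;) {
		struct toy_mutex *m = l->l_mutex;
		if (race_hook != NULL)
			race_hook(l);
		toy_enter(m);
		if (l->l_mutex == m)
			return;		/* we hold the right lock */
		toy_exit(m);		/* stale lock: drop it and try again */
	}
}

static void reset(struct toy_lwp *l)
{
	mutex_a.held = mutex_b.held = false;
	l->l_mutex = &mutex_a;
	race_hook = swap_once;
}

/* Unlock always goes through the current l_mutex, as an lwp unlock would. */
int demo_simple(void)
{
	struct toy_lwp l;
	reset(&l);
	lock_simple(&l);
	return toy_exit(l.l_mutex);
}

int demo_retry(void)
{
	struct toy_lwp l;
	reset(&l);
	lock_retry(&l);
	return toy_exit(l.l_mutex);
}
```

In this model demo_simple() returns -1 (the unlock hits a mutex that was never
acquired), while demo_retry() returns 0. Whether l->l_mutex can really change
under a uniprocessor, non-LOCKDEBUG kernel is exactly the open question; the
sketch only shows what the symptom would look like if it can.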

-- 
Manuel Bouyer <bouyer%antioche.eu.org@localhost>
     NetBSD: 26 years of experience will always make the difference