Subject: Re: mutex fault
To: Manuel Bouyer <bouyer@antioche.eu.org>
From: Andrew Doran <ad@NetBSD.org>
List: tech-kern
Date: 12/07/2007 13:45:48
On Fri, Dec 07, 2007 at 12:11:23PM +0100, Manuel Bouyer wrote:

> [ moved to tech-kern to get wider audience ]
> 
> Hi,
> does anyone with some clues about locking have some comments about this ?
> I've also seen this on a xenamd64 system. This system has raidframe
> and LFS (the LFS is not on a RAID partition) if that matters.

I saw similar problems on i386 when the compiler was reordering the
statements in MUTEX_SPIN_SPLRAISE/SPLRESTORE in kern_mutex.c, however that
was fixed a very long time ago. splraise/spllower are function calls on xen
so there should be no problems, and reading the disassembly it looks OK.

It may be something to do with the SPL, because on a single processor system
there is no other race than can occur in that path. Can an interrupt on Xen
access the scheduler at IPL_HIGH?

> Kazushi, do you use raidframe or LFS on your system ?

Can you try to reproduce the problem with LOCKDEBUG enabled? It may give more
useful information.

Thanks,
Andrew

> On Wed, Nov 28, 2007 at 07:00:50PM +0900, Kazushi Marukawa wrote:
> > Thanks.  I got new crash log from -O0 kernel and it was
> > different.  There are two functions between
> > callout_softclock and mutex_vector_exit.  lwp_unlock and
> > sleepq_timeout functions.
> > 
> > The sleepq_timeout is:
> > 
> >     lwp_lock(l);
> >     if (l->l_wchan == NULL) {
> >         lwp_unlock(l);
> >         return;
> >     }
> > 
> > Calling lwp_lock(l) and low_unlock(l) in above sequence was
> > causing my problem.  hehe.  I guess some rare race condition
> > is causing this, but not sure where.  Is it possible to find
> > the source of problem from the trace information?  I added
> > the trace log at the bottom of this mail.
> > 
> > On the other hand, I can put some codes to trap the
> > condition if a patch to modify source to trap it is
> > available.  Please let me konw for the case.
> > 
> > Thanks.
> > 
> > 
> > Trace log is here:
> > ----
> > Mutex error: mutex_vector_exit: exiting unheld spin mutex
> > 
> > lock address : 0x00000000c0b4ed80
> > current cpu  :                  0
> > current lwp  : 0x00000000ca965e00
> > owner field  : 0x0000000000000b00 wait/spin:                0/1
> > 
> > panic: lock error
> > fatal breakpoint trap in supervisor mode
> > trap type 1 code 0 eip c05ce919 cs 9 eflags 286 cr2 0 ilevel b
> > Stopped in pid 0.2 (system) at  netbsd:breakpoint+0x1:  ret
> > db> trace
> > breakpoint(c08d6542,cb192e34,0,0,cb192e34) at netbsd:breakpoint+0x1
> > panic(c08d584b,c08d1868,c08d19d1,c08d19e3,c0b4ed80) at netbsd:panic+0xcc
> > lockdebug_abort(c0b4ed80,c0aa52ec,c08d19d1,c08d19e3,c0b4df06) at netbsd:lockdebug_abort+0x6a
> > mutex_abort(c0b4ed80,c08d19d1,c08d19e3,0,c0cc4000) at netbsd:mutex_abort+0x42
> > mutex_vector_exit(c0b4ed80,b360,cb192ed8,c04fd6f2,cd5788c0) at netbsd:mutex_vector_exit+0x43
> > lwp_unlock(cd5788c0,ca003b9a,989680,c0aeb360,c0aeb360) at 0xc04fd72b
> > sleepq_timeout(cd5788c0,0,64,c0617c80,1) at netbsd:sleepq_timeout+0x2f
> > callout_softclock(0,0,cb192fa8,c0b52d00,c180dcc0) at netbsd:callout_softclock+0x537
> > softintr_dispatch(0,cb192f78,f,0,c04e35d4) at netbsd:softintr_dispatch+0x9a
> > DDB lost frame for netbsd:Xsoftclock+0x3c, trying 0xcb192f70
> > Xsoftclock() at netbsd:Xsoftclock+0x3c
> > --- interrupt ---
> > --- switch to interrupt stack ---
> > ?(bb784000,0,0,0,0) at 0
> > db> reboot
> > syncing disks... Mutex error: mutex_vector_enter: locking against myself
> > 
> > lock address : 0x00000000c0b4df04
> > current cpu  :                  0
> > current lwp  : 0x00000000ca965e00
> > owner field  : 0x0000000000010b00 wait/spin:                0/1
> > 
> > panic: lock error
> > fatal breakpoint trap in supervisor mode
> > trap type 1 code 0 eip c05ce919 cs 9 eflags 296 cr2 0 ilevel c
> > Stopped in pid 0.2 (system) at  netbsd:breakpoint+0x1:  ret
> > db> trace
> > breakpoint(c08d6542,cb1929fc,0,0,cb1929fc) at netbsd:breakpoint+0x1
> > panic(c08d584b,c08d1868,c08d1985,c08d1998,c0b4df04) at netbsd:panic+0xcc
> > lockdebug_abort(c0b4df04,c0aa52ec,c08d1985,c08d1998,b) at netbsd:lockdebug_abort+0x6a
> > mutex_abort(c0b4df04,c08d1985,c08d1998,c04ef12d,cd561026) at netbsd:mutex_abort+0x42
> > mutex_vector_enter(c0b4df04,c,cb192ab0,c0501184,cd379c40) at netbsd:mutex_vector_enter+0x86
> > lwp_lock(cd379c40,0,cb192ac0,c01f9d29,cd379c40) at 0xc0500251
> > suspendsched(c08db64a,0,cb192ae0,c01f9dc1,cb192ad8) at netbsd:suspendsched+0x77
> > vfs_shutdown(cb192b1c,1,cb192b20,c01f8e83,0) at netbsd:vfs_shutdown+0x2c
> > cpu_reboot(0,0,c01f8463,c0b3c4e0,c0881560) at netbsd:cpu_reboot+0x3f
> > db_sifting_cmd(1,0,0,cb192b58,8) at netbsd:db_sifting_cmd
> > db_command(c0a9c5c0,bfeffc04,cb192c20,9,cb192bf4) at netbsd:db_command+0x39c
> > db_command_loop(c05ce919,0,2,c0aa55ed,cb192c0c) at netbsd:db_command_loop+0x96
> > db_trap(1,0,58,0,2) at netbsd:db_trap+0x1a2
> > kdb_trap(1,0,cb192dac,c05ce919,9) at netbsd:kdb_trap+0x161
> > trap() at netbsd:trap+0x423
> > --- trap (number 1) ---
> > breakpoint(c08d6542,cb192e34,0,0,cb192e34) at netbsd:breakpoint+0x1
> > panic(c08d584b,c08d1868,c08d19d1,c08d19e3,c0b4ed80) at netbsd:panic+0xcc
> > lockdebug_abort(c0b4ed80,c0aa52ec,c08d19d1,c08d19e3,c0b4df06) at netbsd:lockdebug_abort+0x6a
> > mutex_abort(c0b4ed80,c08d19d1,c08d19e3,0,c0cc4000) at netbsd:mutex_abort+0x42
> > mutex_vector_exit(c0b4ed80,b360,cb192ed8,c04fd6f2,cd5788c0) at netbsd:mutex_vector_exit+0x43
> > lwp_unlock(cd5788c0,ca003b9a,989680,c0aeb360,c0aeb360) at 0xc04fd72b
> > sleepq_timeout(cd5788c0,0,64,c0617c80,1) at netbsd:sleepq_timeout+0x2f
> > callout_softclock(0,0,cb192fa8,c0b52d00,c180dcc0) at netbsd:callout_softclock+0x537
> > softintr_dispatch(0,cb192f78,f,0,c04e35d4) at netbsd:softintr_dispatch+0x9a
> > DDB lost frame for netbsd:Xsoftclock+0x3c, trying 0xcb192f70
> > Xsoftclock() at netbsd:Xsoftclock+0x3c
> > --- interrupt ---
> > --- switch to interrupt stack ---
> > ?(bb784000,0,0,0,0) at 0
> > db> reboot
> > Mutex error: mutex_vector_enter: locking against myself
> > 
> > lock address : 0x00000000c0b4df04
> > current cpu  :                  0
> > current lwp  : 0x00000000ca965e00
> > owner field  : 0x0000000000010b00 wait/spin:                0/1
> > 
> > panic: lock error
> > fatal breakpoint trap in supervisor mode
> > trap type 1 code 0 eip c05ce919 cs 9 eflags 282 cr2 0 ilevel b
> > Stopped in pid 0.2 (system) at  netbsd:breakpoint+0x1:  ret
> > db> trace
> > breakpoint(c08d6542,cb191fa8,0,0,cb191fa8) at netbsd:breakpoint+0x1
> > panic(c08d584b,c08d1868,c08d1985,c08d1998,c0b4df04) at netbsd:panic+0xcc
> > lockdebug_abort(c0b4df04,c0aa52ec,c08d1985,c08d1998,b) at netbsd:lockdebug_abort+0x6a
> > mutex_abort(c0b4df04,c08d1985,c08d1998,c04ef12d,c0b4edc6) at netbsd:mutex_abort+0x42
> > mutex_vector_enter(c0b4df04,2,cb19205c,c0528892,cd22e1c0) at netbsd:mutex_vector_enter+0x86
> > lwp_lock(cd22e1c0,0,cb19209c,c055032b,2) at 0xc05288e9
> > selwakeup(c18b494c,0,cb19209c,c054fa93,c18b494c) at netbsd:selwakeup+0x69
> > selnotify(c18b494c,0,c19af600,c19af600,42) at netbsd:selnotify+0x11
> > sowakeup(c18b48c8,c18b494c,1,c19af600,1) at netbsd:sowakeup+0x19
> > udp4_sendup(c18a3e00,1c,cb192188,c18b48c8,c18a3e00) at netbsd:udp4_sendup+0x171
> > udp4_realinput(cb192188,cb192178,cb1921c4,1c,ca9623c0) at netbsd:udp4_realinput+0x15a
> > udp_input(c18a3e00,14,11,1,c0b4da44) at netbsd:udp_input+0x223
> > ip_input(c18a3e00,800,c18a8d00,caf7d030,2) at netbsd:ip_input+0xae7
> > ipintr(cb192278,cb190011,c05e0031,c01f0011,11) at netbsd:ipintr+0x6d
> > Xsoftnet() at netbsd:Xsoftnet+0x57
> > --- interrupt ---
> > 0x9:
> > db> reboot
> > Mutex error: mutex_vector_enter: locking against myself
> > 
> > lock address : 0x00000000cd561024
> > current cpu  :                  0
> > current lwp  : 0x00000000ca965e00
> > owner field  : 0x0000000000010b00 wait/spin:                0/1
> > 
> > panic: lock error
> > fatal breakpoint trap in supervisor mode
> > trap type 1 code 0 eip c05ce919 cs 9 eflags 292 cr2 0 ilevel b
> > Stopped in pid 0.2 (system) at  netbsd:breakpoint+0x1:  ret
> > db> trace
> > breakpoint(c08d6542,cb191970,0,0,cb191970) at netbsd:breakpoint+0x1
> > panic(c08d584b,c08d1868,c08d1985,c08d1998,cd561024) at netbsd:panic+0xcc
> > lockdebug_abort(cd561024,c0aa52ec,c08d1985,c08d1998,b) at netbsd:lockdebug_abort+0x6a
> > mutex_abort(cd561024,c08d1985,c08d1998,c04ef349,c0b4e536) at netbsd:mutex_abort+0x42
> > mutex_vector_enter(cd561024,2000000,cb191a24,c05e145c,1000000) at netbsd:mutex_vector_enter+0x86
> > sched_pstats(0,c19c8418,c18a4200,c1a42000,762cf000) at netbsd:sched_pstats+0x39
> > callout_softclock(0,c19b2d00,800,c0b52d00,c180dcc0) at netbsd:callout_softclock+0x52a
> > softintr_dispatch(0,cb191b14,cb190011,c05e0031,c01f0011) at netbsd:softintr_dispatch+0x9a
> > Xsoftclock() at netbsd:Xsoftclock+0x3c
> > --- interrupt ---
> > 0x9:
> > db> reboot
> > rebooting...
> > 
> > -- Kazushi
> > Screw up your courage!  You've screwed up everything else.
> -- 
> Manuel Bouyer, LIP6, Universite Paris VI.           Manuel.Bouyer@lip6.fr
>      NetBSD: 26 ans d'experience feront toujours la difference
> --