Subject: Re: mutex fault
To: Manuel Bouyer <bouyer@antioche.eu.org>
From: Kazushi (Jam) Marukawa <jam@pobox.com>
List: port-xen
Date: 11/26/2007 20:24:45
Hi Manuel,

Thanks for checking codes.  As you requested, I ran the new
kernel after compiling with -g option (call it netbsd1122g)
and got crash trase of it.  Unfortunately, it seems the
same.  The address is also the same, callout_softclock+0x24d.

Disasseble of callout_softclock shows following (0x24d=589):

  0xc043151a <callout_softclock+586>:     call   *0xffffffec(%ebp)
  0xc043151d <callout_softclock+589>:     movl   $0xc098d594,(%esp)
  0xc0431524 <callout_softclock+596>:     call   0xc04db580 <mutex_spin_enter>

Line numbers are:
  callout_softclock+586: kern/kern_timeout.c 604
  callout_softclock+589: kern/kern_timeout.c 605
  callout_softclock+596: kern/kern_timeout.c 605

And here is the source code.
  602                                     KERNEL_UNLOCK_ONE(curlwp);
  603                             } else
  604                                             (*func)(arg);
  605                             mutex_spin_enter(&callout_lock);
  606
  607                             /*
  608                              * We can't touch 'c' here because it might be
  609                              * freed already.  If LWPs waiting for callout

   > it seems it called mutex_spin_enter instead
   > of mutex_spin_exit.

The return address points the address where the program goes
back, so it points the next address of what it was executing
in the previous stack frame.

So, I guess "func" has a pointer to the mutex_vector_exit
function.  And kernel called it at line 604 above.  Then,
the kernel crashed inside the function pointed by "func
(=mutex_vector_exit())".

I'm not sure why or how "func" gets pointer to the
mutex_vector_exit, though.  Maybe several mutex calls were
overlapped?


Anyway, here is a new stack trace from the kernel compiled
with "-g".  It's almost the same as the previous one.

----
Mutex error: mutex_vector_exit: exiting unheld spin mutex

lock address : 0x00000000c098dde0
current cpu  :                  0
current lwp  : 0x00000000ca7a5e00
owner field  : 0x0000000000000b00 wait/spin:                0/1

panic: lock error
fatal breakpoint trap in supervisor mode
trap type 1 code 0 eip c04ca539 cs 9 eflags 246 cr2 0 ilevel b
Stopped in pid 0.2 (system) at  netbsd:breakpoint+0x1:  ret
db> trace
breakpoint(c0879e8d,c0876aa5,c0701570,c0876a41,c098dde0) at netbsd:breakpoint+0x1
lockdebug_abort(c098dde0,c08e72ac,c0701570,c0876a41,cd48826c) at netbsd:lockdebug_abort+0x61
mutex_abort(c098dde0,c0701570,c0876a41,c04db1d7,cd48826c) at netbsd:mutex_abort+0x34
mutex_vector_exit(c098dde0,ca003b9a,cafb6f78,c04db275,c092a560) at netbsd:mutex_vector_exit+0xcf
callout_softclock(0,0,cafb6fb8,c04dbf26,0) at netbsd:callout_softclock+0x24d
softintr_dispatch(0,cafb6fa8,c0ae000f,f,fffff000) at netbsd:softintr_dispatch+0x3f
DDB lost frame for netbsd:Xsoftclock+0x3c, trying 0xcafb6fa0
Xsoftclock() at netbsd:Xsoftclock+0x3c
--- interrupt ---
--- switch to interrupt stack ---
?(31,ca000011,ca780011,0,35f3) at 0x1

   On Nov 26,  0:04, Manuel Bouyer wrote:
   > Subject: Re: mutex fault
   > 
   > Thanks. Can you reproduce the panic with this kernel (in case it's not
   > at callout_softclock+0x24d any more), and also map the address to a
   > line number in C code ?  it's
   > l *(callout_softclock+0x24d)
   > in gdb (easier than doing dissassembly by hand :)
   > 
   > From this it's the same as I get, but it's not consistent with the
   > backtrace you posted earlier ... it seems it called mutex_spin_enter instead
   > of mutex_spin_exit.
   > 
   > -- 
   > Manuel Bouyer <bouyer@antioche.eu.org>
   >      NetBSD: 26 ans d'experience feront toujours la difference
   > --
   >-- End of excerpt from Manuel Bouyer

-- Kazushi
Last week a cop stopped me in my car.  He asked me if I had a police
record.  I said, no, but I have the new DEVO album.  Cops have no sense
of humor.