Subject: Re: 4.99.17 still panics on TS7250 (was: Re: 4.99.16 panics on
To: None <ali@df.lth.se, ad@netbsd.org>
From: Chris Gilbert <chris@dokein.co.uk>
List: port-arm
Date: 04/16/2007 01:56:29
On Sun, 15 Apr 2007 15:44:03 +0100
Chris Gilbert <chris@dokein.co.uk> wrote:

> On Wed, 11 Apr 2007 01:33:12 +0200 (CEST)
> Anders Lindgren <ali@df.lth.se> wrote:
>
> >    The lock the kernel is crashing on (via sys_read ... pipe_read) is:
> > 
> >   COMMON         0x00000000c0516d8c      0x118 kern_synch.o
> >                  0x00000000c0516d8c                sched_mutex
> > 
> > ..which seems like a pretty bad thing to happen. :) I'm going to see what 
> > happens if I boot a stock 3.1 release build instead later this week.
> > 
> 
> I've just reproduced this on my eb7500atx.  Looking at my cats config file, it doesn't have option DIAGNOSTIC, which puts in the mutex versions of sched_{un}lock_idle.  If something is broken with scheduling, that might explain both why my cats box feels like it's underperforming and why it still boots.
> 
> My best guess at the moment is that arm is messing up the sched_{un}lock_idle calls.  We actually do things differently from i386: e.g. after idling it runs at splhigh, and the unlock happens much later, but with interrupts turned on.  On comparison, these differences shouldn't have any effect, but perhaps they do.
> 
> I would look into it further, but as the idlelwp code is due to be merged this week, and that changes idling, it may resolve the problems.  So really it's laziness on my part: I don't want to have to dissect cpu_switch :)
> 
> Once idlelwp is merged I'll try a new kernel; if the problem is still there, I'll look into it further (as my eb7500atx system is rather useless without booting up :)

I did a bit more digging.  To try to figure out whether it's the MI or the MD code, I placed:

	if (sched_mutex.mtx_lock == __SIMPLELOCK_LOCKED)
		panic("sched mutex should have been unlocked!");

at the end of mi_switch, and it doesn't trigger.  I believe that if the MI code were wrong, it would have panicked.

My current best guess is that having DIAGNOSTIC enabled causes the FULL code in kern_mutex.c to be used, and that it's going wrong somewhere in there.  I need sleep before I try to get my head around that code.

One odd thing I did notice is that the owner field is 0x00010d00, which doesn't look like a normal value to end up in an owner field.
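
As a rough illustration of why that value looks wrong, here is a minimal sketch of the sanity check I have in mind.  It rests on two assumptions that are mine, not something taken from kern_mutex.c: that a sane owner value would be a word-aligned kernel pointer, and that kernel addresses start at 0xc0000000 (picked only because sched_mutex itself sits at 0xc0516d8c in the link map quoted above).  The name plausible_kernel_ptr is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumption: kernel virtual addresses start at this base.  Chosen only
 * because sched_mutex is linked at 0xc0516d8c; adjust for other ports.
 */
#define ASSUMED_KERNEL_BASE	UINT32_C(0xc0000000)

/*
 * Hypothetical helper, not part of the kernel: returns true if v could
 * plausibly be a kernel pointer, i.e. it lies at or above the assumed
 * kernel base and is 4-byte aligned.
 */
static bool
plausible_kernel_ptr(uint32_t v)
{
	return v >= ASSUMED_KERNEL_BASE && (v & 3u) == 0;
}
```

Under those assumptions 0x00010d00 fails the check (it is aligned, but far below the kernel base), while the address of sched_mutex passes.  That only says the value is unlikely to be a pointer; it doesn't say what was actually written there.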

ad, any ideas on how to go about debugging this?

Thanks,
Chris