Subject: Re: 4.99.17 still panics on TS7250 (was: Re: 4.99.16 panics on
To: Anders Lindgren <ali@df.lth.se>
From: Chris Gilbert <chris@dokein.co.uk>
List: port-arm
Date: 04/15/2007 15:44:03
On Wed, 11 Apr 2007 01:33:12 +0200 (CEST)
Anders Lindgren <ali@df.lth.se> wrote:

>    Ok, noticing there have been a lot of updates in sys/kern etc during 
> easter, I cvs up'd tonight and rebuilt a complete distribution, put a copy 
> of the TS7200 epe0 kernel in DESTDIR, MAKEDEV'd its /dev etc and 
> TFTP-booted the corresponding netbsd-epe0.bin image. Still *boom* with an 
> unmodified kernel.
> 
>    Noticed the following interesting tidbits:
> 
>    With default TS7200 kernel, at:
> 
> ---8<---
> nfs_boot: my_addr=192.168.1.12
> nfs_boot: my_mask=255.255.255.0
> nfs_boot: gateway=192.168.1.1
> root on 192.168.1.6:/export/tsarm
> /etc/rc.conf is not configured.  Multiuser boot aborted.
> Enter pathname of shell or RETURN for /bin/sh:
> ---8<---
> 
>    If I press return or type /bin/sh, I get an immediate "locking against 
> myself" panic as described earlier.
> 
>    If I type "/bin/ksh" instead... it works.
> 
>    With an "options LOCKDEBUG" kernel, I don't seem to get a kernel panic at 
> all; at least I can configure rc and customize some /etc files with vi, 
> create a user and set passwords, set time with ntpdate, and boot all the 
> way to multi-user and run "find /" on the entire fs without problems -- 
> seems to work ok so far. Without it, I get the mutex error panic pretty 
> much instantly on attempt to start multiuser boot.
> 
>    The lock the kernel is crashing on (via sys_read ... pipe_read) is:
> 
>   COMMON         0x00000000c0516d8c      0x118 kern_synch.o
>                  0x00000000c0516d8c                sched_mutex
> 
> ..which seems like a pretty bad thing to happen. :) I'm going to see what 
> happens if I boot a stock 3.1 release build instead later this week.
> 

I've just reproduced this on my eb7500atx.  My cats config file doesn't have option DIAGNOSTIC, which is what puts in the mutex versions of sched_{un}lock_idle.  That might explain why my cats box still boots, and, if something is broken with scheduling, why it feels like it is underperforming.

My best guess at the moment is that arm is messing up the sched_{un}lock_idle calls.  We actually do things differently from i386: e.g. after idling it runs at splhigh, and the unlock also happens much later, although interrupts are turned on.  On comparison these differences shouldn't have any effect, but perhaps they do.
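
To illustrate the kind of failure I'm guessing at, here is a toy userspace sketch of an owner-tracking mutex.  It is not the NetBSD mutex code and not what cpu_switch does; every name in it (toy_mutex, toy_mutex_enter, toy_idle_then_switch, and so on) is made up for illustration.  It only shows that if one context takes a lock on the idle path and the matching unlock is skipped, the next attempt by that same context is what an owner check would flag as "locking against myself".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy mutex: remembers which context currently holds it. */
struct toy_mutex {
	const void *owner;	/* NULL when unlocked */
};

enum toy_result {
	TOY_OK,
	TOY_BUSY,			/* held by a different context */
	TOY_LOCKING_AGAINST_MYSELF	/* held by the caller already */
};

/* Try to take the lock on behalf of `self` (must be non-NULL). */
static enum toy_result
toy_mutex_enter(struct toy_mutex *m, const void *self)
{
	if (m->owner == self)
		return TOY_LOCKING_AGAINST_MYSELF;
	if (m->owner != NULL)
		return TOY_BUSY;	/* a real spin mutex would spin here */
	m->owner = self;
	return TOY_OK;
}

/* Release the lock; only the current owner may do so. */
static void
toy_mutex_exit(struct toy_mutex *m, const void *self)
{
	assert(m->owner == self);
	m->owner = NULL;
}

/*
 * Assumed scenario: the idle path locks, then (unless skip_unlock is
 * nonzero) unlocks, and afterwards the same context locks again.
 */
static enum toy_result
toy_idle_then_switch(struct toy_mutex *m, const void *self, int skip_unlock)
{
	enum toy_result r = toy_mutex_enter(m, self);

	if (r != TOY_OK)
		return r;
	if (!skip_unlock)
		toy_mutex_exit(m, self);

	r = toy_mutex_enter(m, self);
	if (r == TOY_OK)
		toy_mutex_exit(m, self);
	return r;
}
```

With skip_unlock set to 0 the second toy_mutex_enter succeeds; with it set to 1 the second call returns TOY_LOCKING_AGAINST_MYSELF and the lock stays held.  Whether the arm idle path really loses an unlock like this is still only a guess.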

I would look into it further, but as the idlelwp code is due to be merged this week, and that changes idling, it may resolve the problems.  So really it's laziness on my part: I don't want to have to dissect cpu_switch :)

Once idlelwp is merged I'll try a new kernel; if the problem is still there, I'll look into it further (as my eb7500atx system is rather useless if it can't boot :)

Thanks,
Chris