Subject: Re: 4.99.17 still panics on TS7250
To: Anders Lindgren <ali@df.lth.se>
From: Chris Gilbert <chris@dokein.co.uk>
List: port-arm
Date: 04/11/2007 01:20:05
Anders Lindgren wrote:
>   Ok, noticing there have been a lot of updates in sys/kern etc during
> easter, I cvs up'd tonight and rebuilt a complete distribution, put a
> copy of the TS7200 epe0 kernel in DESTDIR, MAKEDEV'd its /dev etc and
> TFTP-booted the corresponding netbsd-epe0.bin image. Still *boom* with
> an unmodified kernel.
> 
>   Noticed the following interesting tidbits:
> 
>   With default TS7200 kernel, at:
> 
> ---8<---
> nfs_boot: my_addr=192.168.1.12
> nfs_boot: my_mask=255.255.255.0
> nfs_boot: gateway=192.168.1.1
> root on 192.168.1.6:/export/tsarm
> /etc/rc.conf is not configured.  Multiuser boot aborted.
> Enter pathname of shell or RETURN for /bin/sh:
> ---8<---
> 
>   If I press return or type /bin/sh, I get an immediate "locking against
> myself" panic as described earlier.
> 
>   If I type "/bin/ksh" instead... it works.
> 
>   With an "options LOCKDEBUG" kernel, I don't seem to get a kernel panic
> at all; at least I can configure rc and customize some /etc files with
> vi, create a user and set passwords, set time with ntpdate, and boot all
> the way to multi-user and run "find /" on the entire fs without problems
> -- seems to work ok so far. Without it, I get the mutex error panic
> pretty much instantly on attempt to start multiuser boot.
> 
>   The lock the kernel is crashing on (via sys_read ... pipe_read) is:
> 
>  COMMON         0x00000000c0516d8c      0x118 kern_synch.o
>                 0x00000000c0516d8c                sched_mutex
> 
> ..which seems like a pretty bad thing to happen. :) I'm going to see
> what happens if I boot a stock 3.1 release build instead later this week.
> 
>   Any help on how to proceed from here greatly appreciated.

My best guess is that we've messed up locking on arm somewhere. I'll try
to get time to fully boot an arm box with -current and see if I can
repro this.

It's odd that LOCKDEBUG makes it go away, which suggests a timing issue.
The only arm code I can find that uses LOCKDEBUG is the pmap code, and
this seems unrelated to that.  Still, it's worth a shot: add a
#define LOCKDEBUG to pmap.c and see if the problem goes away.

cpuswitch.S does make calls to sched_lock and unlock, but I'm not sure
if this is the same mutex or not.
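
For anyone following along, this is roughly what a "locking against
myself" check amounts to. It is a toy sketch in plain C, not the
kernel's actual mutex code; the toy_* names and the result enum are made
up for illustration. The point is only that if some path takes a lock
and never releases it, the next attempt by the same owner trips the
check.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: this is NOT NetBSD's kmutex_t or its real API. */
struct toy_mutex {
	const void *owner;	/* NULL when free, else an opaque id of the holder */
};

enum toy_result { TOY_ACQUIRED, TOY_BUSY, TOY_SELF_DEADLOCK };

/* Try to take the lock on behalf of `self` (any unique pointer will do). */
static enum toy_result
toy_mutex_enter(struct toy_mutex *m, const void *self)
{
	if (m->owner == NULL) {
		m->owner = self;
		return TOY_ACQUIRED;
	}
	if (m->owner == self)
		return TOY_SELF_DEADLOCK;	/* a real kernel would panic here */
	return TOY_BUSY;			/* held by somebody else */
}

/* Release the lock; only the current holder may do so. */
static void
toy_mutex_exit(struct toy_mutex *m, const void *self)
{
	assert(m->owner == self);
	m->owner = NULL;
}
```

In this model a second toy_mutex_enter() by the holder, without a
toy_mutex_exit() in between, returns TOY_SELF_DEADLOCK. Whether
something like that is what happens to sched_mutex here is still a
guess.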

Might also be worth asking on tech-kern, see if anyone else has seen this.

Thanks,
Chris