Subject: Re: 4.99.17 still panics on TS7250
To: None <ali@df.lth.se>
From: Chris Gilbert <chris@dokein.co.uk>
List: port-arm
Date: 04/12/2007 00:30:54
On Wed, 11 Apr 2007 01:20:05 +0100
Chris Gilbert <chris@dokein.co.uk> wrote:

> Anders Lindgren wrote:
> >   Ok, noticing there have been a lot of updates in sys/kern etc during
> > Easter, I cvs up'd tonight and rebuilt a complete distribution, put a
> > copy of the TS7200 epe0 kernel in DESTDIR, MAKEDEV'd its /dev etc and
> > TFTP-booted the corresponding netbsd-epe0.bin image. Still *boom* with
> > an unmodified kernel.
> > 
> >   Noticed the following interesting tidbits:
> > 
> >   With default TS7200 kernel, at:
> > 
> > ---8<---
> > nfs_boot: my_addr=192.168.1.12
> > nfs_boot: my_mask=255.255.255.0
> > nfs_boot: gateway=192.168.1.1
> > root on 192.168.1.6:/export/tsarm
> > /etc/rc.conf is not configured.  Multiuser boot aborted.
> > Enter pathname of shell or RETURN for /bin/sh:
> > ---8<---
> > 
> >   If I press return or type /bin/sh, I get an immediate "locking against
> > myself" panic as described earlier.
> > 
> >   If I type "/bin/ksh" instead... it works.
> > 
> >   With an "options LOCKDEBUG" kernel, I don't seem to get a kernel panic
> > at all; at least I can configure rc and customize some /etc files with
> > vi, create a user and set passwords, set time with ntpdate, and boot all
> > the way to multi-user and run "find /" on the entire fs without problems
> > -- seems to work ok so far. Without it, I get the mutex error panic
> > pretty much instantly on attempt to start multiuser boot.
> > 
> >   The lock the kernel is crashing on (via sys_read ... pipe_read) is:
> > 
> >  COMMON         0x00000000c0516d8c      0x118 kern_synch.o
> >                 0x00000000c0516d8c                sched_mutex
> > 
> > ..which seems like a pretty bad thing to happen. :) I'm going to see
> > what happens if I boot a stock 3.1 release build instead later this week.
> > 
> >   Any help on how to proceed from here greatly appreciated.
> 
> My best guess is that we've messed up locking on arm somewhere. I'll try
> to get time to fully boot an arm box with -current and see if I can
> repro this.
> 
> It's odd that LOCKDEBUG makes it go away, which suggests a timing issue.
> The only arm code I can find that uses LOCKDEBUG is the pmap code, and
> this seems unrelated to that. Still, it's worth a shot: add a
> #define LOCKDEBUG to pmap.c and see if the problems go away.
> 
> cpuswitch.S does make calls to sched_lock and unlock, but I'm not sure
> if this is the same mutex or not.
> 
> Might also be worth asking on tech-kern, see if anyone else has seen this.

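For reference, a "locking against myself" panic roughly means a CPU tried to acquire a spin mutex it already holds (here apparently sched_mutex, going by the symbol Anders found). Below is a toy, single-threaded user-space model of that check. The names (toy_spin_mutex, toy_mutex_enter, toy_mutex_exit) are hypothetical and this is not the real kern_mutex.c code, just a sketch of the condition being detected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical toy model of a spin mutex; NOT NetBSD's implementation. */
struct toy_spin_mutex {
    bool locked;
    int owner;          /* id of the "CPU" holding the lock, -1 if free */
};

/*
 * Try to take the lock on behalf of `cpu`.
 * Returns  0 if acquired,
 *         -1 if `cpu` already holds it (the "locking against myself" case),
 *          1 if another CPU holds it (a real spin mutex would spin here).
 */
static int toy_mutex_enter(struct toy_spin_mutex *m, int cpu)
{
    if (m->locked && m->owner == cpu) {
        fprintf(stderr, "toy panic: locking against myself (cpu %d)\n", cpu);
        return -1;
    }
    if (m->locked)
        return 1;
    m->locked = true;
    m->owner = cpu;
    return 0;
}

/* Release the lock; releasing a lock that is not held is a caller bug. */
static void toy_mutex_exit(struct toy_spin_mutex *m)
{
    assert(m->locked);
    m->locked = false;
    m->owner = -1;
}
```

Calling toy_mutex_enter() twice for the same CPU without an exit in between hits the recursion case, which is the situation the real panic reports.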
I've just sync'd and built a fresh cats kernel, and I'm not seeing any
problems (though I probably need to update the userland, as it's likely
quite old).

Can you try a non-LOCKDEBUG kernel with:
options ARM_LOCK_CAS_DEBUG

It'll enable some event counters, which you'll be able to see in ddb
with "show event".

No idea if it'll help, but it might provide a bit more information.

Are you able to try with a local/USB disk rather than NFS? That would
show whether it's something to do with NFS.

Thanks,
Chris