port-arm: Help needed to fix NetBSD/shark

Subject: Help needed to fix NetBSD/shark
To: None <port-arm@NetBSD.org>
From: Julio M. Merino Vidal <jmmv84@gmail.com>
List: port-arm
Date: 07/30/2007 14:07:29
Hello,

[ CC'ing Jason Thorpe, as he's the NetBSD/shark port master ]
[ Also please CC me any replies ]

A recent change [1] in the NFS code has broken NetBSD/shark on
(supposedly) all configurations.  In fact, this change has only
exposed a long-standing bug [2] that already appeared when building
shark kernels without NFS support.

The problem manifests itself as follows: the kernel boots fine but,
once it has to hand control over to userland (spawn init), it locks up
completely.  The disk stalls and the keyboard is unresponsive.
According to the PR, delivering events through a serial console may
show some progress (which I believe, and will explain below), but I
don't have one to try.  However, booting with 'boot -a' lets the user
properly specify the boot device (i.e. the keyboard works fine at that
point, as we are still in the kernel).

I've been trying to track down the problem but, given that I'm not
familiar at all with ARM or the shark's hardware (not to mention the
kernel's code in this area), I'm lost.  The most probable cause seems
to lie in the interrupt handling code, which is probably broken
somewhere.

Here are some things I tried:

- Added a printf at the top of clockintr.  At some point the function
   is not called any more; the machine seems to discard the clock
   interrupts.

- Programming a repeating callout with a timeout of 1 (it has to be 1;
   a higher number does not work) makes the machine work.  I deduce
   from this that the soft interrupt, which then fires at each clock
   tick, "reenables" interrupts correctly on exit so that the next
   clock tick is properly received.  Since another callout has to be
   handled at every tick, the clock is properly reenabled at each
   step.  Using a timeout of, e.g., 2 makes the clock stall at the
   first tick on which no callout has to be handled.

- Based on the above, yamt@ mentioned "hardclock without
   softintr_schedule is broken?".  If I make "needsoftclock" be always
   true in kern_timeout.c:callout_hardclock, the machine works fine.
   Similarly, adding "_setsoftintr(0)" immediately after the call to
   hardclock() from clockintr() has the same effect, which is
   basically a simplification of the previous case.

- Making clockintr run at IPL_HIGH instead of IPL_CLOCK (by
   changing the call to isa_intr_establish in clock.c) makes the
   thing work too.

- In isa_irqhandler.c:irq_calculatemasks: if I change IPL_CLOCK to
   "happen" above IPL_HIGH as follows, the machine works:

   irqmasks[IPL_STATCLOCK] &= irqmasks[IPL_AUDIO];
   irqmasks[IPL_HIGH] &= irqmasks[IPL_STATCLOCK];
   irqmasks[IPL_CLOCK] &= irqmasks[IPL_HIGH];

   Note that IPL_SERIAL is already defined *after* IPL_HIGH, which is
   why the submitter of the PR was able to see some progress by
   generating serial activity.

   Similarly, if I move IPL_TTY to be after IPL_HIGH (but without
   moving IPL_CLOCK), the keyboard works when we are in userland --
   but the disk doesn't.  Oh well, or maybe I'm just impatient because
   the keyboard interrupts are slow.
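
To make the effect of that reordering concrete, here is a minimal,
userland-only sketch of the mask chaining.  It is not the code from
isa_irqhandler.c.  It assumes that irqmasks[ipl] has one bit per IRQ
line, set while that IRQ stays enabled at the given level (which is
what the "&=" chaining suggests).  The level values, the IRQ numbers
and every function name below are made up for the example, and the
"conventional" ordering is only my guess at a baseline for comparison.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical levels; the names mirror the ones quoted above, but the
 * numeric values are illustrative only. */
enum { IPL_AUDIO, IPL_STATCLOCK, IPL_CLOCK, IPL_HIGH, NIPL };

/* Hypothetical IRQ line numbers, chosen only for this example. */
#define IRQ_CLOCK 0
#define IRQ_AUDIO 5

/* Assumed convention: bit n of m[ipl] is set while IRQ n stays enabled
 * at that level.  Each level starts by blocking only its own IRQ. */
static void init_masks(uint32_t m[NIPL])
{
    for (int i = 0; i < NIPL; i++)
        m[i] = 0xffffffffu;
    m[IPL_AUDIO] &= ~(1u << IRQ_AUDIO);
    m[IPL_CLOCK] &= ~(1u << IRQ_CLOCK);
}

/* Conventional ordering (assumed baseline): IPL_HIGH inherits
 * everything that IPL_CLOCK blocks. */
void masks_clock_below_high(uint32_t m[NIPL])
{
    init_masks(m);
    m[IPL_STATCLOCK] &= m[IPL_AUDIO];
    m[IPL_CLOCK] &= m[IPL_STATCLOCK];
    m[IPL_HIGH] &= m[IPL_CLOCK];
}

/* Ordering from the experiment: IPL_CLOCK inherits from IPL_HIGH. */
void masks_clock_above_high(uint32_t m[NIPL])
{
    init_masks(m);
    m[IPL_STATCLOCK] &= m[IPL_AUDIO];
    m[IPL_HIGH] &= m[IPL_STATCLOCK];
    m[IPL_CLOCK] &= m[IPL_HIGH];
}

/* Returns 1 if the given IRQ line is still enabled at level ipl. */
int irq_enabled_at(const uint32_t m[NIPL], int ipl, int irq)
{
    assert(ipl >= 0 && ipl < NIPL);
    assert(irq >= 0 && irq < 32);
    return (int)((m[ipl] >> irq) & 1u);
}
```

Under these assumptions, the clock IRQ is blocked at IPL_HIGH with the
first ordering but stays enabled at IPL_HIGH with the second one, which
would match the observation that an interrupt placed "after" IPL_HIGH
keeps being delivered.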

I don't think I'm forgetting anything, and I hope that someone familiar
with the platform will be able to quickly spot the problem.

Any help (or fix!) will be highly appreciated.

Thank you.

1: http://mail-index.netbsd.org/source-changes/2007/07/27/0016.html
2: PR port-shark/22355

-- 
Julio M. Merino Vidal <jmmv84@gmail.com>