Subject: Help needed to fix NetBSD/shark
To: None <port-arm@NetBSD.org>
From: Julio M. Merino Vidal <firstname.lastname@example.org>
Date: 07/30/2007 14:07:29
[ CC'ing Jason Thorpe, as he's the NetBSD/shark port master ]
[ Also please CC me any replies ]
A recent change in the NFS code has broken NetBSD/shark on
(supposedly) all configurations. However, this change has
only exposed a long-standing bug that appeared when building
shark kernels without NFS support.
The problem manifests as follows: the kernel boots fine but, once it
has to hand control over to userland (spawn init), it locks up
completely. The disk stalls and the keyboard is unresponsive.
According to the PR, delivering events through a serial
console may show some progress (which I believe and will explain
below), but I don't have one to try. However, booting with 'boot -a'
lets the user properly specify the boot device (i.e. the keyboard
works fine at that point, as we are still in the kernel).
I've been trying to track down the problem but, given that I'm
familiar with neither ARM nor the shark's hardware (not to mention the
kernel's code in this area), I'm lost. The most probable cause seems
to lie in the interrupt handling code, which is probably broken in
Here are some things I tried:
- Added a printf at the top of clockintr. At some point the function
is not called any more; the machine seems to discard the clock
- Programming a repeating callout with a timeout of 1 (it has to be 1;
a higher number does not work) makes the machine work. I deduce
from this that the soft interrupt, which fires at each clock tick,
"reenables" interrupts correctly on exit so that the next clock
tick is properly received. Since another callout has to be handled
at that point, the clock is properly reenabled at each step. Using
a timeout of, e.g., 2 makes the clock stall at a tick on which
no callout has to be handled.
- Based on the above, yamt@ mentioned "hardclock without
softintr_schedule is broken?". If I make "needsoftclock" be always
true in kern_timeout.c:callout_hardclock, the machine works fine.
Similarly, adding "_setsoftintr(0)" immediately after the call to
hardclock() from clockintr() does the same, which is basically a
simplification of the previous case.
- Making clockintr happen at IPL_HIGH instead of IPL_CLOCK (by
changing the call to isa_intr_establish in clock.c) makes the
thing work too.
- In isa_irqhandler.c:irq_calculatemasks: if I change IPL_CLOCK to
"happen" above IPL_HIGH as follows, the machine works:
irqmasks[IPL_STATCLOCK] &= irqmasks[IPL_AUDIO];
irqmasks[IPL_HIGH] &= irqmasks[IPL_STATCLOCK];
irqmasks[IPL_CLOCK] &= irqmasks[IPL_HIGH];
Note that IPL_SERIAL is already defined *after* IPL_HIGH, which is
why the submitter of the PR was able to see some progress by
generating serial activity.
Similarly, if I move the IPL_TTY to be after IPL_HIGH (but without
moving IPL_CLOCK), the keyboard works when we are in userland --
but the disk doesn't. Or maybe I'm just impatient because
the keyboard interrupts are slow.
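The effect of the irqmasks reordering above can be sketched with a small user-space model. This is only an illustration, not the kernel code: it assumes that a set bit in irqmasks[] means "this IRQ line stays enabled at this level" (which is what the "&=" chaining suggests), the LVL_* names and IRQ_CLOCK = 0 are hypothetical, and the "assumed_stock" order is a guess at the unmodified chain, not something taken from isa_irqhandler.c.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical levels and IRQ line, for illustration only. */
enum { LVL_AUDIO, LVL_CLOCK, LVL_STATCLOCK, LVL_HIGH, NLEVELS };
enum { IRQ_CLOCK = 0 };

#define ALL_IRQS 0xffffu

/* Assumed unmodified chain: AUDIO -> CLOCK -> STATCLOCK -> HIGH. */
const int assumed_stock[NLEVELS] =
    { LVL_AUDIO, LVL_CLOCK, LVL_STATCLOCK, LVL_HIGH };
/* Chain from the post: AUDIO -> STATCLOCK -> HIGH -> CLOCK. */
const int reordered[NLEVELS] =
    { LVL_AUDIO, LVL_STATCLOCK, LVL_HIGH, LVL_CLOCK };

/* Every line enabled everywhere, except that the level at which the
 * clock handler runs disables the clock line itself. */
static void init_masks(uint32_t m[NLEVELS])
{
    for (int i = 0; i < NLEVELS; i++)
        m[i] = ALL_IRQS;
    m[LVL_CLOCK] &= ~(1u << IRQ_CLOCK);
}

/* Each level in the chain enables no more than the level before it. */
static void chain(uint32_t m[NLEVELS], const int order[NLEVELS])
{
    for (int i = 1; i < NLEVELS; i++) {
        assert(order[i] >= 0 && order[i] < NLEVELS);
        m[order[i]] &= m[order[i - 1]];
    }
}

/* Returns 1 if the clock line is still enabled at the given level. */
int clock_enabled_at(const int order[NLEVELS], int level)
{
    uint32_t m[NLEVELS];

    init_masks(m);
    chain(m, order);
    return (int)((m[level] >> IRQ_CLOCK) & 1u);
}

/* Prints the result for both orders at LVL_HIGH. */
void report(void)
{
    printf("assumed stock order, clock enabled at HIGH: %d\n",
        clock_enabled_at(assumed_stock, LVL_HIGH));
    printf("reordered chain,     clock enabled at HIGH: %d\n",
        clock_enabled_at(reordered, LVL_HIGH));
}
```

In this model, the reordered chain leaves the clock line unmasked at the level standing in for IPL_HIGH, whereas the assumed stock order masks it there. That is consistent with the observation that both this change and running clockintr at IPL_HIGH make the machine work, but it does not explain the underlying bug.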
I don't think I'm forgetting anything and I hope that someone familiar
with the platform will be able to quickly spot the problem.
Any help (or fix!) will be highly appreciated.
2: PR port-shark/22355
Julio M. Merino Vidal <email@example.com>