Re: Some alpha problems in current
On Wed, 6 Feb 2008, Jarle Greipsland wrote:
The machine survived 16 hours (until 04:55), at which point it panicked
with "fpsave ipi didn't". CPU 1 was hung and I couldn't get a backtrace
from it. This would also explain the panic - CPU 0 is waiting for the
other CPU to process the fpusave ipi, but it's probably spinning somewhere
and unable to process the ipi. This type of panic used to be fairly
frequent before, but I was able to track down the deadlock at the time and
that particular problem was fixed. It looks like it may be back again.
This is all very similar to the problems I reported in
I'd say it's the same problem.
I know what's happening, but not why or how to fix it.
The alpha uses lazy FP switching and doesn't save the FPU context when a
cpu switches processes. When the process begins executing again, the FPU
is disabled, and a trap will occur when it tries to execute an FP
instruction. If the process is running on the same CPU that currently has
the FPU context, it just enables the FPU and continues. If it's on a
different CPU, then the other CPU is told to save the FPU context so the
current CPU can restore it and continue. If the other CPU fails to
respond within a certain amount of time, you get the "fpsave ipi didn't"
panic. This usually means the other CPU has the IPI interrupt blocked
and may be spinning somewhere in the kernel (it could be blocked by the
current CPU, which results in a deadlock). The mutex routines are missing
some checks that will allow the alpha to check for IPI requests and
process them even when the interrupts are blocked, but even after I added
them I still saw the same panic (as well as hangs in the tlb_shootdown
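The lazy FP switching path described above can be sketched roughly as follows. This is only a user-space model, not the actual alpha code: the names struct proc, struct cpu, fp_trap(), fpsave_ipi_handler() and FPSAVE_SPIN_LIMIT are all made up for illustration, and IPI delivery is simulated by calling the handler directly whenever the target CPU does not have IPIs blocked.

```c
#include <stdbool.h>
#include <stddef.h>

#define NCPU              2
#define FPSAVE_SPIN_LIMIT 1000   /* hypothetical timeout, in polling iterations */

struct proc {
	int  fpcpu;      /* CPU holding this process's live FPU state, or -1 */
	bool fp_saved;   /* true once the state has been written back to memory */
};

struct cpu {
	bool ipi_blocked;        /* models a CPU running with IPIs masked */
	bool fpsave_pending;     /* models a pending "save FPU state" request */
	struct proc *fpcurproc;  /* process whose state is in this CPU's FPU */
};

static struct cpu cpus[NCPU];

/* What the remote CPU would do once it actually takes the IPI. */
static void
fpsave_ipi_handler(int cpu_id)
{
	struct cpu *c = &cpus[cpu_id];

	if (c->fpsave_pending && c->fpcurproc != NULL) {
		c->fpcurproc->fp_saved = true;
		c->fpcurproc->fpcpu = -1;
		c->fpcurproc = NULL;
	}
	c->fpsave_pending = false;
}

/*
 * Model of the FP-disabled trap taken on CPU "self" for process p.
 * Returns 0 on success, or -1 where a real kernel would panic because
 * the other CPU never answered the save request.
 */
static int
fp_trap(int self, struct proc *p)
{
	if (p->fpcpu == self)
		return 0;	/* state is still in our FPU: just re-enable it */

	if (p->fpcpu >= 0) {
		int other = p->fpcpu;

		cpus[other].fpsave_pending = true;
		for (int spin = 0; spin < FPSAVE_SPIN_LIMIT; spin++) {
			if (!cpus[other].ipi_blocked)
				fpsave_ipi_handler(other);	/* simulated delivery */
			if (p->fpcpu < 0)
				break;
		}
		if (p->fpcpu >= 0)
			return -1;	/* the "fpsave ipi didn't" case */
	}

	/* A real kernel would reload the saved FPU registers here. */
	cpus[self].fpcurproc = p;
	p->fpcpu = self;
	p->fp_saved = false;
	return 0;
}
```

In this model, setting ipi_blocked on the CPU that owns the FPU state reproduces the failure: the requester spins for the full limit and then gives up.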
Because the other CPU has blocked the IPI interrupts and is spinning
somewhere, the CPU won't get paused (via an IPI request) and you can't get
any context in DDB, and the CPU won't halt (again via an IPI request) when
halting the system. In the halt case, if it's the primary CPU requesting
the halt, it will give up after a short while and halt, but leaves the
secondary CPU running which will cause problems if you attempt to boot
without forcing it to halt via SRM. Currently, if a secondary CPU tells
the primary to halt, the primary will spin waiting for all secondary CPUs
to halt before it halts itself, but with no timeout, so it will sit there
forever until forced externally (halt switch, reset switch, or power).
I'm testing a fix now to have the primary CPU give up after a while when
it's told via an IPI request to halt. It will still leave one of the
secondary CPUs running, but should at least get back to SRM in that case.
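The general shape of a bounded wait like this might look as follows. It is a sketch only, not the change being tested: wait_for_cpus_to_halt(), the running-CPU bitmask and the spin limit are hypothetical names and parameters chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Spin until every CPU other than the caller has cleared its bit in
 * *running_mask, or until spin_limit polls have gone by.  Returns true
 * if all other CPUs halted, false if the caller gave up waiting.
 */
static bool
wait_for_cpus_to_halt(volatile uint32_t *running_mask, uint32_t self_bit,
    uint32_t spin_limit)
{
	for (uint32_t spin = 0; spin < spin_limit; spin++) {
		if ((*running_mask & ~self_bit) == 0)
			return true;
	}
	/* One last look, so a zero limit still reports the current state. */
	return (*running_mask & ~self_bit) == 0;
}
```

On a false return the caller would go ahead and halt anyway, accepting that a wedged secondary CPU is left running, rather than spinning forever.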
Michael L. Hitch mhitch%NetBSD.org@localhost