On Wed, 6 Feb 2008, Jarle Greipsland wrote:
The machine survived 16 hours (until 04:55), at which point it panicked with "fpsave ipi didn't". CPU 1 was hung and I couldn't get a backtrace from it. This would also explain the panic - CPU 0 is waiting for the other CPU to process the fpusave ipi, but it's probably spinning somewhere and unable to process the ipi. This type of panic used to be fairly frequent before, but I was able to track down the deadlock at the time and that particular problem was fixed. It looks like it may be back again.

This is all very similar to the problems I reported in port-alpha/37712.
I'd say it's the same problem. I know what's happening, but not why or how to fix it.

The alpha uses lazy FP switching and doesn't save the FPU context when a CPU switches processes. When the process begins executing again, the FPU is disabled, and a trap will occur when it tries to execute an FP instruction. If the process is running on the same CPU that currently has the FPU context, it just enables the FPU and continues. If it's on a different CPU, then the other CPU is told to save the FPU context so the current CPU can restore it and continue. If the other CPU fails to respond within a certain amount of time, you get the "fpsave ipi didn't" panic. This usually means the other CPU has the IPI interrupt blocked and may be spinning somewhere in the kernel (it could be blocked by the current CPU, which results in a deadlock). The mutex routines are missing some checks that would allow the alpha to check for IPI requests and process them even when the interrupts are blocked, but even after I added them I still saw the same panic (as well as hangs in the tlb_shootdown code).

Because the other CPU has the IPI interrupt blocked and is spinning somewhere, it won't get paused (via an IPI request), so you can't get any context from it in DDB, and it won't halt (again via an IPI request) when halting the system. In the halt case, if it's the primary CPU requesting the halt, it will give up after a short while and halt, but that leaves the secondary CPU running, which will cause problems if you attempt to boot without forcing it to halt via SRM. Currently, if a secondary CPU tells the primary to halt, the primary will spin waiting for all secondary CPUs to halt before it halts itself, but with no timeout, so it will sit there forever until forced externally (halt switch, reset switch, or power). I'm testing a fix now to have the primary CPU give up after a while when it's told via an IPI request to halt. It will still leave one of the secondary CPUs running, but should at least get back to SRM in that case.
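
To make the lazy FP switching sequence above concrete, here is a rough user-space sketch in C. It is not the NetBSD/alpha code: every name in it (cpu_sim, fp_state, fp_trap, cpu_poll_ipi, max_spins) is made up for illustration, and the "other CPU" is simulated by a polling function rather than a real interrupt. It only shows the three outcomes described: FPU context already on this CPU, context handed over by the other CPU, and the timeout that leads to the "fpsave ipi didn't" panic.

```c
#include <assert.h>
#include <stdbool.h>

#define NCPU 2  /* assumption: a two-CPU machine, as in the report */

/* Hypothetical outcome codes for the FP-disabled trap. */
enum fp_result { FP_ENABLED_LOCAL, FP_RESTORED_REMOTE, FP_TIMEOUT };

/* Hypothetical per-CPU state for the simulation. */
struct cpu_sim {
    bool ipi_blocked;     /* CPU has IPIs blocked and is spinning somewhere */
    bool fpsave_pending;  /* another CPU has asked it to save FPU context */
};

/* Hypothetical per-process FP bookkeeping. */
struct fp_state {
    int fpu_owner;        /* CPU whose FPU currently holds this context */
};

/* Simulated IPI delivery: the request is serviced only if IPIs are not blocked. */
void cpu_poll_ipi(struct cpu_sim *c)
{
    if (!c->ipi_blocked && c->fpsave_pending)
        c->fpsave_pending = false;  /* context saved, request acknowledged */
}

/* What the FP-disabled trap does on cur_cpu, with a bounded wait. */
enum fp_result fp_trap(struct cpu_sim cpus[NCPU], struct fp_state *p,
                       int cur_cpu, long max_spins)
{
    assert(cur_cpu >= 0 && cur_cpu < NCPU);
    assert(p->fpu_owner >= 0 && p->fpu_owner < NCPU);

    /* Same CPU still holds the context: just enable the FPU and continue. */
    if (p->fpu_owner == cur_cpu)
        return FP_ENABLED_LOCAL;

    /* Different CPU: ask it to save the context, then wait a bounded time. */
    struct cpu_sim *owner = &cpus[p->fpu_owner];
    owner->fpsave_pending = true;
    for (long i = 0; i < max_spins; i++) {
        cpu_poll_ipi(owner);
        if (!owner->fpsave_pending) {
            p->fpu_owner = cur_cpu;  /* restore the context on this CPU */
            return FP_RESTORED_REMOTE;
        }
    }

    /* Owner never answered; the real kernel panics at this point. */
    return FP_TIMEOUT;
}
```

With fpu_owner set to 1 and CPU 1's ipi_blocked set, a call such as fp_trap(cpus, &p, 0, 10) runs out of spins and returns FP_TIMEOUT, which is the hung-CPU case. The halt fix described above has the same shape: a spin loop that gives up after a limit instead of waiting forever.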
-- Michael L. Hitch mhitch%NetBSD.org@localhost