Re: Some alpha problems in current

To: Jarle Greipsland <jarle%uninett.no@localhost>
Subject: Re: Some alpha problems in current
From: "Michael L. Hitch" <mhitch%NetBSD.org@localhost>
Date: Wed, 6 Feb 2008 11:51:08 -0700 (MST)

On Wed, 6 Feb 2008, Jarle Greipsland wrote:

   The machine survived 16 hours (until 04:55), at which point it panicked
with "fpsave ipi didn't".  CPU 1 was hung and I couldn't get a backtrace
from it.  This would also explain the panic - CPU 0 is waiting for the
other CPU to process the fpusave ipi, but it's probably spinning somewhere
and unable to process the ipi.  This type of panic used to be fairly
frequent before, but I was able to track down the deadlock at the time and
that particular problem was fixed.  It looks like it may be back again.

This is all very similar to the problems I reported in
port-alpha/37712.


  I'd say it's the same problem.

  I know what's happening, but not why or how to fix it.

  The alpha uses lazy FP switching and doesn't save the FPU context when a
CPU switches processes. When the process begins executing again, the FPU
is disabled, and a trap will occur when it tries to execute an FP
instruction. If the process is running on the same CPU that currently has
the FPU context, it just enables the FPU and continues. If it's on a
different CPU, then the other CPU is told to save the FPU context so the
current CPU can restore it and continue. If the other CPU fails to
respond within a certain amount of time, you get the "fpsave ipi didn't"
panic. This usually means the other CPU has the IPI interrupt blocked
and may be spinning somewhere in the kernel (it could be blocked by the
current CPU, which results in a deadlock). The mutex routines are missing
some checks that would allow the alpha to check for IPI requests and
process them even when interrupts are blocked, but even after I added
them I still saw the same panic (as well as hangs in the tlb_shootdown
code).
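
  To make the sequence above concrete, here is a small single-threaded
model of the FP trap path. It is only a sketch, not the NetBSD/alpha
code: the names (fp_trap, service_ipis, IPI_SPIN_LIMIT, struct cpu) and
the spin limit are made up for illustration, and the "other CPU" is
simulated by calling service_ipis() from inside the wait loop.

```c
#include <assert.h>
#include <stdbool.h>

#define NCPU           2
#define IPI_SPIN_LIMIT 1000   /* hypothetical timeout, in spin iterations */

enum fp_result {
    FP_ENABLED_LOCAL,    /* this CPU already held the FP state */
    FP_RESTORED_REMOTE,  /* another CPU saved the state on request */
    FP_LOADED_FRESH,     /* no CPU held live state for the process */
    FP_IPI_TIMEOUT       /* where the real kernel would panic */
};

struct cpu {
    int  fpu_owner;       /* pid whose FP state is live here, -1 if none */
    bool ipi_blocked;     /* true while this CPU has IPIs masked */
    bool fpsave_pending;  /* an "fpsave" request has been posted */
};

static struct cpu cpus[NCPU];

static void model_init(void)
{
    for (int i = 0; i < NCPU; i++) {
        cpus[i].fpu_owner = -1;
        cpus[i].ipi_blocked = false;
        cpus[i].fpsave_pending = false;
    }
}

/* What the remote CPU does when (and only when) it can take IPIs. */
static void service_ipis(struct cpu *c)
{
    if (c->ipi_blocked || !c->fpsave_pending)
        return;
    c->fpu_owner = -1;          /* FP state written back to memory */
    c->fpsave_pending = false;
}

/* Model of the trap taken on the first FP instruction after a switch. */
static enum fp_result fp_trap(int curcpu, int pid)
{
    assert(curcpu >= 0 && curcpu < NCPU);

    if (cpus[curcpu].fpu_owner == pid)
        return FP_ENABLED_LOCAL;        /* just re-enable the FPU */

    for (int i = 0; i < NCPU; i++) {
        if (i == curcpu || cpus[i].fpu_owner != pid)
            continue;
        cpus[i].fpsave_pending = true;  /* "send" the fpsave IPI */
        for (int spin = 0; spin < IPI_SPIN_LIMIT; spin++) {
            service_ipis(&cpus[i]);     /* stands in for the other CPU */
            if (!cpus[i].fpsave_pending) {
                cpus[curcpu].fpu_owner = pid;
                return FP_RESTORED_REMOTE;
            }
        }
        return FP_IPI_TIMEOUT;          /* "fpsave ipi didn't" */
    }

    cpus[curcpu].fpu_owner = pid;       /* nothing live elsewhere */
    return FP_LOADED_FRESH;
}
```

  In this model, setting ipi_blocked on the CPU that owns the FP state is
enough to make fp_trap() on the other CPU return FP_IPI_TIMEOUT, which
corresponds to the panic described above.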

  Because the other CPU has blocked IPI interrupts and is spinning
somewhere, it won't get paused (via an IPI request), so you can't get
any context in DDB, and it won't halt (again via an IPI request) when
halting the system. In the halt case, if it's the primary CPU requesting
the halt, it will give up after a short while and halt, but that leaves
the secondary CPU running, which will cause problems if you attempt to
boot without forcing it to halt via SRM. Currently, if a secondary CPU
tells the primary to halt, the primary will spin waiting for all
secondary CPUs to halt before it halts, but with no timeout, so it will
sit there forever until forced externally (halt switch, reset switch, or
power). I'm testing a fix now to have the primary CPU give up after a
while when it's told via an IPI request to halt. It will still leave one
of the secondary CPUs running, but should at least get back to SRM in
that case.
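
  The halt-with-timeout idea can be sketched the same way. This is a toy
model under stated assumptions, not the fix being tested: primary_halt,
deliver_halt_ipi, struct halt_model and HALT_SPIN_LIMIT are hypothetical
names, and because the model is single-threaded a CPU with IPIs blocked
never changes state during the wait.

```c
#include <assert.h>
#include <stdbool.h>

#define NCPU            4
#define HALT_SPIN_LIMIT 1000  /* hypothetical give-up point, in iterations */

struct halt_model {
    bool halted[NCPU];        /* CPU has stopped and returned to firmware */
    bool ipi_blocked[NCPU];   /* CPU has IPIs masked and will not respond */
};

/* A CPU only acts on the halt request if it can take the IPI. */
static void deliver_halt_ipi(struct halt_model *m, int cpu)
{
    if (!m->ipi_blocked[cpu])
        m->halted[cpu] = true;
}

/* Count the secondaries that have not halted yet. */
static int secondaries_running(const struct halt_model *m, int primary)
{
    int running = 0;

    for (int cpu = 0; cpu < NCPU; cpu++)
        if (cpu != primary && !m->halted[cpu])
            running++;
    return running;
}

/*
 * The primary asks every secondary to halt, waits a bounded time, then
 * halts itself regardless.  Returns the number of secondaries that were
 * still running when it gave up (0 means a clean halt).
 */
static int primary_halt(struct halt_model *m, int primary)
{
    int running = 0;

    for (int cpu = 0; cpu < NCPU; cpu++)
        if (cpu != primary)
            deliver_halt_ipi(m, cpu);

    for (int spin = 0; spin < HALT_SPIN_LIMIT; spin++) {
        running = secondaries_running(m, primary);
        if (running == 0)
            break;            /* everyone stopped; no need to keep waiting */
    }

    m->halted[primary] = true;  /* give up waiting and halt anyway */
    return running;
}
```

  A nonzero return value corresponds to the case described above: the
primary gets back to the firmware, but a stuck secondary is left running
and would still need to be halted from SRM before booting again.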


--
Michael L. Hitch                        mhitch%NetBSD.org@localhost
