Subject: Re: Processor correctable error?
To: Matthew Jacob <mjacob@feral.com>
From: Chris G. Demetriou <cgd@pa.dec.com>
List: port-alpha
Date: 06/10/1998 15:07:17
> In a lot of the alpha platforms, these
> are "FYI" kinds of errors. In the TurboLaser (AlphaServer) this can
> be a report of an ECC error which then the system is expected to
> initiate an ECC memory scrub to recover from (which I'm in the
> middle of wandering towards getting done)- which has the peculiar
> effect of printing on the console "System Correctable Error"
> whereupon the system freezes and *doesn't* recover (correct). A
> loud "Not!" seems appropriate until I finish this.

So, if reported correctable errors require actual work on the part of
the OS, we should probably turn reporting off.  The existing code
should function correctly, i.e. not hang, if my understanding of how
correctable error handling works is right.

To quote the green book:

> The MIP flag in the MCES register is set prior to invoking the machine
> check handler.  If the MIP flag is set when a machine check is being
> initiated, a double machine check halt is initiated instead.  The
> machine check handler needs to clear the MIP flag when it can handle a
> new machine check.
>
> Similarly, the SCE or PCE flag in the MCES register is set prior to
> invoking the appropriate correctable error handler.  That error
> handler should clear the appropriate correctable error in progress[sic]
> when the logout area can be reused by hardware or PALcode.  PALcode
> does not overwrite the logout area.
>
> Correctable processor or system error reporting may be suppressed by
> setting the respective DPC or DSC flag in the MCES register.  When the
> DPC or DSC flag is set, the corresponding error is corrected, but no
> correctable error interrupt is generated.

The exact behaviour, specifically whether or not the error is
corrected by the PALcode if the 'disable' bits are not set, isn't
specified (as far as I can tell 8-).  However, the 'but' in the last
sentence of the last paragraph indicates to me that in the
disabled-bit-not-set case, the correctable error is corrected and then
a correctable error interrupt is generated.  (I.e. "the handling is
the same, but if 'disabled' is set, no interrupt is generated.")
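
To make that reading concrete, here is a rough C model of it.  This is
only a sketch of my interpretation of the quoted text, not of what any
particular PALcode actually does.  The bit positions are the MCES
layout as I remember it and should be checked against the manual; the
function and type names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* MCES bits.  ASSUMPTION: positions recalled from memory, verify them. */
#define MCES_MIP UINT64_C(0x01) /* machine check in progress */
#define MCES_SCE UINT64_C(0x02) /* system correctable error in progress */
#define MCES_PCE UINT64_C(0x04) /* processor correctable error in progress */
#define MCES_DPC UINT64_C(0x08) /* disable processor correctable reporting */
#define MCES_DSC UINT64_C(0x10) /* disable system correctable reporting */

/* Hypothetical names, used only by this model. */
enum corr_err { CORR_PROCESSOR, CORR_SYSTEM };

struct corr_result {
	bool corrected;  /* did PALcode/hardware correct the error? */
	bool interrupt;  /* was a correctable error interrupt raised? */
	uint64_t mces;   /* MCES contents afterwards */
};

/*
 * Model of the interpretation above: the error is corrected either
 * way, and the disable bit only decides whether the OS hears about it.
 */
static struct corr_result
model_correctable_error(uint64_t mces, enum corr_err kind)
{
	assert(kind == CORR_PROCESSOR || kind == CORR_SYSTEM);

	uint64_t disable = (kind == CORR_PROCESSOR) ? MCES_DPC : MCES_DSC;
	uint64_t in_progress = (kind == CORR_PROCESSOR) ? MCES_PCE : MCES_SCE;
	struct corr_result r = { true, false, mces };

	if ((mces & disable) == 0) {
		/* Reporting enabled: flag is set, then the handler is invoked. */
		r.mces |= in_progress;
		r.interrupt = true;
	}
	return r;
}
```

If the other reading is the right one (no correction unless the disable
bit is set), the "corrected" field would be false in the reporting
case, and the OS handler would have to do the real work.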

Reporting of correctable errors is nice, but it's not clear to me that
it's worth the trouble if the PALcode will handle them for us if and
only if we turn reporting off.
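
For what it's worth, turning reporting off should amount to one MCES
write.  The sketch below models that write in plain C so the bit
handling can be checked without hardware.  It assumes (from memory, so
verify it) that the three in-progress flags are write-one-to-clear and
that DPC/DSC simply take whatever is written.  All names here are
hypothetical, and model_wrmces() stands in for the real PAL call.

```c
#include <assert.h>
#include <stdint.h>

/* MCES bits.  ASSUMPTION: positions recalled from memory, verify them. */
#define MCES_MIP UINT64_C(0x01)
#define MCES_SCE UINT64_C(0x02)
#define MCES_PCE UINT64_C(0x04)
#define MCES_DPC UINT64_C(0x08)
#define MCES_DSC UINT64_C(0x10)

#define MCES_W1C     (MCES_MIP | MCES_SCE | MCES_PCE) /* in-progress flags */
#define MCES_DISABLE (MCES_DPC | MCES_DSC)            /* reporting disables */

/*
 * Model of an MCES write under the assumption above: a 1 written to an
 * in-progress flag clears it, and the disable bits are copied as written.
 */
static uint64_t
model_wrmces(uint64_t current, uint64_t written)
{
	uint64_t next = current & MCES_W1C & ~written;
	return next | (written & MCES_DISABLE);
}

/* Value to write to suppress both kinds of correctable error report. */
static uint64_t
mces_reporting_off(void)
{
	/* Zeros in the in-progress bits leave those flags untouched. */
	return MCES_DISABLE;
}

/*
 * Value to write when a handler is done with the logout area: clear
 * only the flags in 'done', and keep the current disable settings.
 */
static uint64_t
mces_ack(uint64_t current, uint64_t done)
{
	assert((done & ~MCES_W1C) == 0);
	return (current & MCES_DISABLE) | done;
}
```

Note that, under these assumptions, writing back the value just read
would clear every in-progress flag at once, which is why mces_ack()
builds the value explicitly.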



cgd