Subject: Re: Processor correctable error?
To: None <cgd@pa.dec.com, mjacob@feral.com>
From: Matthew Jacob <mjacob@feral.com>
List: port-alpha
Date: 06/10/1998 15:31:39
>> In a lot of the alpha platforms, these
>> are "FYI" kinds of errors. In the TurboLaser (AlphaServer) this can
>> be a report of an ECC error, from which the system is expected to
>> recover by initiating an ECC memory scrub (which I'm in the middle
>> of wandering towards getting done). Right now this has the peculiar
>> effect of printing "System Correctable Error" on the console,
>> whereupon the system freezes and *doesn't* recover (correct). A
>> loud "Not!" seems appropriate until I finish this.
>
>So, if reported correctable errors require actual work on the part of
>the OS, we should probably turn reporting off.  The existing code
>should function correctly, i.e. not hang, if my understanding of how
>correctable error handling works is right.
>
>To quote the green book:
>
>> The MIP flag in the MCES register is set prior to invoking the machine
>> check handler.  If the MIP flag is set when a machine check is being
>> initiated, a double machine check halt is initiated instead.  The
>> machine check handler needs to clear the MIP flag when it can handle a
>> new machine check.
>>
>> Similarly, the SCE or PCE flag in the MCES register is set prior to
>> invoking the appropriate correctable error handler.  That error
>> handler should clear the appropriate correctable error in progress[sic]
>> when the logout area can be reused by hardware or PALcode.  PALcode
>> does not overwrite the logout area.
>>
>> Correctable processor or system error reporting may be suppressed by
>> setting the respective DPC or DSC flag in the MCES register.  When the
>> DPC or DSC flag is set, the corresponding error is corrected, but no
>> correctable error interrupt is generated.
>
>The exact behaviour, specifically whether or not the error is
>corrected by the PALcode if the 'disable' bits are not set, isn't
>specified (that I can tell 8-).  However, the 'but' in the last
>sentence of the last paragraph indicates to me that in the
>disabled-bit-not-set case, the correctable error is corrected and then
>a correctable error interrupt is generated.  (I.e. "the handling is
>the same, but if 'disabled' is set, no interrupt is generated.")
>
>Reporting of correctable errors is nice, but it's not clear to me that
>it's worth the trouble if the PALcode will handle them for us if and
>only if we turn reporting off.

Chris - I'll have to ponder this. Ultimately, most of this stuff
will get covered under DIAGNOSTIC anyway (for reporting to the console
about errors), but I really can't quite believe that you've just
made an argument that goes 'Disable Correctable Error Reporting
and All Will Be Well'- which is how I have (mis?)understood your
mail to read.
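
For reference, a rough sketch of the MCES flags the green book passage
talks about, as a small software model rather than real kernel code.
The bit positions and the write-one-to-clear behaviour of MIP/SCE/PCE
are from my memory of the architecture handbook and should be checked
against it; the mces_* function names are made up for illustration, and
real code would go through the PALcode MCES read/write calls instead.

```c
#include <stdint.h>

/* MCES flag bits. ASSUMPTION: positions recalled from the Alpha
 * architecture handbook; verify before relying on them. */
#define MCES_MIP 0x01u	/* machine check in progress */
#define MCES_SCE 0x02u	/* system correctable error in progress */
#define MCES_PCE 0x04u	/* processor correctable error in progress */
#define MCES_DPC 0x08u	/* disable processor correctable error reporting */
#define MCES_DSC 0x10u	/* disable system correctable error reporting */

#define MCES_INPROGRESS	(MCES_MIP | MCES_SCE | MCES_PCE)
#define MCES_DISABLE	(MCES_DPC | MCES_DSC)

/* Hypothetical model of a write to MCES. ASSUMPTION: writing a 1 to an
 * in-progress flag clears it; the disable flags take the written value. */
static uint64_t
mces_model_write(uint64_t cur, uint64_t val)
{
	uint64_t next = cur & ~(val & MCES_INPROGRESS);

	return (next & ~(uint64_t)MCES_DISABLE) | (val & MCES_DISABLE);
}

/* Value a correctable error handler would write once the logout area
 * can be reused: acknowledge SCE/PCE, keep the disable flags as-is. */
static uint64_t
mces_ack_value(uint64_t cur)
{
	return cur & (MCES_SCE | MCES_PCE | MCES_DISABLE);
}

/* Value to write to suppress both kinds of correctable error reports
 * without acknowledging anything that is currently in progress. */
static uint64_t
mces_disable_value(void)
{
	return MCES_DISABLE;
}
```

With this model, a handler that returns without writing mces_ack_value()
leaves SCE or PCE set, which is one way to picture the logout area never
being handed back. Whether that is what actually wedges the TurboLaser
is not something the sketch can show.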