Subject: Re: Processor correctable error?
To: None <mjacob@feral.com>
From: Ross Harvey <ross@teraflop.com>
List: port-alpha
Date: 06/11/1998 16:28:37
> From mjacob@feral.com Thu Jun 11 15:30:17 1998
>
> I still want to know whether, for correctable errors, the kernel should
> be doing any "corrective" action.
>

The only error that is really "correctable" is the CRD machine check, where
the on-chip ECC fixes a single-bit error from DRAM.

The simple answer is: no. The interrupt is just so you can log it. In fact,
the hardware has already corrected the bad bits on the bus transaction that
triggered the sequence before the palcode even started. The palcode then
scrubs the cache line to force a write (really a copyback), and then forces
a fill.  Finally, with everything done, it interrupts system software.
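
To make the "log only" point concrete, here is a minimal sketch of what the
system-software side amounts to. The names (crd_stats, crd_note, crd_format)
are made up for illustration and are not the actual NetBSD code; the only
assumption is that the handler is handed the physical address of the error.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical bookkeeping; not the real NetBSD structures. */
struct crd_stats {
	uint64_t count;     /* corrected errors seen so far */
	uint64_t last_addr; /* physical address of the most recent one */
};

/*
 * Runs after the hardware and palcode have already corrected, scrubbed
 * and refilled. It touches no memory or cache state; it only records
 * the event and returns the running total.
 */
static uint64_t
crd_note(struct crd_stats *st, uint64_t paddr)
{
	assert(st != NULL);
	st->count++;
	st->last_addr = paddr;
	return st->count;
}

/* Format one log line into buf; returns the snprintf() result. */
static int
crd_format(const struct crd_stats *st, char *buf, size_t len)
{
	return snprintf(buf, len, "CRD #%llu at pa 0x%llx (already corrected)",
	    (unsigned long long)st->count,
	    (unsigned long long)st->last_addr);
}
```

There is deliberately no "corrective" step in it, since by the time this
runs there is nothing left to correct.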

The more complicated answer is: well, there are some things that have a very
marginal benefit and some possible drawbacks that NetBSD _could_ be doing.
We could stop logging, for one thing.  We could turn off the machine check
to save the time in locore.s and interrupt.c. This doesn't help as much as
you might think because the palcode trap still happens and still has to do
the rather expensive (because it is non-cacheable) scrub op.

Some of the chips (the ev5*, for example) have a bit in BC_CONTROL that does
flow-through error correction transparently. We _could_ look at what CPU we
are on and switch into this mode if the error frequency is too high; it
would go faster and _might_ make a recursive machine check less likely.
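
The "error frequency is too high" test could be as simple as a counter over
a fixed time window. The sketch below is only that: the window length, the
threshold, and all the names are arbitrary assumptions, and it says nothing
about how the mode switch itself would be done.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Both values are assumptions picked for illustration. */
#define CRD_WINDOW_TICKS 1000u /* window length, arbitrary time units */
#define CRD_THRESHOLD    8u    /* errors per window considered "too high" */

struct crd_rate {
	uint64_t window_start; /* time the current window opened */
	unsigned in_window;    /* errors counted in the current window */
};

/*
 * Record one corrected error at time `now` and report whether the count
 * in the current window has reached the threshold. A new window starts
 * whenever the old one has expired.
 */
static bool
crd_rate_too_high(struct crd_rate *r, uint64_t now)
{
	assert(r != NULL);
	if (now - r->window_start >= CRD_WINDOW_TICKS) {
		r->window_start = now;
		r->in_window = 0;
	}
	r->in_window++;
	return r->in_window >= CRD_THRESHOLD;
}
```

With these numbers, the eighth error inside one window trips the check, and
a lone error long afterwards starts a fresh window and does not.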

But on the downside, we would be operating below the nice architecture layer,
on something that might not be a win, and which is only relevant when the
hardware needs fixing anyway. These days there isn't much pressure to
use RAM that is bad. :-) Also, the BC_CONTROL register is write-only, so we
would have to do some highly dangerous magic just to get an _alleged_ copy of
the bits before blindly clearing one of them and writing a register that
affects all kinds of critical timing parameters.
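
For what it's worth, the usual way to live with a write-only register is a
software shadow copy, which is exactly where the "alleged" comes from: the
shadow is only as good as whatever seeded it. A rough sketch, with a fake
register standing in for the hardware and a made-up bit position (MODE_BIT
is NOT the real BC_CONTROL layout):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODE_BIT (1ULL << 4) /* hypothetical bit, for illustration only */

/* Stand-in for the real write-only register, so the sketch can run. */
static uint64_t fake_hw;
static void fake_write(uint64_t v) { fake_hw = v; }

struct wo_reg {
	uint64_t shadow;        /* what we believe the register holds */
	bool shadow_valid;      /* false until someone has seeded shadow */
	void (*write)(uint64_t);
};

/*
 * Set and clear bits via the shadow copy, then write the whole value.
 * Refuses to write anything if the shadow was never seeded, because a
 * guess would clobber every other field in the register.
 */
static bool
wo_reg_update(struct wo_reg *r, uint64_t set, uint64_t clear)
{
	assert(r != NULL && r->write != NULL);
	if (!r->shadow_valid)
		return false;
	uint64_t v = (r->shadow | set) & ~clear;
	r->write(v);
	r->shadow = v;
	return true;
}
```

The refusal path is the whole point: if nothing trustworthy seeded the
shadow, the safest update is no update.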