Subject: Re: Processor correctable error?
To: None <port-alpha@NetBSD.ORG>
From: Ross Harvey <ross@teraflop.com>
List: port-alpha
Date: 06/11/1998 14:09:57
Regarding correctible errors...
cgd> ...
cgd> I interpreted the surrounding text to mean that if reporting is not
cgd> disabled, they'll still be corrected, and that additionally the error
cgd> will be reported. However, that interpretation may be incorrect.
cgd>
cgd> If my interpretation was incorrect (and some ideas on the matter from
cgd> those more familiar with PALcode would help; Ross?), then you're faced
cgd> with a tradeoff:
cgd> ...
You were correct: the PALcode doesn't even check mces$v_dpc until after
attempting a scrub. I didn't check the ev4 PALcode, but it's clear that the
intended meaning of the DPC bit is to suppress only the report interrupt.
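To make the ordering concrete, here is a rough C sketch of that flow. It is
not the PALcode itself (which is assembly); the function and struct names are
made up for illustration, and the bit values are assumed to follow the MCES
layout in the Alpha architecture manual (PCE = bit 2, DPC = bit 3).

```c
#include <stdbool.h>

/* Assumed MCES bit positions (illustrative, per the architecture manual). */
#define MCES_PCE 0x04u /* processor correctable error has occurred */
#define MCES_DPC 0x08u /* disable processor correctable error reporting */

/* Hypothetical result record: what the handler ended up doing. */
struct cerr_outcome {
    bool scrubbed; /* a scrub (correction) was attempted */
    bool reported; /* the report interrupt was delivered */
};

/*
 * Hypothetical model of the correctable-error path: the scrub is
 * attempted unconditionally, and DPC is consulted only afterwards
 * to decide whether the error is also reported.
 */
static struct cerr_outcome handle_correctable(unsigned mces)
{
    struct cerr_outcome out = { false, false };

    out.scrubbed = true;          /* step 1: always attempt the scrub */

    if ((mces & MCES_DPC) == 0)   /* step 2: only now look at DPC */
        out.reported = true;      /* reporting enabled: raise the report */

    return out;
}
```

With DPC clear, the sketch both scrubs and reports; with DPC set, it still
scrubs but stays quiet, which matches the interpretation above.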
Regarding uncorrectable errors:
mts> ok, here's another machine check. This one comes from one of the EB164's
mts> i'm playing with, all of them exhibit the same machine check. It occurs
mts> when working with the second aha2940...
Based on this information, our stock answer ("bad memory") obviously doesn't
apply; it appears to be a PCI error. Did you get a chance to try Jason's
new bwx kernel? How does that work? If it is still losing and the failure is
repeatable, you may need to start from the dbx hints you already have and
resort to inserting printfs to localize the failure.
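
For the printf approach, something along these lines may be enough. It is
only a sketch: TRACE and trace_format are hypothetical names, not anything
in the tree, and the message format is arbitrary.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: build a "file:line: func: reached <what>" marker.
 * Returns whatever snprintf returns (the length it wanted to write).
 */
static int trace_format(char *buf, size_t len, const char *file, int line,
                        const char *func, const char *what)
{
    return snprintf(buf, len, "%s:%d: %s: reached %s",
                    file, line, func, what);
}

/*
 * Hypothetical checkpoint macro: print a marker naming the call site.
 * Scatter these along the suspect path; the last marker printed before
 * the machine check brackets the failing access.
 */
#define TRACE(what)                                              \
    do {                                                         \
        char trace_buf_[160];                                    \
        trace_format(trace_buf_, sizeof trace_buf_,              \
                     __FILE__, __LINE__, __func__, (what));      \
        printf("%s\n", trace_buf_);                              \
    } while (0)
```

Put a TRACE("...") before and after each suspect bus access, then keep
halving the distance between the last marker seen and the first one missing
until the failing access is isolated.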