Subject: "fatal machine check or error (unknown type)" from power supply issues?
To: NetBSD/Alpha Discussion List <>
From: Greg A. Woods <>
List: port-alpha
Date: 12/01/2006 10:54:41

The following has now happened a couple of times on my customer's ES40.
The second crash coincided exactly with the removal of power from one of
its three (full N+1) power supplies, and there is now a good deal of
certainty that the last machine check panic also coincided with power
problems (they've been rebuilding their datacentre UPS and moving
machines back and forth between power sources):

fatal machine check or error (unknown type):

    mces    = 0x0
    vector  = 0x680
    param   = 0xfffffc0000006148
    pc      = 0xfffffc00003bde04
    ra      = 0xfffffc00003d3e14
    code    = 0x100000206
    curproc = 0xfffffc00f5000008
        pid = 29554, comm = pop3d

panic: machine check
Stopped in pid 29554 (pop3d) at cpu_Debugger+0x4:       ret     zero,(ra)

As you can see, though, the mces value is zero, which leaves the code
with nothing to decode to determine the cause of the interrupt.
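
For context, here is a minimal sketch of the kind of check that appears
to produce the "unknown type" message. It is not the actual NetBSD
source: the MCES_* names and both helper functions are hypothetical, and
the bit values (MIP = 0x1, SCE = 0x2, PCE = 0x4) are assumed from the
Alpha architecture's MCES register layout. With mces = 0x0 none of the
three bits is set, so there is nothing left to classify:

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed MCES bit layout (hypothetical names, not the kernel's). */
#define MCES_MIP 0x01ULL  /* machine check in progress */
#define MCES_SCE 0x02ULL  /* system correctable error */
#define MCES_PCE 0x04ULL  /* processor correctable error */

/* Returns 1 if at least one recognised cause bit is set, else 0. */
static int mces_known(uint64_t mces)
{
    return (mces & (MCES_MIP | MCES_SCE | MCES_PCE)) != 0;
}

/* Prints every recognised cause bit found in mces, or "unknown type". */
static void mces_describe(uint64_t mces)
{
    if (!mces_known(mces)) {
        printf("mces 0x%llx: unknown type\n", (unsigned long long)mces);
        return;
    }
    if (mces & MCES_MIP)
        printf("mces 0x%llx: machine check\n", (unsigned long long)mces);
    if (mces & MCES_SCE)
        printf("mces 0x%llx: system correctable error\n",
            (unsigned long long)mces);
    if (mces & MCES_PCE)
        printf("mces 0x%llx: processor correctable error\n",
            (unsigned long long)mces);
}
```

Calling mces_describe(0x0) with the value from the panic above takes
the "unknown type" branch, which is the situation described here.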

Is it possible there's some other value, besides what alpha_pal_rdmces()
returns, which should also be examined on these newer machines?

						Greg A. Woods
						Planix, Inc.

<>     +1 416 489-5852 x122
