Subject: Re: what's this machine check mean?
To: None <port-alpha@netbsd.org>
From: Jason R Thorpe <thorpej@zembu.com>
List: port-alpha
Date: 04/15/2000 10:42:09
On Sat, Apr 15, 2000 at 03:33:09AM -0400, der Mouse wrote:

 > This finished - yay, it actually finished in finite time!  I've now had
 > three machine checks, two at 0xfffffc000050f3c8 and one at
 > 0xfffffc000050f3e4.

One thing to remember about machine checks is that they're not precise
traps.  Without doing a trap barrier explicitly, you may not get the
exception until several instructions later.

 > 0xfffffc000050f3c8 is lca_mem_read_1+0x48 (+72)
 > 0xfffffc000050f3e4 is lca_mem_read_1+0x64 (+100)
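
As a rough illustration of how those offsets relate to the reported
addresses, here is a small sketch.  The struct and function names are
made up for this example, and it assumes only that Alpha instructions
are a fixed 4 bytes.  Subtracting each offset from its PC gives the same
function base for both, and the two PCs sit 7 instructions apart:

```c
#include <assert.h>
#include <stdint.h>

/* Alpha instructions are a fixed 32 bits wide. */
#define ALPHA_INSN_BYTES 4u

/* Hypothetical record: a reported machine check PC and its symbol offset. */
struct mcheck {
	uint64_t pc;      /* address reported with the machine check */
	uint64_t offset;  /* byte offset into the function, e.g. 0x48 */
};

/* Start address of the function the PC falls in. */
static uint64_t
func_base(struct mcheck m)
{
	assert(m.offset <= m.pc);
	return m.pc - m.offset;
}

/* Zero-based index of the instruction at this PC within the function. */
static uint64_t
insn_index(struct mcheck m)
{
	return m.offset / ALPHA_INSN_BYTES;
}

/* Number of instructions between two PCs (earlier first). */
static uint64_t
insn_gap(struct mcheck earlier, struct mcheck later)
{
	assert(earlier.pc <= later.pc);
	return (later.pc - earlier.pc) / ALPHA_INSN_BYTES;
}
```

Feeding in the two addresses quoted above (offsets 0x48 and 0x64) yields
a common base of 0xfffffc000050f380.  Since the trap is imprecise, the
load that actually faulted may be some instructions before either PC.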

...

 > It clearly means something that they're all in that function, two of
 > them at the same address even, but darned if I can tell what.  What's
 > this LCA thing?  It looks as though it's some kind of bus, but the
 > comments about "can have only one" make it sound like a semi-fake bus.
 > But on the other hand it seems to correspond to a real chip.

LCA stands for "Low Cost Alpha", i.e. the 21066 and 21068 chips.  These
are Alpha CPUs with built-in PCI bus interfaces.  `lca' is the thing that
the primary `pci' is logically attached to.  The routine you're seeing
is the one that does 1-byte reads from PCI memory space.

I'm guessing you have a Multia.  As much as I hate to say it, I'm pretty
sure your Multia is experiencing Multia Heat Death.  The machine checks
are an early sign.  There's not really much you can do about it at this
point besides applying the hardware fix, which is documented on the
NetBSD/alpha web pages.

-- 
        -- Jason R. Thorpe <thorpej@zembu.com>