Subject: Re: port-alpha/5546: port-alpha/lost a stack? exception_restore_regs bombs
To: Ross Harvey <ross@teraflop.com>
From: Chris G. Demetriou <cgd@pa.dec.com>
List: port-alpha
Date: 06/05/1998 17:21:07
Ross said:
> The short version of the story is:
> 
> 	"I think it's a hardware problem."

Many sources of mchecks are, in fact, hardware errors.  However, i'm
not at all convinced that this one would be.  (I've no evidence to
back that up, but I don't think i've any evidence to back up any
contrary claim, either, since nobody's interpreted that logout frame
for us all.  8-)

The rest of this message is a WAG, but it might be a useful WAG.


Anyway, some comments on other stuff in this thread:

Matt said:
> The PC decodes as:
>
> (gdb) x/i 0xfffffc0000300354
> 0xfffffc0000300354 <exception_restore_regs>:    ldq     v0,0(sp)

This is likely to be a red herring.  Most, or at least many, machine
checks are asynchronous.  (There are some exceptions, and i'll go into
that -- and my theory -- in a few lines.)


Jason said:
> You also need to print out the platform-specific (i.e. KN8AE) logout area,
> which will have the ECC memory error information, etc.

I think this is the key, but not because of memory error information.
(As noted, ECC-fixable problems would have been reported as
correctable errors, not machine checks.)

I think this platform-specific information will give you lots of
useful information about your I/O controllers, which is where i'd
guess the problem lies.

In particular, note that PCI aborts, such as when doing configuration
(or other) accesses to space which doesn't exist, can yield machine
checks (typically vector 660, if i recall, but my memory is rather
fuzzy in that area).  Note that these machine checks are expected,
taken, and survived!
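
To make the "expected, taken, and survived" pattern concrete, here is a
minimal user-space sketch in C.  It is a simulation, not the kernel's
actual code: the names mc_expected, mc_received, sim_machine_check(),
sim_conf_read(), and probe_device(), the four-device bus, and the made-up
ID values are all assumptions for illustration.  The idea is only that
the prober raises a flag before touching possibly-absent space, and the
handler treats a machine check as survivable while that flag is up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Everything below is hypothetical: a user-space simulation of the pattern. */

static volatile bool mc_expected;  /* raised while a probe is in flight */
static volatile bool mc_received;  /* set by the handler if the probe faulted */

/* Simulated bus: only device numbers 0..3 answer configuration reads. */
#define SIM_NDEVICES 4

/*
 * Stand-in for the machine check handler.  Returns false when the check
 * was not expected; a real kernel would treat that case as fatal.
 */
static bool
sim_machine_check(void)
{
	if (!mc_expected)
		return false;
	mc_received = true;
	return true;
}

/* Simulated configuration read: an absent device aborts and "mchecks". */
static uint32_t
sim_conf_read(unsigned dev)
{
	if (dev >= SIM_NDEVICES) {
		(void)sim_machine_check();
		return 0xffffffffu;
	}
	return 0x12340000u | dev;	/* made-up ID value */
}

/* Probe one device; returns true and stores its ID only if it responded. */
static bool
probe_device(unsigned dev, uint32_t *idp)
{
	uint32_t id;

	mc_received = false;
	mc_expected = true;
	id = sim_conf_read(dev);
	mc_expected = false;

	if (mc_received)
		return false;	/* machine check taken, and survived */
	*idp = id;
	return true;
}
```

In this sketch probe_device(2, &id) succeeds, while probe_device(9, &id)
returns false after the simulated machine check.  The same access made
with mc_expected clear is the case a real kernel could not shrug off.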

I would expect that similarly, devices trying to DMA to SG-mapped
space where the translation is invalid (or bogus) could have a similar
effect.
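
As a rough illustration of what an invalid SG translation means, here is
a small C sketch of a scatter-gather map lookup.  The entry layout (valid
bit in bit 0, page frame number above it), the window base parameter, and
the function names are assumptions made up for this example, not the
layout of any particular I/O controller; only the 8 KB page size is
taken from the Alpha.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SG_PAGE_SHIFT	13	/* 8 KB pages, as on Alpha */
#define SG_PAGE_MASK	((UINT64_C(1) << SG_PAGE_SHIFT) - 1)
#define SG_ENTRY_VALID	UINT64_C(1)	/* hypothetical valid bit */

/* Build a (hypothetical) valid map entry for a physical page frame. */
static uint64_t
sg_make_entry(uint64_t pfn)
{
	return (pfn << 1) | SG_ENTRY_VALID;
}

/*
 * Translate a bus address through the SG map.  Returns false when the
 * address falls outside the window or hits an entry that is not valid,
 * which is the case where real hardware would have to signal an error.
 */
static bool
sg_translate(const uint64_t *table, size_t nentries, uint64_t window_base,
    uint64_t bus_addr, uint64_t *physp)
{
	uint64_t off, idx, entry;

	if (bus_addr < window_base)
		return false;
	off = bus_addr - window_base;
	idx = off >> SG_PAGE_SHIFT;
	if (idx >= nentries)
		return false;

	entry = table[idx];
	if ((entry & SG_ENTRY_VALID) == 0)
		return false;	/* invalid (or never loaded) translation */

	*physp = ((entry >> 1) << SG_PAGE_SHIFT) | (off & SG_PAGE_MASK);
	return true;
}
```

A device that keeps DMAing after its map entries have been torn down, or
that runs past the end of its mapping, lands in one of the "return
false" cases above.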


I would suggest:

(1) dumping all of the information you can from the platform-specific
logout area.  That is, not just numbers, but interpretation of numbers
too.  That and the rest of the mcheck information _should_ tell you
exactly what caused death, _if_ you can interpret it.

(2) if you can't squeeze meaningful information out of the logout
frames, i'd strongly suggest dumping all of the information you can
about device DMA state, both from I/O controller registers and kernel
data structures.  Sure, it's only a WAG, but it's a somewhat-educated
WAG, and lacking better information, DMA seems like a good candidate
for trouble, _especially_ if you have lots of DMA accesses going on.
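
By "interpretation of numbers" in (1), something along these lines would
do: a table-driven decoder that prints the names of the bits set in a
register, plus whatever is left over.  This is only a sketch in C.  The
bit assignments in sim_err_bits are invented for the example and do not
describe any real logout frame or controller register; the real ones
have to come from the hardware documentation.

```c
#include <stdint.h>
#include <stdio.h>

struct bitname {
	uint64_t	mask;
	const char	*name;
};

/* HYPOTHETICAL bit assignments, for illustration only. */
static const struct bitname sim_err_bits[] = {
	{ UINT64_C(1) << 0, "master-abort" },
	{ UINT64_C(1) << 1, "target-abort" },
	{ UINT64_C(1) << 2, "invalid-sg-entry" },
	{ UINT64_C(1) << 3, "parity-error" },
};

/*
 * Write a comma-separated list of the names of the bits set in value
 * into buf.  Bits with no table entry are reported as "unknown:0x...".
 * Output is silently truncated if buf is too small.
 */
static void
decode_bits(uint64_t value, const struct bitname *tab, size_t ntab,
    char *buf, size_t buflen)
{
	size_t i, used = 0;
	int n;

	if (buflen == 0)
		return;
	buf[0] = '\0';

	for (i = 0; i < ntab; i++) {
		if ((value & tab[i].mask) == 0)
			continue;
		n = snprintf(buf + used, buflen - used, "%s%s",
		    used ? "," : "", tab[i].name);
		if (n < 0 || (size_t)n >= buflen - used)
			return;		/* out of room; keep what fit */
		used += (size_t)n;
		value &= ~tab[i].mask;	/* remember which bits were named */
	}

	if (value != 0)
		snprintf(buf + used, buflen - used, "%sunknown:0x%llx",
		    used ? "," : "", (unsigned long long)value);
}
```

Reporting the leftover bits matters: an undocumented bit that is set is
itself a clue, and dropping it silently would hide exactly the kind of
thing you are looking for.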


Note that if you're not actually using SG-mapped DMA for the things
you're beating on, on that system, then my WAG is more WA and less
likely to be a valid G.  8-)




cgd