Subject: Re: Decoding machine checks...
To: Matt Thomas <matt@3am-software.com>
From: Kevin P. Neal <kpneal@pobox.com>
List: port-alpha
Date: 09/13/2003 01:30:35
On Fri, Sep 12, 2003 at 09:12:01PM -0700, Matt Thomas wrote:
> 
> On Friday, September 12, 2003, at 06:09 PM, Sean Davis wrote:
> >Machine checks typically mean hardware is dying somewhere, from
> >everything I've read.
> 
> You haven't read enough.  machine checks happen when the processor
> encounters an unrecoverable error (no kernel stack avail, etc.)
> 
> 660 usually means you tried to access memory that isn't there
> or in a manner it doesn't support.

Wouldn't the middle of a dump be an odd time for that to happen?

Is it possible that one of the four sticks of memory is the wrong
type? (I didn't put the machine together, and I didn't take it
apart when I got it.)

Also, wouldn't "processor correctable error" messages tend to go
along with the "dying hardware" theory? These "correctable" errors
sometimes cause a crash, perhaps 1 in 100 times.

I'll see if I can get the initial panic message next time it
happens. 


Here's the tail end of a connection to the box after a crash. I was
looking at the dmesg output right up until it crashed again.

Warning: received processor correctable error.
Warning: received processor correctable error.
Warning: received processor correctable error.
rune% dmesg | grep Warning | wc -l
     164
rune% Read from remote host rune: Connection reset by peer

-- 
Kevin P. Neal                                http://www.pobox.com/~kpn/

   "I like being on the Daily Show." - Kermit the Frog, Feb 13 2001