Subject: Re: Crashes and load problems: Please help!
To: Bruce Lane <kyrrin@bluefeathertech.com>
From: Brian Buhrow <buhrow@lothlorien.nfbcal.org>
List: port-sparc
Date: 03/06/2001 08:18:56
	Regarding your Sparc LX: if you haven't changed the software and it's
starting to become unstable, you could have a memory chip or cache chip
that's going bad.  I've seen problems like this before on Suns, running
NetBSD, SunOS and Solaris.  The pattern I usually see is that errors start
cropping up every few days.  Then the frequency of the errors increases,
slowly at first, until the machine becomes completely unusable.  The error you
describe sounds like a cache problem more than a memory error, but it's hard
to say from the information you've given.  To test this theory, if you have
another Sun4M machine, you could transfer the hard drive from the LX to
that machine and see if it settles down.  My guess is that it will.  If you
don't have a spare machine, I'd suggest reseating the memory modules and
blowing out any accumulated dust to see if that changes the LX's behavior.  If not,
then try swapping the memory modules between slots to see if the
type of crash changes.  If none of these steps changes the behavior of the
machine, and the crashes keep getting more frequent, then it
sounds like you're shopping for a new LX or Sparc 5.
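
If it helps to judge whether the crashes really are getting more frequent,
here is a rough sketch in Python.  It assumes you note the time of each
Watchdog Reset by hand; the function names and the sample timestamps are
made up for illustration and are not taken from your machine.  It only
compares the average gap between crashes in the earlier half of the log
with the average gap in the later half, so treat it as a crude indicator.

```python
from datetime import datetime


def crash_intervals(timestamps):
    """Return the gaps, in hours, between consecutive crash times."""
    times = sorted(datetime.strptime(t, "%Y-%m-%d %H:%M") for t in timestamps)
    return [(b - a).total_seconds() / 3600.0 for a, b in zip(times, times[1:])]


def is_accelerating(intervals):
    """True if the later gaps average shorter than the earlier gaps."""
    if len(intervals) < 2:
        return False  # not enough data to compare two halves
    mid = len(intervals) // 2
    early, late = intervals[:mid], intervals[mid:]
    return sum(late) / len(late) < sum(early) / len(early)


# Hypothetical crash log, written down by hand after each reset.
crashes = [
    "2001-02-20 03:10",
    "2001-02-25 14:40",
    "2001-03-01 09:05",
    "2001-03-04 22:30",
    "2001-03-06 07:28",
]

gaps = crash_intervals(crashes)
print("hours between crashes:", [round(g, 1) for g in gaps])
print("accelerating:", is_accelerating(gaps))
```

If the gaps stay roughly constant after you reseat or swap the memory
modules, that step probably made no difference; if they stretch out, you may
have found the culprit.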

Hope this helps.
-Brian
On Mar 6,  7:28am, Bruce Lane wrote:
} Subject: Crashes and load problems: Please help!
} Fellow Sun-worshipers, ;-)
} 
} 	I've got two issues to present for dissection today. The first is a series
} of unexplained crashes on the part of my mail server. The system is a SPARC
} LX running 1.5 with Qmail, all on 64 megs RAM and a 1.2 gig drive.
} 
} 	This box was perfectly stable for at least 2.5 months, 7x24, no reboots.
} Now, for whatever reason, it will, at random intervals, suddenly halt and
} go back to the 'ok' prompt at the console with these errors (also displayed
} on the console):
} 
} 	Watchdog Reset
} 	Instruction Access Error
} 	Type 'help' for more information
} 	ok
} 
} 	Watchdog Reset
} 	Memory Address not Aligned
} 	ok
} 
} 	I've noticed that this happens just after ntpd gives a message about a
} kernel PLL status change to 41. Is there some bug in ntpd that I didn't
} read about? The mail server -is- also designated an NTP master for my domain.
} 
} 	Anyway, after a 'boot' command at the console, it starts up just fine
} (except for stopping to clean up the root filesystem on its way), and it
} will then be stable for another few days before it does it all again.
} 
} 	All of my systems are running ntpd, and they've displayed the same bizarre
} symptom.
} 
} 	The second issue revisits the SPARC 5 I was trying to load NetBSD on last
} night. I've tried two different hard drives so far, Seagate 32430N's, that
} both came out of other Sun systems (which means they have the Sun
} firmware), but the failure is very consistent. 
} 
} 	Specifically: The initial boot and installation startup go just fine, and
} then I start getting all kinds of SCSI timeout errors from the hard drive
} when the system tries to install the sets from the CD-ROM drive.
} 
} 	I'm going to try two things to address this: First, I'm going to try a
} couple of different SCSI drives. Second, I'm going to try an FTP-based
} install.
} 
} 	Input on either issue would be much appreciated. If I need to build a
} kernel for the mail box with some sort of debug tracing in it, so be it; I
} just need to know how.
} 
} 	Thanks in advance.
} 
} 
} -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
} Bruce Lane, Owner and head honcho, Blue Feather Technologies
} http://www.bluefeathertech.com  // E-mail: kyrrin@bluefeathertech.com
} Amateur Radio: WD6EOS since Dec. '77 (Extra class as of June-2K)
} "I'll get a life when someone demonstrates to me that it would be
} superior to what I have now..." (Gym Z. Quirk, aka Taki Kogoma).
>-- End of excerpt from Bruce Lane