Subject: Re: is it hardware or software thats broken?
To: None <mouse@Rodents.Montreal.QC.CA>
From: Johan A.van Zanten <johan@giantfoo.org>
List: port-sparc
Date: 06/25/2006 16:53:21
I, johan, originally wrote:
> > Suns have ECC memory, which is much more "robust" than PeeCees.  Most
> > memory errors are detected and corrected, rather than silently causing
> > problems with running processes.

der Mouse <mouse@Rodents.Montreal.QC.CA> replied:
> Saying this is true of "Suns" is a bit misleading.  There probably are
> Suns that won't work without ECC memory.  Some support it but will work
> with basic parity memory.  Some will even take non-parity memory.  I
> don't know all the combinations, but I am quite sure that there are
> Suns that don't do ECC, and some that do ECC if you put ECC memory in
> them but will work with other memory.

 My apologies for being overly general -- of course you're correct.  I
believe that when i wrote that paragraph i was thinking "parity" but typed
"ECC."  As you've pointed out, some Suns will even take non-parity memory,
though i can't recall a machine that Sun ever shipped with non-parity
memory.

 The main idea i was failing to communicate is that Suns (generally) have
some sort of error checking on RAM, which is different from x86-type
hardware, which at the time often had none, because it was cheaper.  So
the OS running on a Sun at least has the opportunity to be told that a
stick of RAM is behaving badly, and can handle or log the error rather
than passing the bad data into userland or crashing.
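
 To make the parity idea concrete, here is a minimal sketch in Python.  It
is only an illustration of per-byte even parity, not a model of how any
particular Sun memory controller works; the names store, load, flip_bit and
ParityError are made up for this example.  The point is that a single
flipped bit gets reported instead of being handed to the program, while
plain parity can neither correct the error nor notice two flips in the same
byte (which is where ECC does better).

```python
class ParityError(Exception):
    """Raised when stored data no longer matches its parity bit (hypothetical name)."""


def parity_bit(byte):
    # Even parity: the extra bit is 1 when the data has an odd number of 1 bits,
    # so that data plus parity bit always contain an even number of 1 bits.
    return bin(byte & 0xFF).count("1") % 2


def store(byte):
    # Model a memory cell as (8 data bits, 1 parity bit).
    return (byte & 0xFF, parity_bit(byte))


def load(cell):
    # Recompute parity on read; a mismatch means some bit changed in storage.
    data, stored_parity = cell
    if parity_bit(data) != stored_parity:
        raise ParityError("parity mismatch on read")
    return data


def flip_bit(cell, bit):
    # Simulate a faulty RAM chip flipping one data bit (bit 0 = least significant).
    data, stored_parity = cell
    return (data ^ (1 << bit), stored_parity)


if __name__ == "__main__":
    cell = store(0xA5)
    print(hex(load(cell)))                 # clean read

    try:
        load(flip_bit(cell, 3))            # one flipped bit is caught
    except ParityError as err:
        print("detected:", err)

    # Two flipped bits cancel out in the parity sum and go unnoticed.
    print(hex(load(flip_bit(flip_bit(cell, 0), 1))))
```

 With no check bit at all, the read in the middle case would simply return
the wrong byte, which is the "passing the bad data into userland" scenario.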

 Although, upon further reflection, my comments are somewhat useless,
because i'm really comparing sun4m (1993-2000ish) with same-period models
of x86-based hardware.  Most people reading this would probably be
thinking of more modern x86-ish hardware that now uses ECC memory.

 And finally, just to totally invalidate my original comments, a week ago,
at my day job, we just had a problem with a Sun V240 (busy webserver),
where SunOS was randomly crashing.  It appears the problem was a bad DIMM,
but interestingly, it wasn't detected by POST (power on self test) unless
"diag-level" was set to "max."  (The kernel was throwing cryptic hardware
errors when it panicked, but they weren't naming a bad memory module.  The
panic string reminded me more of crashes involving bad Ecache on a CPU.)

 -johan