port-sparc: Re: Hardware questions

Subject: Re: Hardware questions
To: Don Yuniskis <auryn@gci-net.com>
From: Greg A. Woods <woods@weird.com>
List: port-sparc
Date: 11/28/2001 03:03:26
[ On Monday, November 26, 2001 at 20:32:00 (-0700), Don Yuniskis wrote: ]
> Subject: Re: Hardware questions
>
> Now, having said all that, when was the last time you walked into a 
> room to find a system that had panicked due to a parity error?  :>

Well, now there's the kicker, isn't it?  I've got a machine which does
log the occasional ECC "single-bit error corrected" message, but I don't
remember ever getting a panic from a 2-bit error.  However I also don't
know for sure that the machines with ECC which I've used longest have
the proper systems support software to detect and handle such a problem.
Testing error handling is sometimes quite difficult.
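
To make the "single-bit corrected, 2-bit detected" distinction concrete,
here is a minimal sketch of a SECDED code (a Hamming(7,4) code plus one
overall parity bit) in Python.  It is only an illustration: real memory
controllers typically protect much wider words, and the names `encode`
and `decode` are my own, not taken from any particular system.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of bits).

    Index 0 holds the overall parity bit; indexes 1..7 are the usual
    Hamming positions, with check bits at 1, 2 and 4.
    """
    code = [0] * 8
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) & 1          # make total parity even
    return code


def decode(code):
    """Return (status, nibble); nibble is None if uncorrectable."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos              # XOR of positions of set bits
    overall_bad = sum(code) & 1          # 1 if an odd number of bits flipped
    fixed = list(code)
    if syndrome == 0 and not overall_bad:
        status = "ok"
    elif overall_bad:
        # Odd number of flips: assume exactly one.  The syndrome is its
        # position (0 means the overall parity bit itself was hit).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble


word = encode(0b1011)
one_flip = list(word)
one_flip[5] ^= 1
two_flips = list(one_flip)
two_flips[2] ^= 1

print(decode(word))        # clean read
print(decode(one_flip))    # single-bit error, corrected
print(decode(two_flips))   # double-bit error, detected only
```

The "uncorrectable" case is the one where the system support software
has to decide what to do (log, kill a process, panic), which is exactly
the path that is hard to test on real hardware.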

On the other hand I've run many machines over the years with no parity
or ECC RAM and this has caused me to always wonder whether or not the
sometimes strange behaviour of various programs on those machines has
been caused by the odd bit being flipped unexpectedly in their RAM.

> The SPARCs I have here have uptimes of several weeks already.

I was just poking about on a customer's SPARC (a bigger, newer, E250
model IIRC, with about 2GB of ECC-protected DRAM) which has an uptime of
well over 700 days now.  Unfortunately I don't know if it has detected
any memory errors yet or not (nor whether the applications it is running
would use memory in a way that might easily mask a few rare errors).  I
don't know what the chances are that an error has occurred, but I do
know that if an undetected error did occur on it there'd be hell to pay,
even though it doesn't really operate in a life-critical situation.

> I ran a FreeBSD box (486 with 36 bit memory) with a 300 day
> uptime before finally taking it down.
> 
> Note that parity can only reliably detect single bit errors.
> And, can't do anything to *correct* them.  So, in the event of 
> a parity error, the only thing the system *can* do is "stop"
> (effectively).

Indeed -- but do you know for certain that the software you were running
would correctly notice and handle a parity error (or a 2-bit ECC error)?
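
As a small illustration of the limitation you describe, here is a sketch
in Python of one even-parity bit per byte (which is what "36 bit" memory
amounts to: four bytes of 9 bits each).  The helper names `store`,
`check` and `flip` are hypothetical, chosen just for this example.

```python
def parity_bit(byte):
    """Even-parity bit for an 8-bit value: 1 if it has an odd number of 1s."""
    return bin(byte & 0xFF).count("1") & 1


def store(byte):
    """Model a 9-bit memory cell as (data, parity)."""
    return (byte & 0xFF, parity_bit(byte))


def check(cell):
    """True if the stored parity still matches the data."""
    data, parity = cell
    return parity_bit(data) == parity


def flip(cell, bit):
    """Simulate a fault by flipping one data bit."""
    data, parity = cell
    return (data ^ (1 << bit), parity)


cell = store(0x5A)
one_flip = flip(cell, 3)
two_flips = flip(one_flip, 6)

print(check(cell))       # True: no error
print(check(one_flip))   # False: single-bit error detected
print(check(two_flips))  # True: double-bit error goes unnoticed
```

Note that even in the detected case the check only says "something in
this byte is wrong"; it cannot say which bit, so there is nothing to
correct, and something still has to act on the failed check.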

Are you sure you even had the parity error detection circuitry enabled on
the motherboard, and that it functioned properly?  I've seen several
cheap PC boards which simply didn't properly implement DRAM parity error
detection.

Also, do you know if the applications you were running on that system
were using memory in a way that would not have masked any errors that
might have occurred?  I.e. do you know that FreeBSD would have noticed
at all if that particular motherboard had reported a memory parity error
to it?  I'm not saying it wouldn't have, but not knowing the motherboard
and not knowing which release you were running, I don't know that it
would have either.

Finally, since some types of memory errors happen only as the components
reach the end of their serviceable lifetimes, perhaps your system was
still not past the prime of its service lifetime.

I don't expect DRAM errors to always occur regularly everywhere.  But
that doesn't mean I don't want an error detection (and correction)
subsystem watching carefully over all my DRAM at all times.

> And, what is the relative *cost* for adding it to those systems?

The cost of building in hardware error detection (and correction) is
often trivial compared to the costs of unpredictable, unexpected,
undetectable errors ever happening in a mission-critical system.

> And wouldn't you agree they are "bleeding edge" designs?  I.e.
> the types of things that push the technology to its limit
> instead of operating safely within it?

No, absolutely not -- from a design perspective they're really not all
that new or different on some levels from machines designed decades
ago.  All that's changed is the design implementation!  ;-)
Indeed on the particular front of error detection and correction some of
the older systems were (necessarily, of course) more advanced.

> Would you put your *life* in the hands of a piece of equipment 
> that *didn't* use parity?  Yet, if you ever get wheeled into an
> operating theatre, chances are a good portion of the equipment 
> there does *not*.  And, while someone *is* watching over that 
> piece of equipment, there is no guarantee that he/she will
> recognize the fact that the device is reporting incorrect data.
> Or displaying one value and *acting* as if it was another.

I think you're missing out on, or conveniently ignoring, a very large
number of important engineering factors which are combined to ensure
there are adequate levels of checks and balances to eliminate as much
risk as possible.

However I've no doubt that there are still many places where designers
have cut corners and purposefully mis-measured long-term costs to make
short term profits.  Sometimes the people making purchasing decisions
which can affect public safety have no clue about how to fully assess
the risks that may be inherent in a complex device due to the choices
its designers made for whatever reason (ignorance, greed, survival).
Sometimes one has no choice but to accept certain risks in order to
avoid certain doom.

> You trust the cash register at the local store to correctly
> total your purchases and debit your credit card accordingly.

No, I don't trust them actually -- what I trust more though are all the
checks and balances in the whole system from the store owner's experience
and integrity to the bank and so on, including my own careful
observations and cross-checking.

> Have you ever seen one "crash" with a "parity error" message?

Well, not "crash", but yes, I've actually seen equipment like that which
has suddenly begun to generate incorrect results, sometimes randomly or
sporadically.

> How many pharmaceuticals do you ingest in a particular
> day/month/year?  Do you think there is "parity" (or ECC
> in any form) ensuring that the "process" is behaving
> properly?  Do you think the employee standing beside the
> equipment can *visually* determine that a mistake has
> been made and gone "undetected"?

Luckily one usually doesn't have to rely on a single human backing up a
single device (though there are perhaps hundreds of thousands of
documented case histories of engineering design flaws in whole systems
which have resulted in failures and I've no doubt that a good number of
them are due to systems design flaws where only one human is in place to
check the results of a critical device that has no inherent error
detection ability itself).

> I have yet to design a product that employs *any* ECC
> hardware.  Yet, I can't recall ever hearing of
> anomalous behaviours that might be attributable to
> a "spontaneous memory corruption event".

There are many factors which could mask the conditions in which errors
could occur.  However just because a particular pattern of memory usage
reduces the risk it doesn't mean there is no risk.

> The next product I'm working on has 256M of 60ns EDO
> in it -- a "modest" amount... perhaps not large enough
> to rival bigger desktop computers, etc.  It will run
> 24/7.  There are no provisions for detecting memory
> errors.  Even if I could, there's nothing I could do 
> besides latch the error and tug on the RESET line
> and hope someone comes around and notices the problem.
> 
> Am I worried?  No.

Well, depending on what the device does, and how exactly the memory is
used in it, perhaps you should be worried.

>  But, I'm *terrified* of the fact
> that some bozo will open up the box and decide to
> swap SIMMs, etc.  :-(  (normally, I solder down the
> memory).

Hmmm -- eliminating connector problems could get rid of one of the very
large and significant factors which can contribute to hardware errors in
DRAM subsystems!

The bottom line is that sometimes hardware error detection and
correction just isn't necessary, but in a general purpose computer
system that might possibly be used in a high-reliability,
high-availability setting it's generally better to assume that hardware
errors need to be detected by the hardware itself as best as possible,
and perhaps even automatically corrected too (which is just another way
of helping predict better when a catastrophic uncorrectable error might
be coming soon).
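
To sketch what "helping predict" could look like in practice, the
following Python fragment flags a burst of corrected-error events.  It
assumes you already have the times (in seconds) at which corrected
errors were logged; the function name `error_burst` and the default
window and threshold are arbitrary choices for illustration, not values
from any vendor or operating system.

```python
def error_burst(timestamps, window=86400.0, threshold=3):
    """Return True if more than `threshold` corrected-error events
    fall within any span of `window` seconds."""
    times = sorted(timestamps)
    start = 0
    for end, t in enumerate(times):
        # Slide the window start forward until it is within `window` of t.
        while t - times[start] > window:
            start += 1
        if end - start + 1 > threshold:
            return True
    return False


# Isolated corrections, each more than a day apart: no alarm.
print(error_burst([0, 100000, 200000, 300000]))

# Four corrections within a minute: worth a closer look at that DIMM.
print(error_burst([0, 10, 20, 30]))
```

Whether a rising rate of corrected errors really does precede an
uncorrectable one depends on the failure mode, so treat a check like
this as a prompt to investigate rather than as a diagnosis.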

-- 
							Greg A. Woods

+1 416 218-0098      VE3TCP      <gwoods@acm.org>     <woods@robohack.ca>
Planix, Inc. <woods@planix.com>;   Secrets of the Weird <woods@weird.com>