Subject: Re: Hardware questions
To: NetBSD/sparc Discussion List <port-sparc@NetBSD.ORG>
From: David Maxwell <david@vex.net>
List: port-sparc
Date: 11/28/2001 11:36:39
On Wed, Nov 28, 2001 at 03:03:26AM -0500, Greg A. Woods wrote:
> [ On Monday, November 26, 2001 at 20:32:00 (-0700), Don Yuniskis wrote: ]
> > Now, having said all that, when was the last time you walked into a 
> > room to find a system that had panicked due to a parity error?  :>
> 
> On the other hand I've run many machines over the years with no parity
> or ECC RAM and this has caused me to always wonder whether or not the
> sometimes strange behaviour of various programs on those machines has
> been caused by the odd bit being flipped unexpectedly in their RAM.

Similar examples come to mind - A Cisco 7513 ran for over a year, then
panic'd without any particularly unusual network activity. Cisco sent a
replacement, and we waited for the problem to happen again to have a
reason to swap the CPU out, and it never did...

Was that a flipped bit? No way to prove it or disprove it.

> > I ran a FreeBSD box (486 with 36 bit memory) with a 300 day
> > uptime before finally taking it down.
> 
> Are you sure you even had the parity error detection circuitry enabled on
> the motherboard, and that it functioned properly?  I've seen several
> cheap PC boards which simply didn't properly implement DRAM parity error
> detection.

Every 'personal computer' 486 I ever saw came with the parity memory
setting in the BIOS defaulted to 'no'.

> > Would you put your *life* in the hands of a piece of equipment 
> > that *didn't* use parity?  Yet, if you ever get wheeled into an
> > operating theatre, chances are a good portion of the equipment 
> > there does *not*.  And, while someone *is* watching over that 
> > piece of equipment, there is no guarantee that he/she will
> > recognize the fact that the device is reporting incorrect data.
> > Or displaying one value and *acting* as if it was another.
> 
> I think you're missing out on, or conveniently ignoring, a very large
> number of important engineering factors which are combined to ensure
> there are adequate levels of checks and balances to eliminate as much
> risk as possible.

Until a software bug in the radiation therapy machine irradiates a hole
in you...

> > You trust the cash register at the local store to correctly
> > total your purchases and debit your credit card accordingly.

How many times have you had a cash register give a wrong total, and
someone says 'how did that happen?' (I was almost charged $115,000.00
for an $8.00 bottle of carpet cleaner once...)

Yes, it could be operator error, but just like with the Cisco example
above, _you can't be sure_.

Also, cash registers are a bad example to use, since their active duty
cycle is so low. Even if the hardware were highly vulnerable, the odds
that an error would hit while the register was actually in use, rather
than sitting idle, are very small (even during the upcoming annual
Christmas rush ;-)

-- 
David Maxwell, david@vex.net|david@maxwell.net --> Mastery of UNIX, like
mastery of language, offers real freedom. The price of freedom is always dear,
but there's no substitute. Personally, I'd rather pay for my freedom than live
in a bitmapped, pop-up-happy dungeon like NT. - Thomas Scoville