Subject: Re: Hardware questions
To: NetBSD/sparc Discussion List <port-sparc@NetBSD.ORG>
From: Don Yuniskis <auryn@gci-net.com>
List: port-sparc
Date: 11/26/2001 20:32:00
>Greg Woods explained:
>
>[ On Monday, November 26, 2001 at 02:11:21 (-0700), Don Yuniskis wrote: ]
>> Well, one can argue that if you *need* parity, you're already
>> dealing with a flakey system/design!  :>
>
>Yeah, but the flaky part isn't where you seem to think it is.
>
>Dynamically refreshed RAM is inherently flaky -- by design -- and far
>worse than static RAM, or real core memory, for example....  :-)
>
>Every, and I mean EVERY, system that expects to run for more than a few
>minutes between reboots, _needs_ some form of hardware error detection
>(or even prevention), in the memory subsystem (and the cache subsystems,

You will note that many *processors* are dynamic devices... shut the
clocks off and it "forgets".

>and the bus, etc.).  It's not just the refresh circuitry that you have

The refresh circuitry is a no-brainer.  Nowadays, this is done
automatically and conservatively within the CPU/memory controller.

>to trust, but the very physics of the chip die and the packaging it is
>enclosed within.  At today's densities a single Alpha particle emitted
>from some impurity in the packaging material could wipe out several bits
>of data.  Even high-energy radiation that can penetrate the skin of your
>machine and the chip packaging can sometimes whack a bit of data.  8-bit
>parity protection of DRAM is essential, and ECC which can correct
>1-bit/word errors and detect 2-bit errors is better.
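
For reference, the "correct 1-bit, detect 2-bit" behaviour described
above can be sketched with a toy Hamming code plus one overall parity
bit.  This is only an illustration: the 4-bit data word and the names
encode/decode are assumptions made for the example, and real ECC
memory uses much wider words, but the principle is the same.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SEC-DED codeword (list index = bit position)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d          # data bits
    for p in (1, 2, 4):
        # Check bit p covers every position whose index has bit p set.
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2                     # overall parity bit
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i                           # XOR of positions holding a 1
    overall = sum(code) % 2                         # 0 while total parity is even
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume exactly one, located by the syndrome
        # (syndrome 0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome but even overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
one = list(word); one[6] ^= 1                       # single-bit upset
two = list(word); two[2] ^= 1; two[5] ^= 1          # double-bit upset
print(decode(word))   # ('ok', 11)
print(decode(one))    # ('corrected', 11)
print(decode(two))    # ('uncorrectable', None)
```

Note the cost: four check bits to protect four data bits in this toy
case.  The overhead shrinks as the word gets wider, which is why it is
applied per wide word rather than per byte.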


Now, having said all that, when was the last time you walked into a 
room to find a system that had panicked due to a parity error?  :>
The SPARCs I have here have uptimes of several weeks already.
I ran a FreeBSD box (486 with 36 bit memory) with a 300 day
uptime before finally taking it down.

Note that parity can only reliably detect single bit errors
(strictly, any odd number of flipped bits per protected byte).
And it can't do anything to *correct* them.  So, in the event of 
a parity error, the only thing the system *can* do is "stop"
(effectively).
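
A minimal sketch of what byte-wide parity buys you, assuming even
parity and one parity bit per 8 data bits (the function names here are
made up for the illustration, not taken from any real controller):

```python
def parity_bit(byte):
    """Even-parity bit for an 8-bit value: 1 if the number of set bits is odd."""
    return bin(byte & 0xFF).count("1") & 1


def store(byte):
    """Model a 9-bit memory cell: the data byte plus its parity bit."""
    return byte & 0xFF, parity_bit(byte)


def check(cell):
    """Return True if the stored parity bit still matches the data."""
    data, p = cell
    return parity_bit(data) == p


cell = store(0x5A)
print(check(cell))        # True

# Flip one data bit: the parity no longer matches, so the error is seen...
corrupted = (cell[0] ^ 0x08, cell[1])
print(check(corrupted))   # False
# ...but nothing in those 9 bits says *which* bit flipped, so there is
# nothing to repair.  All the hardware can do is signal the error.
```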

If these dreaded events that you mention *are* happening,
then these systems must be seeing even numbers of errors
occurring -- else they would be detected (and reported in
some way?)
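
To make the "even numbers of errors" point concrete, here is a rough
brute-force count over a single 9-bit cell (8 data bits plus an even
parity bit; the helper names are hypothetical).  It tries every
possible combination of flipped bits and counts how many the parity
check would catch:

```python
from itertools import combinations


def make_cell(byte):
    """Pack a data byte and its even-parity bit into one 9-bit integer."""
    p = bin(byte & 0xFF).count("1") & 1
    return (p << 8) | (byte & 0xFF)


def parity_ok(cell):
    """A cell is consistent if its 9 bits contain an even number of 1s."""
    return bin(cell & 0x1FF).count("1") % 2 == 0


def count_detected(cell, nflips):
    """Return (detected, total) over every way of flipping nflips of the 9 bits."""
    detected = total = 0
    for bits in combinations(range(9), nflips):
        bad = cell
        for b in bits:
            bad ^= 1 << b
        total += 1
        if not parity_ok(bad):
            detected += 1
    return detected, total


cell = make_cell(0xC3)
print(count_detected(cell, 1))   # (9, 9)    every single-bit error is caught
print(count_detected(cell, 2))   # (0, 36)   every double-bit error slips through
print(count_detected(cell, 3))   # (84, 84)
```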

Years ago, DRAMs were really an art to work with.  You
had to control power supplies really well (tolerances,
noise, *sequencing* -- and for *3* supplies, not just *one*!),
watch ground bounce, signal overshoot/ringing, proper
setup/hold times (or risk "blowing a row") -- in addition
to things like refresh.

Nowadays, semiconductor technology has improved to 
the point where they are far less flakey.  Longer
refresh intervals; features like self-refresh; etc.

And, integrated controllers (inside the CPU usually or
in a chipset designed just for this purpose) instead
of having to hack things together with 74S' parts
and carefully chosen series terminations.

And, of course, cheap, multilayer boards with fine-line
rules commonplace -- smaller packages so your 512KB (!)
memory array is no longer the size of a modern PC motherboard!
(I can recall how "tickled" I was when ZIP and SIP
packages were being *proposed*...  :-/)

Nowadays, I think the real risk to memory systems comes 
from the packaging (SIMM/DIMM sockets are really not very
robust -- surely not as robust as soldered down components!);
the fact that Joe Average User thinks nothing of opening
up the case and pulling/replacing parts; "overclocked"
devices; poorly vented cases; etc.

>Would you really run your SCSI devices with parity detection disabled
>even when they work fine with it enabled?
>
>Would you really run your UDP-based network services with UDP checksums
>off?  Would you turn off TCP checksums if it were possible to do so?


But these are things that extend *outside* the controlled environment
of the machine, for the most part.  Would you design a parallel
port interface that toasted itself if the printer was powered on while
the interface was powered *off*?  Of course not!  Because it is
*likely* that someone will do this in the course of normal use.

Will someone fail to have the SCSI/CAT5 cable seated properly
in its mating connector?  Will some such cable get frayed/worn
with use/abuse?  Pretty likely.  Will some transceiver (internal
or external) fail to report a collision and, thus, the receiving
node thinks the packet he just received is "OK" (hint:  SQE)?

>Look, for example, at the hardware error detection and prevention
>measures implemented in large/high-end high-availability systems
>(eg. the new Sun E15K).


And, what is the relative *cost* for adding it to those systems?
And wouldn't you agree they are "bleeding edge" designs?  I.e.
the types of things that push the technology to its limit
instead of operating safely within it?

>Why IIRC even the newer, "smaller" [:-)] Intel Pentium-III's use
>technology from DEC-Alphas to do ECC on major internal buses, etc.


How about I put it in more practical terms.  :>

Would you put your *life* in the hands of a piece of equipment 
that *didn't* use parity?  Yet, if you ever get wheeled into an
operating theatre, chances are a good portion of the equipment 
there does *not*.  And, while someone *is* watching over that 
piece of equipment, there is no guarantee that he/she will
recognize the fact that the device is reporting incorrect data.
Or displaying one value and *acting* as if it was another.

You trust the cash register at the local store to correctly
total your purchases and debit your credit card accordingly.
Have you ever seen one "crash" with a "parity error" message?

How many pharmaceuticals do you ingest in a particular
day/month/year?  Do you think there is "parity" (or ECC
in any form) ensuring that the "process" is behaving
properly?  Do you think the employee standing beside the
equipment can *visually* determine that a mistake has
been made and gone "undetected"?

I have yet to design a product that employs *any* ECC
hardware.  Yet, I can't recall ever hearing of
anomalous behaviours that might be attributable to
a "spontaneous memory corruption event".

The next product I'm working on has 256M of 60ns EDO
in it -- a "modest" amount... perhaps not large enough
to rival bigger desktop computers, etc.  It will run
24/7.  There are no provisions for detecting memory
errors.  Even if I could, there's nothing I could do 
besides latch the error and tug on the RESET line
and hope someone comes around and notices the problem.

Am I worried?  No.  But, I'm *terrified* of the fact
that some bozo will open up the box and decide to
swap SIMMs, etc.  :-(  (normally, I solder down the
memory).

I won't sweat the much maligned "alpha particle"  :>
I'm more worried about the trouble to be caused by
the larger creature -- homo sapiens!

I *would* be curious, though, to hear if others *have*
seen memory problems that truly were "defects" and not
just "poorly installed" parts, etc.

--don