Subject: Re: VIA VP2 chipset
To: None <current-users@NetBSD.ORG, port-i386@NetBSD.ORG>
From: Dave Rand <dlr@bungi.com>
List: port-i386
Date: 02/04/1998 21:15:13
[In the message entitled "Re: VIA VP2 chipset" on Feb  4, 23:44, Greg A. Woods writes:]
> 
> I'm not suggesting that we be so pedantic as to calculate every move
> three times over like NASA does when the lives of astronauts are at
> stake, but since we do know that DRAMs are by their very nature unstable
> and error prone, and since we do have the technology to detect the most
> common form of error they suffer, and even to correct most of them, why
> wouldn't we use it even if it costs a few dollars more?????
> 

Anyone remember the uucp system run by Pyramid in the bay area?

This was run on a very robust system, including ECC memory.  For many
hours (as I recall, it was over a weekend), messages sent through
pyramid would be corrupted.  Much head scratching.  UUCP used a reliable
transport protocol.  Messages arriving on UUCP or via the Internet had
equal probability of being corrupted.  The system wasn't down.

The cause?  The cache memory had failed.  Small amounts of high-speed
cache cannot (generally) be error checked or error corrected without
the addition of some serious time penalties.  Would you trade a 2x to 4x
performance decrease for a more reliable system?  This is the real cost.

The additional $10 for logic is minor.  The additional 20% to 30% for
more DRAM - well, that's a bit of a problem.  The additional performance
degradation for an ECC cache - I don't think so.
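
For a rough feel of where a DRAM overhead figure like that can come
from, here is a small sketch.  It assumes a textbook Hamming code
extended with one overall parity bit (SECDED); that choice of code and
the word widths listed are assumptions for illustration, not something
stated above, and real memory controllers may organise their check bits
differently.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits needed for single-error-correct, double-error-detect.

    A Hamming code needs the smallest r with 2**r >= data_bits + r + 1;
    one extra overall parity bit adds double-error detection.
    """
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1


# Word widths chosen purely for illustration.
for width in (8, 16, 32, 64):
    extra = secded_check_bits(width)
    print(f"{width:2d} data bits: {extra} check bits, "
          f"{100 * extra / width:.1f}% extra DRAM")
```

Under these assumptions a 32-bit word needs 7 check bits, about 22%
more DRAM, and the overhead shrinks as the protected word gets wider.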

"Well, yes" says the sales-droid, "this system does run at less than
half the benchmark performance of our competition, and it does cost
nearly 50% more.  But it is *so* much more reliable.  No!  Wait!
Really, it is!  Come back!"

Adding ECC does have value, of course, and I'm being a bit silly.
But only a bit.  Consider the reliability of even the low-end systems
today.  I have several NetBSD systems running, with uptimes in the
hundreds of days.  Alpha.  PCs.  Even PC532s.

None of them have parity or ECC.  With a reasonable system design and
good margins, it is my opinion that parity simply decreases the overall
system reliability.  More gates, more power, more heat, tighter timing
requirements... it's not worth it.

As memory size goes up, ECC *does* make sense, however.  With 512M to 1G of
RAM, even though ECC may add 50% to the access time of the DRAM, it seems
worth it.  Until I have to pay for the 30% additional memory required to run
it.  On personal systems?  Who cares.  For production use?  Maybe.
Redundant systems seem like a better way, though.

But there are alternatives.  The memory can be software ECC'ed.  Particularly
on old, about to be reclaimed pages.  30% overhead for this kind of makes
sense.  ECC on the kernel memory?  Why not!  ECC on read-only pages?
Seems like a good thing.  Sweep the pages in the background, or in idle.
Run it in check-and-report-only for a few weeks, then check-and-fix for
a few months, and see what happens...
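
A minimal sketch of what such a sweep could look like, in user-level
Python rather than kernel code.  Everything here is hypothetical:
the names page_code and scrub, the 4096-byte page size, and the code
itself, which is a simple per-page scheme (XOR of the positions of all
set bits, plus one parity bit) and not anything proposed above.  It can
locate a single flipped bit per page and flag a double flip; three or
more flips may be misreported, so treat it as an illustration only.

```python
PAGE_SIZE = 4096  # bytes; assumed page size for illustration


def page_code(page):
    """Return (syndrome, parity) for a page of bytes.

    syndrome: XOR of (bit_index + 1) over every set bit in the page.
    parity:   number of set bits modulo 2.
    """
    syndrome = 0
    parity = 0
    for byte_index, byte in enumerate(page):
        for bit in range(8):
            if byte & (1 << bit):
                syndrome ^= byte_index * 8 + bit + 1
                parity ^= 1
    return syndrome, parity


def scrub(page, stored, fix=False):
    """Check one page (a bytearray) against its stored code.

    With fix=False this is check-and-report-only; with fix=True a
    single flipped bit is repaired in place.
    """
    syndrome, parity = page_code(page)
    diff = syndrome ^ stored[0]
    if diff == 0 and parity == stored[1]:
        return "ok"
    if parity != stored[1] and 0 < diff <= len(page) * 8:
        # Parity changed and diff names a valid position: one bit flipped.
        if fix:
            pos = diff - 1
            page[pos // 8] ^= 1 << (pos % 8)
            return "corrected"
        return "correctable"
    # Parity unchanged but syndrome moved (or no valid position).
    return "uncorrectable"


# Usage: record the code while the page is known good, sweep later.
page = bytearray(b"\x5a" * PAGE_SIZE)
code = page_code(page)
page[100] ^= 0x08                    # simulate one flipped bit
print(scrub(page, code))             # report-only pass
print(scrub(page, code, fix=True))   # fix pass
```

The stored code is only a few bytes per page, so the memory cost of
this particular scheme is small; the real cost is the CPU time spent
sweeping, which is why running it in the background or in idle matters.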

I looked at this many years ago, when memory was a lot more flaky than
today.  Anyone up for it?

-- 
Dave Rand
dlr@bungi.com
http://www.bungi.com