Subject: Re: RAID-5 benchmark results
To: NetBSD User's Discussion List <netbsd-users@NetBSD.ORG>
From: Greg A. Woods <woods@weird.com>
List: netbsd-users
Date: 12/14/2001 13:57:47
[ On Friday, December 14, 2001 at 12:26:39 (+0000), Gavan Fantom wrote: ]
> Subject: Re: RAID-5 benchmark results
>
> On Thu, 13 Dec 2001, Greg A. Woods wrote:
> >
> > Sure there's no triple redundancy here -- but you really don't need it
> > unless you've got terribly unstable systems and power!  With a UPS on
> > production systems the risk of damage is nearly nil.
> 
> Sadly, this is not always the case. UPS systems fail. Power leads are
> knocked loose from either the UPS or the machine. People unplug the wrong
> cable while doing maintenance. PSUs fail.

No, I did not say the "risk is nil", I said "nearly nil".  You
presumably go to the cost of adding and maintaining a UPS because it
reduces the risk to your data.

You'll find out if the UPS itself fails pretty damn fast because
it'll probably be equivalent to having a power failure without a UPS. [*]

Yes, PSUs fail -- people with serious needs have seriously redundant
power supplies.  For example my hardware RAID arrays have dual hot-swap
power supplies for the controller, and one power supply for every drive
(so that a failure is equivalent to a drive failure and whatever RAID
protection you have will still protect your data).

No, my RAID arrays don't have redundant power plugs, but I don't think
my application really needs to go quite that far in disaster avoidance.
I mitigate that risk by never doing maintenance while something's
actively using their filesystems (which is usually pretty easy to do
since I'm their primary user and I haven't figured out how to be in two
places at the same time! ;-).  I would also normally mitigate that risk
by having daily backups, but my tapes are FUBAR right now....  :-(

You have to do a real and unbiased risk analysis to understand whether
or not you really need triple redundancy.  I find that computer people
like us are in particular very prone to believing we can apply technical
controls to mitigate every risk, and often without weighing the costs of
those controls vs. the risks they're mitigating.  In this thread people
have talked about not allowing write-back caching because it is risky
and then folks like you are saying that no level of risk is acceptable.
Well that's just not true.  The cost of not allowing write-back caching
might be the difference between a viable system and an unviable one.
Sometimes there may even be very little cost to losing your data,
believe it or not, and often it's those very situations where
performance is a very important factor.

Certainly there's no reason to prevent people who know what they're
doing from trading off some reliability for higher performance!

> I'm sure I'm not the only person on this list who has seen all of the
> above happen.

I've seen it all happen too, and more, but that's the point.  You have
to understand the risks and the costs before you can know whether it's
worth the cost to mitigate those risks.  Sometimes the costs aren't all
monetary.  Slow performance can be a cost too, and if the added costs of
losing your data, when weighted by the risks that could cause such a
loss, are lower than the costs associated with achieving the necessary
performance at zero risk, then you turn off write-through caching (or
turn on write-back caching, whichever is appropriate for your system)
and get back to work.
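
To make that comparison concrete, here is a rough sketch of the
arithmetic in Python.  The function names and every figure in it are
hypothetical, picked only for illustration; a real analysis would use
your own estimates and would also have to account for the non-monetary
costs mentioned above.

```python
# Rough sketch of the risk-vs-cost comparison described above.
# All names and numbers are hypothetical, for illustration only.

def expected_annual_loss(p_loss_per_year, cost_of_loss):
    """Cost of losing the data, weighted by the chance of it happening."""
    return p_loss_per_year * cost_of_loss


def enable_write_back(p_loss_per_year, cost_of_loss, cost_of_safe_performance):
    """True if accepting the risk is cheaper than buying the same
    performance at (nominally) zero risk, e.g. with more or faster disks."""
    risk_cost = expected_annual_loss(p_loss_per_year, cost_of_loss)
    return risk_cost < cost_of_safe_performance


if __name__ == "__main__":
    # Assumed: 2% chance per year of a loss that would cost 5000 to
    # recover from, versus 1500 per year to get the needed performance
    # without write-back caching.
    print(expected_annual_loss(0.02, 5000.0))
    print(enable_write_back(0.02, 5000.0, 1500.0))
```

With those assumed figures the weighted cost of a loss is far below the
cost of the safe alternative, so you would turn on write-back caching.
Raise the chance of loss to 50% a year and the answer flips.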

[*] As for having a UPS fail, well perhaps you're not using production
quality UPS units? :-)  I've got one now-dead little UPS that was
protecting a router and it turns out it's not an on-line UPS so it
wasn't possible to tell when it failed until after it was too late to do
anything about it.  However all my other UPS units are of the full
on-line type.  If I was really worried about a UPS failure damaging data
that hadn't yet been written to permanent storage then I'd be
putting dual-redundant PSUs in every system and ensuring each PSU in a
given system was connected to a different UPS.

Even then if my data were worth it I'd also have a disaster recovery
plan for those times when the whole building, city block, or even city,
were to suffer some calamity that took my data offline.

There is no such thing as 100% guaranteed security.  You have to weigh
the risks against the costs of mitigating those risks (while at the same
time maintaining the necessary performance to keep your systems viable,
something that either has added costs, or added risks which turn into
added costs up-front if you try to mitigate them or after the fact if
you suffer them).

-- 
								Greg A. Woods

+1 416 218-0098;  <gwoods@acm.org>;  <g.a.woods@ieee.org>;  <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>