Subject: Re: wd, disk write cache, sync cache, and softdep.
To: Charles M. Hannum <abuse@spamalicious.com>
From: J Chapman Flack <flack@cs.purdue.edu>
List: tech-kern
Date: 12/16/2004 18:14:44
> rather have excellent performance.  Large systems have backup power and 
> backups.  They're not concerned with this level of nit-picking.

So how does the admin proceed after the failed power supply has been
replaced, the filesystems preened and restored, the last DBMS checkpoint
restored from backup and then rolled forward to point of failure from the
journal, the system put back on line, and the phone begins to ring about
some of Judge Smith's cases appearing to have the wrong defendants?
(Been there, done that sort of thing.)  How long does it take to (a)
estimate the extent of the damage and (b) sort it all back out?

The challenge for an admin is to put together a complicated system out
of many incompletely understood layers and be able to give reasonably
confident service guarantees and also be able to do reasonably effective
recovery forensics quickly when something unanticipated has happened.
There usually aren't a lot of simple, easily foreseen problems, because
the admin has foreseen them and made provisions.  It's the weird ones that
have to be figured out on the fly while management is pacing the floor,
and it just gets harder when some of the layers don't quite do what they
were documented, and relied on, to do.  The precise differences between what
any one layer says it does and what it really does may seem like nitpicking,
but those nits add up....

> And for those that are concerned with this level of detail... we have much 
> bigger problems in our file system code that need to be solved first.

It's certainly reasonable to prioritize....   :)

-Chap