Subject: Re: wd, disk write cache, sync cache, and softdep.
To: Steven M. Bellovin <smb@research.att.com>
From: Charles M. Hannum <abuse@spamalicious.com>
List: tech-kern
Date: 12/16/2004 22:31:15
On Thursday 16 December 2004 22:03, Steven M. Bellovin wrote:
> Examples are fine; the trick is to figure out the right answer(s) for
> the important cases, notably FFS.  (You're quite correct that in
> general, *two* synchronize requests are required for each critical
> block -- one to make sure that everything ahead of it is flushed, and
> one to ensure that the critical block itself is written immediately.)

Really, this boils down to globally serializing I/O again.  To be blunt, any 
such idea is a non-starter.  The performance is so phenomenally bad in normal 
cases that it simply cannot be shipped.  (Been there, done that.)  You will 
*not* be providing users with a more "robust" system, because they will 
simply switch to something else that performs much better.
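
Just so we're arguing about the same thing, here is roughly what the 
quoted two-sync scheme amounts to, sketched as userland C against a raw 
disk.  This is illustrative only -- the names are mine, in-kernel code 
would drive the wd(4) flush path directly rather than going through 
ioctl(2), and I'm assuming the DIOCCACHESYNC ioctl from <sys/dkio.h> as 
the flush primitive:

#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/dkio.h>	/* DIOCCACHESYNC -- assumed available */
#include <err.h>
#include <unistd.h>

static void
flush_cache(int fd)
{
	int force = 0;	/* don't force a flush if caching is disabled */

	if (ioctl(fd, DIOCCACHESYNC, &force) == -1)
		err(1, "DIOCCACHESYNC");
}

/*
 * Commit one critical (metadata) block with the two synchronize
 * requests described above.  Nothing else can usefully be queued
 * while either flush is outstanding, which is why this amounts to
 * globally serializing I/O.
 */
static void
commit_critical_block(int fd, const void *buf, size_t len, off_t off)
{
	flush_cache(fd);	/* 1: drain everything ordered before us */
	if (pwrite(fd, buf, len, off) == -1)
		err(1, "pwrite");
	flush_cache(fd);	/* 2: push the critical block itself to media */
}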

Keep in mind that we're not just talking about the performance of many 
transactions at once.  In that case there is self-limiting behavior (as 
there is with the loss of I/O sorting inherent in the use of tagged 
queueing) that kicks in at a certain point.  The critical problem is what 
this does to performance in the presence of a *small* set of 
transactions, as is typical for, say, a desktop system: with little 
concurrency, there is nothing to overlap the flush stalls with.
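
To put rough (illustrative, not measured) numbers on that: a cache flush 
on a 7200 RPM ATA disk can easily cost 10-20ms once rotational latency 
and draining the cache are counted.  At, say, 15ms per flush and two 
flushes per transaction, a serialized single-threaded workload tops out 
around

	1 / (2 * 15ms) =~ 33 transactions/sec

no matter how fast the CPU or the rest of the I/O system is.  Unpacking a 
source tree that creates a few thousand files goes from seconds to 
minutes.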

> I've often argued that it's pointless to do the wrong thing quickly,
> but people should at least know the tradeoffs.

You're assuming in that statement that systems are currently doing "the wrong 
thing."  To make such an assertion, though, you would have to define what 
"the wrong thing" is.

ATA disks, for example, guarantee (as strongly as they make any other 
guarantee, including whether you'll be able to read back the data at all) 
that all blocks cached for writing will eventually be written out, even 
if the drive has to remap them to spare sectors to do so.  Yes, there are 
things that violate this guarantee -- mostly catastrophic failures that 
would make the drive unreadable anyway.  The one major exception is power 
loss, but most "critical" systems have backup power.

The fundamental question here is what risk you're willing to trade off for 
what performance.  In general, the actual risk is quite low, and users would 
rather have excellent performance.  Large systems have backup power and 
backups.  They're not concerned with this level of nit-picking.

And for those who are concerned with this level of detail... we have much 
bigger problems in our file system code that need to be solved first.