Subject: Re: Extension of fsync_range() to permit forcing disk cache flushing
To: Jason Thorpe <thorpej@shagadelic.org>
From: Thor Lancelot Simon <tls@rek.tjls.com>
List: tech-kern
Date: 12/17/2004 17:06:15
On Fri, Dec 17, 2004 at 07:41:22AM -0800, Jason Thorpe wrote:
> 
> On Dec 17, 2004, at 1:08 AM, Manuel Bouyer wrote:
> 
> >Also note that you may not want to flush the cache in all cases. For 
> >example,
> >with SCSI tagged queuing, a write barrier would be enough to meet this
> >constraint.
> 
> Not true.  When a SCSI tagged write is "complete", it may only be in 
> the drive's cache.  If you really really really want it to be on the 
> platter, you need to issue SYNCHRONIZE CACHE or use FUA on the 
> individual commands.

Yeah, but there's an 800lb gorilla here that everyone except for
Charles seems to be ignoring.

You can run SCSI drives with WCE (write cache enable) turned off, and
because they have efficient support for multiple outstanding tagged
commands, they can get acceptable write performance (with carefully
crafted applications and filesystems) *without* having to allow writes
to be marked as "complete" without being committed to stable storage.

That is, with SCSI disks, you can get acceptable performance for streams
of small writes without letting the drive lie about whether the data is
in cache or on the disk.
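
For reference, WCE is a single bit in the SCSI Caching mode page (page
code 0x08, byte 2, bit 2).  A minimal sketch of checking it, assuming you
already have the raw page bytes from a MODE SENSE; the function name and
the sample buffers are made up for illustration, and actually fetching
the page from a drive is left out:

```python
# Sketch only: the byte layout follows the SCSI Caching mode page (page
# code 0x08); the function name and the sample buffers are hypothetical.

CACHING_PAGE_CODE = 0x08
WCE_BIT = 0x04  # byte 2, bit 2: Write Cache Enable


def write_cache_enabled(page: bytes) -> bool:
    """Return True if WCE is set in a raw Caching mode page.

    `page` must start at the page itself, with the mode parameter header
    and any block descriptors already stripped.
    """
    if len(page) < 3:
        raise ValueError("caching mode page too short")
    if page[0] & 0x3F != CACHING_PAGE_CODE:  # low 6 bits hold the page code
        raise ValueError("not a caching mode page")
    return bool(page[2] & WCE_BIT)


# Made-up sample pages: page code 0x08, page length 0x12, then 18 bytes.
wce_on = bytes([0x08, 0x12, WCE_BIT]) + bytes(17)
wce_off = bytes([0x08, 0x12, 0x00]) + bytes(17)

print(write_cache_enabled(wce_on), write_cache_enabled(wce_off))
```

With WCE clear, the drive is not supposed to report a write as complete
until it is on stable storage, which is the mode described above.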

With IDE disks, you don't have disconnect/reconnect, multiple outstanding
tagged commands, or ordered tags to serve as write barriers.  Your *only
option* to allow the drive firmware to handle multiple commands at once,
potentially ganging them up for efficient long writes, or reordering them
for lower average latency, is to turn on the write cache.  If you don't
turn it on, you get performance like a SCSI disk without tagged queueing:
dreadful, often even for long writes, if you can't respond fast enough to
the completion of one command to get the next one there before you miss
a platter rotation.

The situation is simply not comparable, because the design of IDE is so
broken that you simply can't get acceptable write performance without
allowing the drive to claim it's completed writes when it hasn't.  That
is not the case for SCSI.

With newer SATA disks, *if we supported tagged command queueing*, which
we don't, and only with controllers that actually supported it, we could
do the right thing here.  But for the time being, with the hardware that's
actually available, the only way you can let the drive see enough of the
I/O stream at once for performance to not suck is to turn on that write
cache; and letting arbitrary user applications flush the whole thing will
suck, too.

Again, it's not just small writes -- it's all writes, because the drive
needs to see a lot more than 128K at a time (the max for the *one*
outstanding IDE command you get) to actually not miss rotations between
writes.
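
To put a rough number on that, here is a back-of-the-envelope sketch.  It
assumes a 7200 RPM spindle (an example value, not something from this
thread) and the worst case described above: the next command always
arrives too late, so at most one 128K command completes per revolution.
The function name is hypothetical.

```python
# Rough model only: 7200 RPM is an assumed example spindle speed, and the
# model assumes each command's data fits within a single track.

MAX_IDE_CMD_BYTES = 128 * 1024  # the 128K per-command limit cited above


def missed_rotation_ceiling(cmd_bytes: int, rpm: float) -> float:
    """Best-case bytes/second if only one command completes per revolution."""
    revolutions_per_second = rpm / 60.0
    return cmd_bytes * revolutions_per_second


ceiling = missed_rotation_ceiling(MAX_IDE_CMD_BYTES, 7200)
print(f"{ceiling / 2**20:.1f} MiB/s")  # prints "15.0 MiB/s"
```

Under those assumptions the ceiling is 15 MiB/s no matter how fast the
media itself can stream, which is why long writes suffer too.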

The reason people generally don't lose data with this, even though write
caching is the default for basically every IDE drive made, is that they
put their servers on UPSes.  That way they can tolerate the disk lying to
them, and still get performance that does not suck.

-- 
 Thor Lancelot Simon	                                      tls@rek.tjls.com

"The inconsistency is startling, though admittedly, if consistency is to be
 abandoned or transcended, there is no problem."		- Noam Chomsky