Subject: Re: wd, disk write cache, sync cache, and softdep.
To: Jason Thorpe <>
From: Daniel Carosone <>
List: tech-kern
Date: 12/17/2004 10:23:24
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Dec 16, 2004 at 02:04:04PM -0800, Jason Thorpe wrote:
> I think you are going down the road of "really bad performance" with
> this.

Well, maybe, but there are a few points here.

The first is that we may have become accustomed to unreasonable
performance because of write-cache being on by default. Steve's
comments about being able to do the wrong thing as quickly as you like
apply here.

Secondly, I run all my drives with write-cache off, and have spent
quite a bit of time watching the result.  The performance impact is
certainly noticeable, but not unbearable. There is a clear area that is
the worst case (many small writes, such as when doing a cvs update of
pkgsrc+src+xsrc), and that seems to be mostly a matter of idle
interface time because only one command can be outstanding at a time
without tags.

Thirdly, the tuning and performance aspect seems to me to lie in the
selection of appropriate criteria to trigger a cache flush, balancing
the ability to flush once for a "large enough" set of writes that can
stream quickly into cache against making the first of those requests
wait "too long" in the completion queue.  Having the completion queue
at all is a correctness measure in the face of volatile cache.
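
A minimal sketch of such trigger criteria, as a model rather than
driver code.  The class name, both thresholds and their default values
are assumptions made up for illustration; nothing here is measured or
taken from wd.

```python
from dataclasses import dataclass


@dataclass
class FlushPolicy:
    """Hypothetical flush-trigger criteria; the defaults are illustrative only."""

    max_pending: int = 8        # "large enough" set of cached writes
    max_wait_ms: float = 20.0   # "too long" for the oldest write to wait

    def should_flush(self, pending: int, oldest_age_ms: float) -> bool:
        """Return True when a sync cache command should be issued now."""
        if pending == 0:
            return False  # nothing in the completion queue to confirm
        # Flush when the batch is big enough, or the first write has waited too long.
        return pending >= self.max_pending or oldest_age_ms >= self.max_wait_ms
```

Raising max_pending trades latency of the first request for fewer
flushes; lowering max_wait_ms does the opposite.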

Finally, it's about getting the most out of cheap hardware - not just
the most performance but the most reliability as well.  I can always
spend more on SCSI disks or NVRAM-backed controllers if my needs
demand it.

> What if the drive has a non-volatile cache?

Then we don't need this mechanism, and don't use it, just the same as
I'd turn the knob off if I was happy with fast-and-loose uncommitted
writes.  Even better if we can detect this automatically at probe time
and configure the driver defaults accordingly.

Have you ever seen an IDE disk (rather than a RAID controller) with NV
cache?  If so, I'd like to know because I'll buy some.

> Why issue two commands for this when one will do (in the case of SCSI
> and the FUA bit).

This is where the performance tuning aspect comes in.  The idea is not
to issue two commands for one, but instead nine commands for eight
writes, or fifty-one commands for fifty writes.
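
The arithmetic can be sketched as follows.  The function name and the
fixed batch size are assumptions for illustration; a real driver would
flush on whatever criteria it is tuned with, not a fixed count.

```python
def commands_issued(writes: int, batch: int) -> int:
    """Total commands sent: every write, plus one sync cache per batch.

    `batch` is a hypothetical fixed number of writes per flush; the last
    batch may be partial and still needs its own sync cache.
    """
    if batch <= 0:
        raise ValueError("batch must be positive")
    if writes <= 0:
        return 0
    flushes = -(-writes // batch)  # ceiling division
    return writes + flushes
```

With this model, commands_issued(8, 8) is 9 and commands_issued(50, 50)
is 51, while the degenerate commands_issued(1, 1) is 2.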

Where there's enough concurrency available in wd's input BUFQ, we get
to stream all of these commands quickly to the disk.  We get to let the
disk controller do its internal sorting and scheduling, so the blocks
actually get written out optimally rather than forcing lots of extra
seeks as we would without write cache.  *And* we get to be sure they're
done when we say they are.

In the write-cache-off case, small writes hurt doubly - we ping-pong
the IDE channel with one small outstanding request at a time, Kermit
style, and we ping-pong the drive heads (and/or rotational delay) if
the blocks being written aren't sequential.  This proposal gives us
something akin to TCP's sliding window. SCSI TCQ gives us the
equivalent of SACK (or a reliable datagram protocol, really).

The completion of the sync cache (which is the ack that advances the
receive window) will be very quick if all the writes are nicely
clustered, and even if they're not it will still be much quicker than
1xRTT per write in non-optimal order.  Remember that the drive will be
writing those cached writes out to disk in the background even before
it gets a sync cache command.

Performance tuning, via the selection of criteria of when to issue
sync cache commands, is akin to deciding how large the window should
be, and whether to delay "acks" in the hope of more writes arriving.

In the degenerate case where there's no concurrency in the BUFQ, and
only one request to do, yes we will issue two commands to the drive.
But such a system isn't very busy and won't notice the overhead, I

> >I wasn't aware that there was a problem for SCSI disks, as Bill
> >suggests, by the way. I don't know if the filesystems are smart
> >enough, or have the interface to indicate, that a given write should
> >be done as a tagged or ordered command, or whether they just wait to
> >issue later writes until earlier ones are returned.
> You don't have to care about this.  The application (in this case, FFS
> or whatever) enforces its own barriers of this type, by simply not
> starting an I/O before its dependant I/O is known to have completed.
> There is no need for us to fiddle with SIMPLE and ORDERED tags... just
> let the driver issue SIMPLE all it wants.

That's fine then; this proposal makes wd behave exactly like sd, as
seen from the BUFQ - just with completion of outstanding write
requests in bursts rather than individually according to internal
tags.  A write is complete when wd calls biodone for it, and not before.
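
A toy model of that burst completion, not kernel code.  The class and
method names are hypothetical, and the biodone argument is just a
callback standing in for the real biodone.  It assumes, as with untagged
IDE, that only one command is outstanding at a time, so no write can be
accepted between issuing a sync cache and seeing it complete.

```python
class CompletionQueue:
    """Holds writes the drive has cached but not yet confirmed as on disk."""

    def __init__(self, biodone):
        self._biodone = biodone  # callback standing in for biodone
        self._pending = []

    def write_accepted(self, buf):
        """The drive took the write into volatile cache; do not report it yet."""
        self._pending.append(buf)

    def sync_cache_done(self):
        """The sync cache completed: every queued write is now on stable storage."""
        done, self._pending = self._pending, []
        for buf in done:
            self._biodone(buf)  # completion is reported in one burst
        return len(done)
```

Seen from above the driver, each request still completes exactly once;
only the timing differs from sd.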

PS: I've been talking about wd, but the same would apply to ld if
there are raid controllers with volatile cache, or that use the
volatile cache in the drives (ataraid?).

