Subject: Re: write cache on ATA drives
To: Jason R Thorpe <thorpej@wasabisystems.com>
From: Thor Lancelot Simon <tls@rek.tjls.com>
List: tech-kern
Date: 12/09/2002 19:02:21
On Mon, Dec 09, 2002 at 02:31:38PM -0800, Jason R Thorpe wrote:
> On Mon, Dec 09, 2002 at 11:27:25PM +0100, Manuel Bouyer wrote:
> 
>  > This would be expensive. I think you just need to know *some* commands are
>  > written to disk, but not all of them. So it's probably better to have this
>  > information per command, instead of barriers that would flush the whole cache.
> 
> Except, when you write an update to an inode that changes the file size
> indication, you bloody well want to make sure that the data blocks you
> just wrote are on the platter.
> 
>  > Hum, I think with tagged queuing, the command won't complete before
>  > data is on the media.
> 
> Are you *sure*?  I'm not certain that this is actually a requirement,
> and I have some empirical data which suggests that it's not the case.  I.e.
> I have disks where tagged queueing is being used, and write speed goes WAY
> up when the write-cache is enabled.

If the write cache is enabled, typically the drive will return completion
for tagged commands when the data is in the cache and marked as dirty,
rather than when the data's actually been written to the disk.  Since the
default tag-reordering policy *even for simple tags* is "write in order",
what's going on is, basically, that you're letting the drive reorder
commands and find efficiency that wasn't there before; otherwise, with
enough tagged commands of sufficient length to fill the cache, there's no
difference.

Note that with the number of outstanding tags we use with most devices and
our maximum command size, explicitly turning on the write cache may allow 
more data to be "in flight" at once by freeing up tag slots, but we could
fix that by sending longer commands or by using more tags.  Leaving the
write cache off doesn't mean the drive's RAM goes unused; you may just
have a harder time filling it.
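
To put a rough number on "in flight": with the write cache off, the most
data the host can have outstanding at the drive is bounded by the number of
tag slots times the largest single command.  A minimal sketch of that
arithmetic follows; the function name and the figures in the note below are
illustrative assumptions, not values taken from any driver.

```c
#include <stddef.h>

/*
 * Upper bound on the data the host can have outstanding at the drive
 * when every tag slot is occupied by one maximum-size command.
 * "ntags" and "max_xfer_bytes" are hypothetical parameters.
 */
size_t
max_inflight_bytes(unsigned int ntags, size_t max_xfer_bytes)
{
	return (size_t)ntags * max_xfer_bytes;
}
```

Assuming, say, 32 tags and 64 KiB commands, the bound is 2 MiB; doubling
either the tag count or the command size doubles it, which is the sense in
which longer commands or more tags would let us fill the drive's RAM
without WCE.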

What we _should_ do, since we do write B_SYNC data with ordered tags (and
ordered tags force all pending simple tags to complete), is to change the
tag-ordering policy to "reorder at will" when the sd driver first
encounters the disk.  That'd work for softdep and it'd work for non-softdep
too, because the disk would still not lie about when the data was in
stable storage, which is *not* the case when you turn on WCE.
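
For concreteness, the knob in question is presumably the Queue Algorithm
Modifier field of the SCSI Control mode page (page 0x0a, byte 3, upper four
bits), where 0 means restricted reordering and 1 means unrestricted
reordering.  The sketch below only edits a page image already fetched with
MODE SENSE; issuing the MODE SENSE and MODE SELECT is left out, and the
macro and function names are made up for illustration rather than taken
from the sd driver.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the values follow the SCSI Control mode page layout. */
#define CTL_PAGE_CODE		0x0a
#define CTL_QAM_MASK		0xf0	/* byte 3, bits 7-4 */
#define CTL_QAM_SHIFT		4
#define CTL_QAM_RESTRICTED	0x0
#define CTL_QAM_UNRESTRICTED	0x1

/*
 * Set the queue algorithm modifier in a Control mode page image to
 * "unrestricted reordering", leaving the other bits of byte 3 alone.
 * Returns 0 on success, -1 if the buffer is not a Control mode page.
 */
int
control_page_allow_reordering(uint8_t *page, size_t len)
{
	if (page == NULL || len < 4)
		return -1;
	if ((page[0] & 0x3f) != CTL_PAGE_CODE)
		return -1;
	page[3] = (uint8_t)((page[3] & ~CTL_QAM_MASK) |
	    (CTL_QAM_UNRESTRICTED << CTL_QAM_SHIFT));
	return 0;
}
```

The modified image would then be written back with MODE SELECT.  Note that
this leaves WCE (in the Caching mode page) untouched, so completion still
means the data is on the media.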

Empirical evidence suggests that it may be a good idea, when running with
that policy, to send an ordered tag every N commands just to guarantee
that commands don't linger in the cache _forever_, never completing at
all because it's not convenient to move the head over there.  But that
should not be so hard to arrange either.
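
A rough sketch of how that might be arranged: count the simple-tagged
commands sent since the last ordered tag and promote every Nth one.  The
structure and function names are hypothetical, and N ("interval") is a
tunable whose right value would have to be found by experiment; 0x20 and
0x22 are the SCSI SIMPLE and ORDERED queue tag message codes.

```c
#include <stdbool.h>
#include <stdint.h>

#define TAG_SIMPLE	0x20	/* SIMPLE QUEUE TAG message */
#define TAG_ORDERED	0x22	/* ORDERED QUEUE TAG message */

/* Hypothetical per-disk state. */
struct tag_policy {
	unsigned int interval;		/* N; 0 disables periodic ordered tags */
	unsigned int since_ordered;	/* commands sent since last ordered tag */
};

/*
 * Pick the tag type for the next command.  Synchronous writes always get
 * an ordered tag; otherwise every interval-th command is promoted to an
 * ordered tag so nothing queued before it can linger indefinitely.
 */
uint8_t
choose_tag(struct tag_policy *tp, bool sync_write)
{
	if (sync_write ||
	    (tp->interval != 0 && tp->since_ordered + 1 >= tp->interval)) {
		tp->since_ordered = 0;
		return TAG_ORDERED;
	}
	tp->since_ordered++;
	return TAG_SIMPLE;
}
```

Resetting the counter on a B_SYNC write avoids sending a redundant ordered
tag right after one that was needed anyway.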

Thor