Subject: Re: disks write-back cache
To: Jason Thorpe <thorpej@wasabisystems.com>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: tech-kern
Date: 04/28/2003 23:02:37
On Sat, Apr 26, 2003 at 10:00:44AM -0700, Jason Thorpe wrote:
> 
> On Saturday, April 26, 2003, at 09:05  AM, Manuel Bouyer wrote:
> 
> >This causes problems for filesystems or applications that take measures
> >to prevent problems in case of an unexpected reboot (e.g. FFS, or
> >sendmail). Until today I thought I was safe when using SCSI disks.
> >I think the kernel should print a warning when it probes a disk with
> >the write cache enabled.
> 
> Perhaps the kernel should print the cache enabled status of the drive 
> with the autoconfiguration messages?

This is close to the "warning" I suggested :)
However, for IDE this is useless: the cache parameters are not savable.

> 
> >I did some benchmarks here, and it seems tagged queuing mostly hides the
> >improvement from the write-back cache. On two different filesystems (on
> >top of RAID-1 RAIDframe devices), I see a performance decrease of 2-5%
> >with the write-back cache disabled, writing a 640MB file (tested on
> >different servers; the decrease depends on the disk model, and maybe on
> >filesystem parameters).
> 
> The tagged queueing thing is interesting.  It's actually a bit more 
> complicated than you describe.  The problem is that not all drives 
> allow commands to be reordered, so effectively every tag is an ordered 
> tag.  I believe the command ordering behavior is adjustable with a mode 
> page setting, but I don't remember which one.

The "queue algorithm modifier" bits of the Control mode page, maybe?
SPC-3 says that the default value should be "restricted reordering", which
really means no reordering.
I should try changing this, to see if it improves performance.
For a large sequential write, I'm not sure it would make much difference.

> 
> Anyway, if the drive isn't going to reorder commands, then your 
> performance can be really bad with the w/b cache disabled.  We should 
> probably have some dkctl(8) settings that allow tuning these other 
> kinds of disk parameters.

I don't know if this is general enough to be in dkctl, and it affects all
devices implementing tagged queuing, not only disk devices.
Maybe it would be better in scsictl (and implementing it in scsictl wouldn't
require additional code in the kernel).

> 
> Also note that some drives will suffer tag starvation if you enable 
> command reordering, e.g. it will wait "forever" to complete simple-tag 
> commands because it's stupid :-)  The way to work around this is to 
> periodically send an ordered-tag command to the drive (or maybe even 
> when the number of openings on the drive crosses some low-water mark).

I can imagine this happening once a minimum number of commands (probably 3)
is queued. Note that this can also happen with read commands.
We can count the commands and issue an ordered tag every N commands, with the
counter being reset when no commands are queued.
I'll look at this; it should be easy to implement.

> 
> >Or maybe put it in the filesystem layer, at mount time?
> 
> Well... Another idea might be to make the file systems w/b cache-aware. 

I've been thinking about this since sending my original mail.

>  I've mentioned this idea to a few people before, but no one seems to 
> think it's necessary.  Anyway, the idea is that you make the file 
> system issue cache flushes at its own barrier points (either explicitly 
> with a separate command, or by setting a flag in its I/O request which 
> causes the disk driver to do so at the end of that I/O).

A flag is probably better, so that the driver knows about the barrier at I/O
time (it may be able to do some optimisations).
We also need to define the behavior of the write barrier: should all
writes queued before the barrier complete before it, and all writes
queued after it complete after it, or can the barrier write itself be
reordered with the previously queued I/O?
The first behavior would require two cache flushes per I/O barrier: one
before the barrier write, so that earlier writes reach the media first,
and one after it, so that the barrier write itself is stable before any
later write is issued.

Actually, this would be very useful for IDE drives, and for SCSI drives
without tagged queuing.
For modern SCSI, using tagged queuing with the write-back cache disabled
and appropriate mode page settings is probably the most efficient approach.
But for IDE, we have little choice other than issuing cache flushes.
Disabling the write-back cache has a very large impact on
performance (and is not appropriate for e.g. swap partitions). Implementing
tagged queuing isn't going to solve it, because the way disconnect/reselect
has to be done on IDE isn't efficient at all (the host has to poll the drive).

Does anyone know how other OSes handle this, especially for IDE?
The problem becomes more and more important as disk caches grow.

> 
> However, that's a lot of work, so issuing a warning might not be a
> horrible idea... but it might be annoying to see them all the time.

We could add a sysctl to disable the warning, and add an rc.d script, run
between fsck and root, which uses dkctl to manage the cache settings of the
disks, and sets this sysctl.

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 24 years of experience will always make the difference
--