Subject: wd, disk write cache, sync cache, and softdep.
To: tech-kern@netbsd.org
From: Daniel Carosone <dan@geek.com.au>
List: tech-kern
Date: 12/16/2004 18:13:42

I've written before about drive write cache issues, and about the mess
that can be made of a filesystem by the reordering or non-completion
of acknowledged writes that might happen with write cache on when
power fails.  This gets much worse, obviously, with softdep, which
depends even more heavily on ordering assumptions.

I've come to the general conclusion that if you're going to use
softdep, you need to disable write cache - and if you're going to
disable write cache, you pretty much need softdep or the performance
impact is truly horrendous.  Softdep will save you an awful lot of
sync rewrites updating the same inode or directory over and over.
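
(For concreteness: the write cache can be toggled per-drive with
dkctl(8), something like "dkctl wd0 setcache r" to leave only the
read cache enabled, and softdep is enabled with the usual
"mount -o softdep" - though check the man pages, I'm quoting the
setcache syntax from memory.)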

The combination of the two works really rather well from a stability
point of view, but there is a performance impact, especially when lots
of small files and inode updates are getting flushed out, ping-pong
style, one small write at a time, and the disk channel is spending
most of its time idle. Write cache allows these writes to appear to
complete ~instantly, because the drive lies to the host and says
they're done when they're only cached.  If the cache were in NVRAM,
that would be fine, but it's not.

More expensive SCSI drives (and newer SATA disks, but we can't use
that functionality yet) allow multiple outstanding commands to
complete independently, and allow enforced ordering points where
needed.

Most IDE disks do, however, include a sync cache command.  I've been
trying to think of a good way to use this to achieve a good compromise
between the full speed of write-cache for multiple implied outstanding
commands, and the safety of one-write-at-a-time all the way to the
platters.

Previously, I was thinking about a special kind of disk transaction
that the filesystem could invoke, a bit like a memory or bus_space
barrier instruction, at each "sync point" when the filesystem was
supposedly stable.  That transaction would invoke the sync cache
command when it hit the wd driver (and might do other interesting
things in other drivers, like fss(4)).  Rumours are that Windows' NTFS
works something like this.  The trouble was that it might take
considerable work to use properly.
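
To illustrate the shape of it (the names and the DKC_BARRIER tag are
invented here, not a proposal for the actual interface):

/*
 * Hypothetical sketch only: a barrier modelled as a command tag the
 * driver dispatches on.  None of these names exist in the tree.
 */
#include <stdio.h>

enum dk_cmd { DKC_WRITE, DKC_BARRIER };

static void
dispatch(enum dk_cmd cmd, int blkno)
{
	switch (cmd) {
	case DKC_WRITE:
		printf("queue write, block %d\n", blkno);
		break;
	case DKC_BARRIER:
		/*
		 * Everything issued before this point must reach the
		 * platters before anything after it: wd(4) would issue
		 * SYNC CACHE here, and fss(4) might snapshot instead.
		 */
		printf("SYNC CACHE\n");
		break;
	}
}

int
main(void)
{
	dispatch(DKC_WRITE, 100);	/* data block */
	dispatch(DKC_BARRIER, 0);	/* filesystem sync point */
	dispatch(DKC_WRITE, 200);	/* metadata that must follow */
	return 0;
}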

Another idea was a cheesy hack to just call synccache every N disk
commands, rotor-fashion. I've tried a simulation of this, with a
"dkctl wd0 synccache" called in a loop with a 0.1s sleep, just to
measure the relative speed benefit, and it was significant. The
trouble with this, of course, is that the filesystem still thinks the
writes have completed when they may not have, and they may still get
reordered.

Today, another simpler alternative occurred to me.  What we want is
for the upper layers to have access to the multiple outstanding
commands that are implied in the disk's write cache, and confirmation
that they are on stable storage, even though we don't get individual
tagged completion events for them.

The idea is for the wd driver to issue commands to the disk, with
write-cache on, but to collect the ~immediate completion events in a
separate completion queue. The driver periodically (on a number of
conditions, see below) issues a sync cache command, and only when that
returns does it biodone() the previous requests in the completion
queue, all at once.  This way, there's no chance we're lying to the
filesystem about completed writes, but we can still stream requests to
disk quickly, and allow the drive to do internal reordering according
to physical geometry as well.  If the filesystem has ordering
requirements, it won't issue later writes until the earlier ones have
been reported as completed.
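
To make that concrete, here's a rough userspace model of the queueing
behaviour I mean.  All the names (wd_issue_write, wd_flush_and_complete,
the disk_* stubs) are invented for illustration; this is the shape of
the idea, not actual wd(4) code:

/*
 * Writes are "completed" by the drive ~immediately (write cache on),
 * but we park them on a completion queue and only report them done
 * once a SYNC CACHE has confirmed they're on stable storage.
 */
#include <sys/queue.h>
#include <stdio.h>
#include <stdlib.h>

struct wreq {
	int blkno;
	TAILQ_ENTRY(wreq) q;
};

TAILQ_HEAD(wq, wreq);
static struct wq completion_q = TAILQ_HEAD_INITIALIZER(completion_q);

/* Stand-ins for the real driver and disk operations. */
static void disk_write_cached(int b) { printf("WRITE %d (cached ack)\n", b); }
static void disk_sync_cache(void)    { printf("SYNC CACHE\n"); }
static void biodone_stub(int b)      { printf("biodone %d\n", b); }

/*
 * Issue a write with write cache on: the disk acks ~instantly, but
 * instead of reporting completion upward we park the request.
 */
static void
wd_issue_write(int blkno)
{
	struct wreq *r = malloc(sizeof(*r));

	if (r == NULL)
		abort();
	r->blkno = blkno;
	disk_write_cached(blkno);
	TAILQ_INSERT_TAIL(&completion_q, r, q);
}

/*
 * Flush point: issue SYNC CACHE, and only when it returns report all
 * the parked requests as done, in one batch.
 */
static void
wd_flush_and_complete(void)
{
	struct wreq *r;

	disk_sync_cache();
	while ((r = TAILQ_FIRST(&completion_q)) != NULL) {
		TAILQ_REMOVE(&completion_q, r, q);
		biodone_stub(r->blkno);
		free(r);
	}
}

int
main(void)
{
	int i;

	/* A burst of small writes streams to the drive at cache speed... */
	for (i = 0; i < 8; i++)
		wd_issue_write(100 + i);
	/* ...and one SYNC CACHE completes the whole batch honestly. */
	wd_flush_and_complete();
	return 0;
}

The point is that biodone() only ever runs after a sync cache command
has returned, so ordering-sensitive callers like softdep never see a
completion the platters haven't.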

Conditions for issuing a synccache in this fashion might include:
 - all currently pending write requests are already in the completion
   queue (i.e., the input queue has drained)
 - some threshold magic number of requests issued since the last one
 - some timeout since the last one, to try to reduce potential latency

but actually I think only the first condition is strictly necessary,
as long as the input queue gets deep enough to make the whole thing
worthwhile.  The worst problem case with write cache disabled is when
softdep issues a storm of tiny inode updates all at once, which is
exactly when this scheme will work best.
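
In terms of the sketch above, the trigger policy might look something
like this (the threshold values are pulled out of the air):

#define SYNC_THRESHOLD	32	/* requests issued since the last sync */
#define SYNC_TIMEOUT_MS	100	/* latency bound */

/*
 * Decide whether to issue SYNC CACHE now.  Only the first test is
 * strictly necessary; the other two just bound batch size and latency.
 */
static int
wd_should_sync(int inputq_empty, int nparked,
    int issued_since_sync, int ms_since_sync)
{
	if (nparked == 0)
		return 0;	/* nothing waiting to be completed */
	if (inputq_empty)
		return 1;	/* all pending writes are parked */
	if (issued_since_sync >= SYNC_THRESHOLD)
		return 1;	/* magic number of requests */
	if (ms_since_sync >= SYNC_TIMEOUT_MS)
		return 1;	/* timeout since the last one */
	return 0;
}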

Comments?

--
Dan.