Subject: Re: wd, disk write cache, sync cache, and softdep.
To: Daniel Carosone <dan@geek.com.au>
From: Bill Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 12/16/2004 10:27:37

On Thu, Dec 16, 2004 at 06:13:42PM +1100, Daniel Carosone wrote:
> More expensive SCSI drives (and newer SATA disks, but we can't use
> that functionality yet) allow multiple outstanding commands to
> complete independently, and allow enforced ordering points where
> needed.
>
> Most IDE disks do however include a sync cache command.  I've been
> trying to think of a good way to use this to achieve a good compromise
> between the full speed of write-cache for multiple implied outstanding
> commands, and the safety of one-write-at-a-time all the way to the
> platters.
>
> Previously, I was thinking about a special kind of disk transaction
> that the filesystem could invoke, a bit like a memory or bus_space
> barrier instruction, at each "sync point" when the filesystem was
> supposedly stable.  That transaction would invoke the sync cache
> command when it hit the wd driver (and might do other interesting
> things in other drivers, like fss(4)).  Rumours are that Windows' NTFS
> works something like this. The trouble was, that might take
> considerable work to use properly.

I think it'd be better to add a "Force Unit Access" attribute to writes,
and have it set when we need the write to bypass the cache. Among other
things, this attribute maps directly to SCSI command semantics (i.e. you
fix the problem for both SCSI and IDE drives).

> Another idea was a cheesy hack to just call synccache every N disk
> commands, rotor-fashion. I've tried a simulation of this, with a
> "dkctl wd0 synccache" called in a loop with a 0.1s sleep, just to
> measure the relative speed benefit, and it was significant. The
> trouble with this, of course, is that the filesystem still thinks the
> writes have completed when they may not have, and they may still get
> reordered.
>
> Today, another simpler alternative occurred to me.  What we want is
> for the upper layers to have access to the multiple outstanding
> commands that are implied in the disk's write cache, and confirmation
> that they are on stable storage, even though we don't get individual
> tagged completion events for them.

Are you sure? I don't think those are quite the semantics we want.

> The idea is for the wd driver to issue commands to the disk, with
> write-cache on, but to collect the ~immediate completion events in a
> separate completion queue. The driver periodically (on a number of
> conditions, see below) issues a sync cache command, and only when that
> returns does it biodone() the previous requests in the completion
> queue, all at once.  This way, there's no chance we're lying to the
> filesystem about completed writes, but we can still stream requests to
> disk quickly, and allow the drive to do internal reordering according
> to physical geometry as well.  If the filesystem has ordering
> requirements, it won't issue later writes until the earlier ones have
> been reported as completed.

If you limit yourself to only "Force Unit Access" commands (to writes that
specifically request this behavior), then this trick would probably be
fine.

> Conditions for issuing a synccache in this fashion might include:
>  - all currently pending write requests in the completion queue
>  - some threshold magic number of requests issued since the last one
>  - some timeout since the last one, to try and reduce potential latency
>
> but actually I think only the first condition is strictly necessary,
> as long as the input queue gets deep enough to make the whole thing
> worthwhile. The worst problem case with write cache disabled is when
> softdep issues a storm of tiny inode updates all at once, which is
> exactly when this will work best.

Your first condition didn't parse. As I understand you, the completion
queue will only contain pending write requests, so I don't see how that
can (or should) trigger a synccache. :-)

Take care,

Bill
