Subject: Re: wd, disk write cache, sync cache, and softdep.
To: Daniel Carosone <dan@geek.com.au>
From: Bill Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 12/16/2004 16:10:39

On Fri, Dec 17, 2004 at 07:18:25AM +1100, Daniel Carosone wrote:
> On Thu, Dec 16, 2004 at 01:35:43PM -0500, Steven M. Bellovin wrote:
> > In message <20041216182737.GA8856@netbsd.org>, Bill Studenmund writes:
> > >
> > >I think it'd be better to add a "Force Unit Access" attribute to writes,
> > >and have them set when we need the write to bypass the cache. Among other
> > >things, this action will directly map to SCSI command semantics (i.e. you
> > >fix the problem for both SCSI and IDE drives).
> >
> > But under what conditions should higher layers set this flag?
>
> Exactly; this is somewhat akin to my earlier idea of a barrier-type
> operation that percolates through the softdep trees until it hits the
> disk and triggers a 'sync point'.  It would be nice to have,
> certainly, but non-trivial at best to implement.

I disagree. I agree it needs thought, but I think it can be done.

In another thread, I add code to VOP_SYNC() that forces cache flushing. So
it's not hard to create a cache-sync barrier.

But as below, I don't think you really need a barrier, you need the FUA
semantics, which we will probably implement as a barrier.

> The thing I liked about the completion queue idea is that it stays
> entirely within the disk driver layer, and merely restores the disk
> semantics that are assumed by all the upper layers: biodone buffers
> are safely on stable storage.

But those semantics are not always the right ones for the upper levels.
Yes, those are the ones everything grew up with. But write-back caches
were added to disks (and enabled by default) for a reason - they let the
disk perform very well. Other OSs cope well with this. If we want to
perform well, we will need to as well.

I agree it's wrong to assume we don't have write caches when we do, but I
think you tackle the problem backwards. Rather than hide the write caches,
I think we need to start changing the upper code to deal with them.

> The ideas are not incompatible, of course: arrival of a write with
> one of these force flags would be another condition to trigger a sync
> cache.
>
> I wasn't aware that there was a problem for SCSI disks, as Bill

I'd say the problem is more that we are ignoring the cache than that the
cache is itself a problem.

> suggests, by the way. I don't know if the filesystems are smart
> enough, or have the interface to indicate, that a given write should
> be done as a tagged or ordered command, or whether they just wait to
> issue later writes until earlier ones are returned.

I'll describe more below, but ordered is NOT what we want.

> I assumed that they might issue sync writes, and that those might be
> mapped by sd to ordered commands in this fashion, when I suggested
> that this should be another completion criterion.  SCSI disks are
> supposed to complete all outstanding tagged commands before each
> ordered command, and we could emulate the same behaviour - all
> assuming we get the information from the upper levels.

We REALLY don't want to start using ordered commands here.

To make sure we're all on the same page, my understanding is that we want
to have a sequence of writes hit the disk media in a given order. We use
this to ensure file system consistency. Consider a set of writes A1, A2,
A3. We don't want some future (i.e. post-reboot) file system examiner to
see A2 without A1. Nor A3 without both A1 and A2.
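
The invariant in the paragraph above can be stated as "whatever is on the
media must be a prefix of the stream". A minimal sketch of that check, in
Python rather than kernel code; the function name and the use of plain
strings for writes are illustrative assumptions, not anything in the tree:

```python
def is_consistent(stream, on_media):
    """Return True if the writes found on the media form a prefix of `stream`.

    `stream` is the intended order (e.g. ["A1", "A2", "A3"]); `on_media` is
    the set of writes a post-reboot examiner would find. Hypothetical helper.
    """
    seen_missing = False
    for write in stream:
        if write in on_media:
            # A later write is present although an earlier one is missing.
            if seen_missing:
                return False
        else:
            seen_missing = True
    return True
```

With the stream A1, A2, A3, finding only A1 (or nothing at all) passes,
while finding A2 alone, or A1 and A3 without A2, fails.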

While ORDERED commands are one way to do this, the big drawback with them
is that they impose ordering on any other write streams. Say there are two
other streams, B1 and B2, and C1, C2, and C3. We don't care if it's A1, C1,
B1, C2, A2, A3, B2, C3 or A1, A2, A3, B1, B2, C1, C2, C3 or C1, A1, A2, B1,
B2, A3, C2, C3. Or any other permutation, as long as the individual orders
are respected. If we use FUA-type functionality, we can have up to three
commands outstanding at once, and we let the disk re-order them. If we use
ORDERED tags, we have at most one command being processed at once.
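
To make the "any permutation, as long as the individual orders are
respected" rule concrete, here is a small sketch in Python (not kernel
code). The stream table and both function names are hypothetical, and the
check assumes each write appears exactly once in the sequence:

```python
from math import factorial

# The three independent write streams from the example above.
STREAMS = {
    "A": ["A1", "A2", "A3"],
    "B": ["B1", "B2"],
    "C": ["C1", "C2", "C3"],
}

def respects_stream_orders(sequence, streams=STREAMS):
    """True if every stream's writes appear in their own order in `sequence`."""
    for writes in streams.values():
        positions = [sequence.index(w) for w in writes]
        if positions != sorted(positions):
            return False
    return True

def count_valid_orders(streams=STREAMS):
    """Number of interleavings that keep each stream's internal order."""
    total = factorial(sum(len(w) for w in streams.values()))
    for writes in streams.values():
        total //= factorial(len(writes))
    return total
```

All three sequences listed above pass respects_stream_orders(). For these
streams count_valid_orders() gives 8!/(3! 2! 3!) = 560 acceptable orders,
whereas tagging every write ORDERED would force the disk to execute them in
exactly the order they were issued.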

As another case, consider a journaled file system. You write to the
journal, wait for completion, then write to all of the blocks on the disk,
wait for completion, then mark the journal entry as done. We want to make
sure all of the "other writes" are finished before proceeding. Consider
the case where we have 20 updates in one journal entry. The "other write"
step really is 20 writes. We don't care which of them finish when, we just
care that we don't mark the journal entry as done until they all are; the
disk can reorder them as it wants.
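
The journal sequence above can be sketched as a toy simulation in Python.
This is not file system code: the function name, the log strings and the
shuffle standing in for the disk's reordering are all illustrative
assumptions.

```python
import random

def commit_journal_entry(updates, rng=random):
    """Simulate one journal commit and return the order of events.

    `updates` is a list of block names; `rng` supplies shuffle() and stands
    in for the disk's freedom to reorder. Hypothetical helper.
    """
    log = ["journal-write"]        # 1. write the journal entry, wait for it

    pending = list(updates)
    rng.shuffle(pending)           # 2. the disk completes these in any order
    completed = set()
    for block in pending:
        log.append(f"write:{block}")
        completed.add(block)

    # 3. only once every update has completed may the entry be retired
    if completed != set(updates):
        raise RuntimeError("journal entry retired before all writes finished")
    log.append("journal-done")
    return log
```

For a 20-update entry the returned log always starts with "journal-write"
and ends with "journal-done"; the 20 block writes in between may come out
in any order.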

I agree ignoring the write caches is an issue, I just think we will do
much better to deal with them rather than ignore them.

Take care,

Bill
