Subject: Re: FFS journal
To: Manuel Bouyer <bouyer@antioche.eu.org>
From: Bill Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 07/05/2006 17:31:29

On Thu, Jul 06, 2006 at 12:26:51AM +0200, Manuel Bouyer wrote:
> On Wed, Jul 05, 2006 at 03:02:52PM -0700, Bill Studenmund wrote:
> >
> > But continuing to order MD writes when you have a journal is like trying
> > to shove softdeps into the journal. Part of the idea of a journal is that
> > a transaction happens or it doesn't. So all of the changes for an
> > operation are in the same journal entry (or all the steps of a given
> > stage, when you're deleting a huge file).
>
> Sure. My concern is if we lose the journal. fsck on a multi-terabyte
> volume is not fun, but restoring a multi-terabyte volume from tapes
> is not either, and I'm sure fsck would still be faster.

As I suggested, let's look at what other OSs do for this. I think
strengthening the journal's storage is a better solution than MD ordering.

> > If we then order MD writes,
> > after we write a transaction to the journal, we have to scribble out a
> > sequence of writes before we can mark the transaction as done.
> >
> > That will 1) slow us down. The run-time performance gain of journaling is
> > that we get rid of a stream of sequenced MD writes.
>
> I'd like to see numbers about this. On a multi-disk raid set, with
> battery-backed cache I'm not sure it matters. It may matter a little on
> lower-end hardware, but on such systems it may be acceptable to go full
> async.
>
> > 2) Since we are caring
> > about on-disk state, we may need to build part of softdeps into this. If
>
> My idea was to add journaling to softdep.
>
> > we are performing a MD sequence that before-hand would have had two
> > different changes to a block (say we update an inode, do stuff, update the
> > inode again), we then have to recreate the block between operations, write
> > it, then write the block as it ended up.
>
> Sure. It adds entries to the journal, but that's all I see. Otherwise it
> would be just like 2 updates on 2 different blocks. I'm probably missing
> something.

Oh. That's a key point.

A journal entry or transaction is a whole operation of some sort. It
touches all the blocks impacted by an operation.

So say we add a block to a file. A transaction will include updating the
inode, changing the free bitmap to indicate the block is in use, and
updating the indirect block pointer table to include the new
block. So that's 3 writes in one transaction. Since adding the block
either happens or it doesn't, we can't break the journal transaction into
more transactions that include only part of the operation.
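
As a rough sketch of that grouping (Python just for brevity; every name
below is made up for illustration and is not anything in the FFS code), one
transaction carries all three block updates, and replay applies either all
of them or none:

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    """One journal entry: every block write belonging to a single operation."""
    op: str
    writes: dict = field(default_factory=dict)  # block name -> new contents
    committed: bool = False


def append_block_txn(inode_no, new_blk):
    """Hypothetical helper: build the transaction for adding one block."""
    txn = Transaction(op=f"append block {new_blk} to inode {inode_no}")
    # The three metadata writes named above travel together.
    txn.writes[f"inode:{inode_no}"] = "size and block count updated"
    txn.writes["free-bitmap"] = f"block {new_blk} marked in use"
    txn.writes[f"indirect:{inode_no}"] = f"pointer to block {new_blk} added"
    txn.committed = True  # stands in for the commit record reaching the journal
    return txn


def replay(journal, disk):
    """Apply committed transactions whole; skip uncommitted ones entirely."""
    for txn in journal:
        if not txn.committed:
            continue  # the operation "didn't happen"
        disk.update(txn.writes)
    return disk
```

A transaction whose commit record never made it to the journal contributes
nothing on replay, which is the "happens or it doesn't" property.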

If we have to add a new block of indirect pointers, marking the indirect
block as in use and updating whatever pointer points to it also has to be
in the transaction.

That's why I've been going on about ordering MD writes making a
transaction take longer. There will often be at least two or three blocks
in a transaction. So rather than doing two or three writes in parallel, we
now have to sequence them. And if we keep all of the intermediate state
views, we may have to wait for more writes.
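
To put the parallel-versus-sequenced point in toy numbers (a
back-of-the-envelope model, not a measurement; it assumes the device can
service the writes concurrently and ignores the journal write itself, and
the per-write times are invented):

```python
def parallel_latency(write_times_ms):
    """All block writes are issued at once; we wait for the slowest one."""
    return max(write_times_ms)


def ordered_latency(write_times_ms):
    """Each write must finish before the next one is issued."""
    return sum(write_times_ms)


# Inode, free bitmap, and indirect block, at an assumed 8 ms per write.
times = [8, 8, 8]
print(parallel_latency(times))  # 8
print(ordered_latency(times))   # 24
```

Keeping intermediate views of a block adds more entries to the ordered
list, so under these assumptions the ordered total keeps growing while the
parallel figure stays at the slowest single write.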

> Anyway, that's my last mail on the subject for the next 2 weeks :)

Have a good trip.

> That's possible. But I'm concerned about the journal integrity on modern
> disks, which lie about a lot of things (and, amongst others, sector size).
> Low-end storage also exists and we have to live with it.
> Journaling needs truly atomic writes to disk, much more than a traditional
> ffs. How we can achieve truly atomic writes with modern disks needs more
> thought.

What statistics do we have on journal integrity issues? Journals have been
in AIX for over 15 years, and I expect the IBM folks to be more concerned
about this than anyone here is. They've been in Linux for how long? Years?
We should be able to tell if this really is an operational concern.

And as before, if we're really concerned about this, let's just RAID the
journal. I've already suggested having multiple options for how
the journal is stored; let's just add more. Add ones that address your
concerns. We will end up with less code, we are much more likely to get
working code, and we will offer more options to our users.

Take care,

Bill
