Subject: Re: FFS journal
To: Manuel Bouyer <bouyer@antioche.eu.org>
From: Bill Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 07/05/2006 15:02:52

On Wed, Jul 05, 2006 at 10:38:53PM +0200, Manuel Bouyer wrote:
> On Wed, Jul 05, 2006 at 01:31:19PM -0700, Bill Studenmund wrote:
> > On Wed, Jul 05, 2006 at 10:20:21PM +0200, Manuel Bouyer wrote:
> > > On Wed, Jul 05, 2006 at 12:49:54PM -0700, Jason Thorpe wrote:
> > >
> > > I think I explicitly said that we need to keep ordered MD writes for this
> >
> > Uhm, the idea of a journal is that you no longer have to order MD writes.
> > As I understand it, that is the _point_. To retain ordering when we have a
> > journal is defeating the purpose.
>
> The point could also be to make fsck faster, and that is usually the feature
> the average user sees.

I think we're talking past each other a bit.

I'm not saying that fsck time doesn't drive interest in journaling. It
really does. Getting a multi-TB fs back up in seconds, as opposed to an
hour or a few hours, is VERY interesting.

But continuing to order MD writes when you have a journal is like trying
to shove softdeps into the journal. Part of the idea of a journal is that
a transaction either happens or it doesn't. So all of the changes for an
operation are in the same journal entry (or all the steps of a given
stage, when you're deleting a huge file). If we then order MD writes,
then after we write a transaction to the journal, we have to scribble out
a sequence of writes, in order, before we can mark the transaction as done.
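
To make the all-or-nothing point concrete, here is a minimal sketch in
Python. It is not FFS code and not any real journal format: the Journal
class, its method names, the apply_in_place helper, and the dict standing
in for the disk are all made up for illustration. It assumes each journal
entry carries full images of the changed blocks, so replay is idempotent.

```python
# Hypothetical toy model: "disk" is a dict of block name -> contents.
# Nothing here reflects real FFS structures or a real journal layout.

class Journal:
    def __init__(self):
        self.entries = []  # committed transactions, oldest first

    def commit(self, changes):
        # One entry holds every MD change for an operation, so the
        # operation is recorded entirely or not at all.
        self.entries.append({"changes": dict(changes), "done": False})
        return len(self.entries) - 1

    def mark_done(self, txid):
        self.entries[txid]["done"] = True

    def replay(self, disk):
        # After a crash, reapply every transaction not marked done.
        # Each change is a full block image, so reapplying is harmless.
        for entry in self.entries:
            if not entry["done"]:
                disk.update(entry["changes"])
                entry["done"] = True


def apply_in_place(disk, changes, order, crash_after=None):
    # Write the blocks home in whatever order is convenient.
    # crash_after simulates losing power after that many writes.
    for n, blk in enumerate(order):
        if crash_after is not None and n >= crash_after:
            return False
        disk[blk] = changes[blk]
    return True


disk = {"inode": "free", "dirent": "empty", "bitmap": "0"}
journal = Journal()

# Creating a file touches three MD blocks; all go in one journal entry.
changes = {"inode": "allocated", "dirent": "name->inode", "bitmap": "1"}
txid = journal.commit(changes)

# Crash after one in-place write, deliberately in the "wrong" order
# (directory entry written before the inode it points to).
finished = apply_in_place(disk, changes, ["dirent", "inode", "bitmap"],
                          crash_after=1)
if finished:
    journal.mark_done(txid)

# The disk is inconsistent at this point; replay repairs it without
# the in-place writes ever having been ordered.
journal.replay(disk)
```

The in-place writes can land in any order because the journal entry, not
the write order, is what makes the operation recoverable.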

That will 1) slow us down. The run-time performance gain of journaling is
that we get rid of a stream of sequenced MD writes. 2) Since we would then
care about intermediate on-disk state, we may need to build part of
softdeps into this. If we are performing an MD sequence that previously
would have made two different changes to a block (say we update an inode,
do stuff, update the inode again), we then have to recreate the block as
it stood between the operations, write it, then write the block as it
ended up.
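
The inode example can be sketched the same way. Again this is a toy under
stated assumptions, not softdeps or FFS code: the function names and the
(block, field, value) tuples are invented for illustration. The sketch
takes a snapshot after every change, which is the easy direction; a real
kernel would have to reconstruct the intermediate image from a buffer
that already holds the final state.

```python
# Hypothetical toy model: each op is (block, field, value).

def ordered_writes(ops):
    # With ordered MD writes, every intermediate state has to reach the
    # disk in sequence, so each change costs its own block write.
    state = {}
    writes = []
    for blk, field, value in ops:
        state.setdefault(blk, {})[field] = value
        writes.append((blk, dict(state[blk])))  # block image as written
    return writes


def journaled_writes(ops):
    # With a journal, the transaction already records all the changes,
    # so only each block as it ended up has to be written home.
    state = {}
    for blk, field, value in ops:
        state.setdefault(blk, {})[field] = value
    return [(blk, dict(fields)) for blk, fields in state.items()]


# Update an inode, do stuff, update the inode again.
ops = [
    ("inode", "size", 4096),
    ("dirent", "name", "foo"),
    ("inode", "size", 8192),
]

ordered = ordered_writes(ops)      # inode written twice, dirent once
journaled = journaled_writes(ops)  # each block written once, final state
```

Here the ordered path writes the inode block twice, the first time with a
state that no longer exists in memory by the end of the operation.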

I STRONGLY discourage the idea of ordering MD writes if journaling is in
use. You'll make a mess of things.

> > > > (that's what you have the journal for!)
> > >
> > > It's also to make fsck faster. I'd prefer to have at least the option to
> > > keep ordered writes and have fsck deal with the replay, so that we're
> > > guaranteed to always be able to read the filesystem (and eventually repair
> > > it) even if the journal is corrupted.
> >
> > If you lose the journal, you need to do a full fsck before touching the
> > file system.
>
> But with unordered metadata writes you may end up with an unrecoverable
> filesystem.

Maybe. But that can happen with any journaled file system. And there will
always be ways to ruin a file system.

To the extent this is a concern, I think a much better way to protect
against it is to put your journal on a RAID 1 or a RAID 10. From the work
I've done in the guts of file systems, life will be MUCH EASIER and much
more bug-free if we just improve the storage reliability of the journal
rather than try to continue to order MD writes.

Take care,

Bill
