Subject: Re: FFS journal
To: Bill Studenmund <wrstuden@netbsd.org>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: tech-kern
Date: 07/06/2006 00:26:51
On Wed, Jul 05, 2006 at 03:02:52PM -0700, Bill Studenmund wrote:
> > The point could also be to make fsck faster, and that is usually the feature
> > the average user sees.
> 
> I think we're talking past each other a bit.
> 
> I'm not trying to say that fsck doesn't cause interest in journaling. It 
> really does. The fact you can get a multi-TB fs back up in seconds as 
> opposed to an hour or a few hours is VERY interesting.
> 
> But continuing to order MD writes when you have a journal is like trying 
> to shove softdeps into the journal. Part of the idea of a journal is that 
> a transaction happens or it doesn't. So all of the changes for an 
> operation are in the same journal entry (or all the steps of a given 
> stage, when you're deleting a huge file).

Sure. My concern is what happens if we lose the journal. fsck on a
multi-terabyte volume is not fun, but neither is restoring a multi-terabyte
volume from tape, and I'm sure fsck would still be faster.


> If we then order MD writes, 
> after we write a transaction to the journal, we have to scribble out a 
> sequence of writes before we can mark the transaction as done.
> 
> That will 1) slow us down. The run-time performance gain of journaling is
> that we get rid of a stream of sequenced MD writes.

I'd like to see numbers about this. On a multi-disk RAID set with
battery-backed cache, I'm not sure it matters. It may matter a little on
lower-end hardware, but on such systems it may be acceptable to go full
async.

> 2) Since we are caring
> about on-disk state, we may need to build part of softdeps into this. If

My idea was to add journaling to softdep.

> we are performing a MD sequence that before-hand would have had two
> different changes to a block (say we update an inode, do stuff, update the
> inode again), we then have to recreate the block between operations, write
> it, then write the block as it ended up.

Sure. It adds entries to the journal, but that's all I see. Otherwise it
would be just like 2 updates on 2 different blocks. I'm probably missing
something.
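
To make the "it only adds entries to the journal" point concrete, here is a
rough sketch in Python. The journal layout (a list of transactions, each with
a `committed` flag and a list of block writes) is made up for illustration and
is not FFS or softdep code. Two updates to the same block are simply two
journal entries, and replaying them in log order leaves the block in its final
state.

```python
def replay(disk, journal):
    """Apply committed transactions to `disk` in log order.

    `disk` maps block number -> block contents (hypothetical model).
    Replay stops at the first uncommitted transaction, so a transaction
    is applied either completely or not at all.
    """
    for txn in journal:
        if not txn["committed"]:
            break
        for block_no, data in txn["writes"]:
            disk[block_no] = data
    return disk


# Hypothetical log: block 7 holds an inode that is updated twice.
journal = [
    {"committed": True, "writes": [(7, b"inode v1"), (9, b"dir entry")]},
    {"committed": True, "writes": [(7, b"inode v2")]},
    {"committed": False, "writes": [(7, b"inode v3")]},  # crash before commit
]

disk = replay({7: b"inode v0"}, journal)
```

After replay, block 7 holds the second committed version and the uncommitted
third update leaves no trace.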

Anyway, that's my last mail on the subject for the next 2 weeks :)

> > > > > (that's what you have the journal for!)
> > > > 
> > > > It's also to make fsck faster. I'd prefer to have at least the option to
> > > > keep ordered writes and have fsck deal with the replay, so that we're
> > > > guaranteed to always be able to read the filesystem (and eventually repair
> > > > it) even if the journal is corrupted.
> > > 
> > > If you lose the journal, you need to do a full fsck before touching the 
> > > file system.
> > 
> > But with unordered metadata writes you may end up with an unrecoverable
> > filesystem.
> 
> Maybe. But that can happen with any journaled file system. And there will 
> always be ways to ruin a file system.
> 
> To the extent this is a concern, I think a much better way to protect
> against it is to put your journal on a RAID 1 or a RAID 10. From the work
> I've done in the guts of file systems, life will be MUCH EASIER and much
> more bug-free if we just improve the storage reliability of the journal
> rather than try to continue to order MD writes.

That's possible. But I'm concerned about journal integrity on modern
disks, which lie about a lot of things (sector size, among others).
Low-end storage also exists and we have to live with it.
Journaling needs truly atomic writes to disk, much more than a traditional
ffs does. How we can achieve truly atomic writes with modern disks needs more
thought.
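
One common way to cope with writes that are not atomic is to make a torn
journal record detectable at replay time, so it can be discarded instead of
applied. The sketch below is only an illustration in Python: the record
layout, `MAGIC`, and the function names are assumptions, not an existing FFS
format. Note that it detects a partial write; it does not make the write
atomic.

```python
import struct
import zlib

MAGIC = 0x4A524E4C  # hypothetical record tag ("JRNL"), not an FFS constant
HEADER = struct.Struct("<III")  # magic, payload length, CRC32 of payload


def encode_record(payload: bytes) -> bytes:
    """Frame one journal transaction so a partial write is detectable."""
    return HEADER.pack(MAGIC, len(payload), zlib.crc32(payload)) + payload


def decode_record(buf: bytes):
    """Return the payload, or None if the record is torn or corrupt."""
    if len(buf) < HEADER.size:
        return None
    magic, length, crc = HEADER.unpack_from(buf)
    payload = buf[HEADER.size:HEADER.size + length]
    if magic != MAGIC or len(payload) != length:
        return None
    if zlib.crc32(payload) != crc:
        return None
    return payload
```

A replay loop would stop at the first record for which `decode_record`
returns None and treat everything after it as never written.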

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 years of experience will always make the difference
--