Subject: Re: Journaling for FFS
To: None <tech-kern@NetBSD.org>
From: Cary G. Gray <Cary.G.Gray@wheaton.edu>
List: tech-kern
Date: 10/05/2006 13:05:08
The "log near the inodes" idea misses one of the advantages of 
journalling, which is to improve the locality of writes by concentrating 
them in the log.  Journalling just the metadata wins, because that's where 
lots of small, carefully ordered, writes can be required.

Working out how to journal isn't trivial.  Log analysis has to find the 
last complete log record, and modern disk interfaces are likely to write 
the blocks of even a large contiguous write out of order.  So there is 
considerable care required...
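
As a rough illustration of the "find the last complete log record" 
problem, here is a minimal Python sketch.  The record layout (a header 
and a trailer that both carry a sequence number, plus a CRC in the 
trailer) and the names MAGIC, make_record and last_complete_seq are 
assumptions made up for this example; it is not the format from 
Hagmann's paper or from any real filesystem.  The checksum is there 
because a matching trailer alone does not prove that the blocks in 
between reached the disk.

```python
import struct
import zlib

MAGIC = 0x4A4C4F47          # hypothetical record marker
HEADER = "<IQI"             # magic, sequence number, payload length
TRAILER = "<QI"             # sequence number again, CRC of header+payload


def make_record(seq, payload):
    """Frame a payload as header + payload + trailer (illustrative layout)."""
    header = struct.pack(HEADER, MAGIC, seq, len(payload))
    trailer = struct.pack(TRAILER, seq, zlib.crc32(header + payload))
    return header + payload + trailer


def last_complete_seq(log):
    """Scan forward; return the sequence number of the last intact record."""
    hsize = struct.calcsize(HEADER)
    tsize = struct.calcsize(TRAILER)
    off, last = 0, None
    while off + hsize <= len(log):
        magic, seq, length = struct.unpack_from(HEADER, log, off)
        if magic != MAGIC:
            break                       # unwritten or foreign area
        end = off + hsize + length + tsize
        if end > len(log):
            break                       # record runs off the end of the log
        tseq, crc = struct.unpack_from(TRAILER, log, off + hsize + length)
        if tseq != seq or crc != zlib.crc32(log[off:off + hsize + length]):
            break                       # torn write: some block never landed
        if last is not None and seq != last + 1:
            break                       # stale record from an earlier pass
        last = seq
        off = end
    return last
```

A record whose header and trailer both arrived but whose middle block 
did not is rejected by the CRC check, so analysis stops at the record 
before it.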

Fifteen years ago I worked for DEC, and I was in the middle of a project 
that included adding metadata journalling to the FFS.  We used Bob 
Hagmann's paper on the Cedar File System as our roadmap.  I left DEC 
before the work was completed; Uresh Vahalia finished implementation of 
the filesystem code and wrote a nice paper for USENIX.

Here are the references:

Robert B. Hagmann, Reimplementing the Cedar File System Using Logging and
   Group Commit, Proc. 11th ACM Symposium on Operating Systems Principles,
   pp. 155-162, Austin, TX, 1987, ACM.
(You can find it in the ACM Digital Library or through CiteSeer at
http://citeseer.ist.psu.edu/hagmann87reimplementing.html)

Uresh Vahalia, Cary G. Gray, and Dennis Ting. Metadata logging in an NFS
   server. In Proceedings of the Winter 1995 USENIX Technical Conference,
   pp. 265-76, New Orleans, LA, January 1995. USENIX.
(See
http://www.usenix.org/publications/library/proceedings/neworl/vahalia.html
for a link to it in PostScript.)

Hagmann's paper is especially good on log management and analysis of
possible failures, including what happens if you fail in the middle of
writing a log record.  The way that bits get to a disk has changed a 
lot in twenty years, so Hagmann's specific analysis would not apply 
today, but he does provide a good example of careful analysis of 
assumptions.

The approach we took was mostly-physical redo logging, as in Hagmann. 
The biggest complication from the FFS data structures is that some blocks 
change between being metadata and regular file contents--those containing 
directory data and indirect blocks.  Those transitions must somehow be 
dealt with; what I figured out back then was a way to do it via multiple 
passes during replay, one pass for each level of indirection.

It would probably be better to log the transitions of blocks to/from 
metadata status, so that the decision about whether to replay modifications 
can be based on log analysis alone.  If you do something like Hagmann's 
transaction header/trailer record scheme, you can leave room in those 
records to record the changes--you probably need to pad to keep blocks 
aligned, anyway.
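
To make "based on log analysis alone" concrete, here is a small Python 
sketch of one possible replay rule.  The record kinds ("alloc_meta", 
"alloc_data", "free", "write") and the function plan_replay are 
hypothetical names for this example, not what we implemented at DEC. 
The assumed rule is: a logged write to a block is replayed only if no 
later record in the log changes that block's status, since otherwise 
replay could overwrite file data that went to disk without being logged.

```python
TRANSITIONS = ("alloc_meta", "alloc_data", "free")


def plan_replay(log):
    """Return the indexes of logged writes that are safe to replay.

    `log` is a list of tuples: (kind, block) for status transitions and
    ("write", block, data) for physical redo records.  This shape is an
    assumption made for the sketch.
    """
    # Analysis pass: remember where each block last changed status.
    last_transition = {}
    for i, rec in enumerate(log):
        if rec[0] in TRANSITIONS:
            last_transition[rec[1]] = i

    # Replay plan: keep only writes logged after the block's last transition.
    return [
        i
        for i, rec in enumerate(log)
        if rec[0] == "write" and i > last_transition.get(rec[1], -1)
    ]


log = [
    ("alloc_meta", 7),
    ("write", 7, b"directory contents"),   # stale: block 7 is reused below
    ("free", 7),
    ("alloc_data", 7),                     # now ordinary file data
    ("alloc_meta", 9),
    ("write", 9, b"indirect pointers"),
    ("write", 3, b"inode update"),         # no transition in the log at all
]
replay = plan_replay(log)
```

Here the write to block 7 is skipped because the block later became 
file data, while the writes to blocks 9 and 3 are replayed.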

Doing so would also allow you to log and lazily reclaim the blocks pointed 
to by indirect blocks:  you can log to indicate that a particular 
nth-indirect block should be freed.  The actual freeing of the blocks 
pointed to by it can be recorded later along with the actual freeing of 
the block.  That spares you from having huge transaction records for 
delete/truncate--you end up with some number of later transactions to 
handle the freeing.  Recovery would need to enqueue any intent-to-free 
indirect blocks it finds whose freeing has not completed, so that it can 
be finished.

Now that I think about it, such an "intent to free" record could be one 
way to handle the removed-but-still-open inode problem.  You can record 
its removal from the directory and an "intent-to-free" for the inode. When 
it is finally closed, you can then either synchronously log it to be freed 
(including logging freeing the blocks it points to), or you could enqueue 
it for lazy reclamation (possibly piggybacked on another transaction). 
If a crash (or forced remount readonly, etc.) happens before it is 
reclaimed (i.e., before the actual free is written to the log), 
recovery--whenever it runs--would handle the actual freeing of the inode 
and any associated blocks.
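
The recovery side of this can be sketched in a few lines of Python. 
The record kinds "intent_free" and "freed", the (kind, target) record 
shape, and the function pending_frees are all assumptions for the 
example.  The idea is only that recovery collects every intent-to-free 
that has no matching completed free later in the log, and queues those 
for reclamation.

```python
def pending_frees(log):
    """Return intent-to-free targets with no later "freed" record.

    `log` is a list of (kind, target) pairs; targets are hashable, e.g.
    ("inode", 12) or ("indirect", 900).  Results are in log order.
    """
    pending = {}                      # dicts preserve insertion order
    for kind, target in log:
        if kind == "intent_free":
            pending[target] = None
        elif kind == "freed":
            pending.pop(target, None)
    return list(pending)


log = [
    ("intent_free", ("inode", 12)),       # unlinked while still open
    ("intent_free", ("indirect", 900)),
    ("freed", ("indirect", 900)),         # lazy reclamation finished
    ("intent_free", ("indirect", 901)),   # crash before this completed
]
queue = pending_frees(log)
```

After a crash with this log, the queue holds inode 12 and indirect 
block 901, which recovery would then free along with whatever they 
point to.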

The mostly-physical logging scheme you get in this way includes logical 
(operation) logging of just a few important state transitions for blocks 
and inodes:  allocation (as data), allocation (as metadata) [both of which 
might include zeroing], intent-to-free (for inodes and blocks of indirect 
pointers), freeing.  I'd have to spend some time working on the details to 
tell whether that list is complete...

As you can probably tell, I've wanted to get back to putting these ideas 
into an implementation, but (as right now) I always seem to need to be 
taking care of one of my classes.

 	Cary Gray