Subject: FFS journal
To: None <tech-kern@netbsd.org>
From: Kirill Kuvaldin <kirill.kuvaldin@gmail.com>
List: tech-kern
Date: 07/02/2006 19:59:50
Hi guys,

I'm about to add journaling mechanism to FFS as a part of my summer of code
project (http://netbsd-soc.sourceforge.net/projects/jffs/).

I'd like you to look at the preliminary version of a design document
and hear your feedback. ;)

Thanks in advance.

-----------------------------------------------------------------------------

 SUPPORT FOR JOURNALING FOR FFS: PROJECT PLAN
 Author: Kirill Kuvaldin (kirill.kuvaldin@gmail.com)

 I. PROJECT DESIGN

* One of the main sources of confusion about journaling is what exactly a
  journal contains. In the vast majority of journaling filesystems
  journal only contains modifications made to filesystem metadata
  (i.e. changes to directories, inodes, inode and block bitmaps). The
  journal doesn't contain any user data stored in a file.

* The key point of journaling mechanism is a notion of transaction.
  A transaction is a complete set of modifications made to the on-disk
  metadata during one operation (e.g. creating a file or directory). The
  main role of a transaction from the high-level point of view is to be
  atomic with respect to system failures -- i.e. a transaction happens
  completely, or it doesn't happen at all.

* The biggest benefit from journaling can be achieved when using it with
  the asynchronous semantics. It means that data blocks forming one
  transaction are not flushed to the disk as soon as the transaction is
  completely written to the journal area. When the data is later flushed
  from the cache, the cache manager can sort the list of blocks by their
  disk addresses, which minimize the seek time when writing the blocks.
  Asynchronous journaling filesystem can be several times faster than a
  traditional synchronous FFS.

* The technique of batching multiple transactions into a single
  transaction is often known as *group commit*. Group commit can offer
  significant speed advantage to a journaling filesystem because it
  amortized the cost of writing to disk over many transactions.

* The operations considered to be transactions are listed below:
  - create a file;
  - delete a file;
  - rename a file (including deletion of the existing name);
  - change the size of a file (growing or shrinking);
  - ... (anything else?)

* When a journaled filesystem reboots, if there are any journal entries
  that were not marked as completed, the system must *replay* the entries
  to bring the metadata structures up-to-date. Replaying the journal
  prevents partial updates because each journal entry is a complete,
  self-contained transaction.

* The biggest bottleneck of journaling mechanism is that all transactions
  write to a single journal. A single journal effectively forces the
  filesystem into a single-threaded model for updates. This is a serious
  disadvantage when the concurrent semantics is required. Due to the time
  constraints to a summer of code projects the support for multiple
  journals will not be implemented under the scope of this project.

 II. DELIVERABLES

 Mandatory:

 * Finishing design document (expected to deliver by the 7th July):
   - resolve all open issues;
   - probably make several modifications to desing according to mentors
     (or community) feedback.

 * Implementing journaling core features (expected to deliver by the 31st
   July):
   - new source files (maybe "jffs.{c,h}"?) in sys/ufs/ffs/ directory
     containing the implementations of journaling functions (described
     later in this document);
   - the set of patches providing modifications to existing FFS and UFS
     layers to add hooks for journaling code;
   - the set of patches providing modification to the NetBSD
     configuration to make it possible to use the journal as an option;
   - maybe to provide single cummulative patch containing all the
     modifications to make it easier to apply it to the NetBSD source
     tree.

 * The set of unit tests (expected to deliver by 13th August):
   - to ensure that code works correctly;
   - to reveal the possible bugs;
   - to investigate code behaviour under some sort of boundary cases;
   - to analyze the filesystem performance.

 Optional:

 * Support for batching transactions:
   - it may be a significant performance win.
 * API Documentation:
   - it may be helpful for the developers to understand what the
     journaling code does and how to use it.
 * Benchmarks:
   - to compare the filesystem performance with different mount options:
     with journal, with soft updates, without them.
   - to compare the FFS performance against the performance of other
     journaling filesystems (NB: I don't know whether it makes sense
     because these benchmarks can't be performed only within NetBSD due
     to absence of journaling filesystems).
 * Profiling:
   - to explore the code bottlenecks, probably to optimize some parts of
     code;
   - to investigate code coverage.

 III. TECHNICAL DETAILS

 * Journal internals:
   - The journal area (or log area) used to write journal entries is a
     fixed data allocated at filesystem initialization. The filesystem
     superblock  must maintain a reference to the journal area which also
     contains its own superblock where some sort of necessary information
     is stored. Two indices (start_index and end_index) that point to the
     start and end of the active area of the journal that is used in
     circular fashion, simply mark the bounds of the journal that contain
     active transactions.

 +-------------+------------------------+------------------------+------+
 |  Journal    |                        |                        |      |
 | superblock  | t r a n s a c t i o n  | t r a n s a c t i o n  |      |
 |+-----------+|+-------+ +-------+     |+-------+ +-------+     |      |
 ||start_index|||       | |       |     ||       | |       |     |      |
 ||end_index  |||       | |       | ... ||       | |       | ... | .... |
 || ...       |||       | |       |     ||       | |       |     |      |
 |+-----------+|+-------+ +-------+     |+-------+ +-------+     |      |
 +-------------+------------------------+------------------------+------+
  Figure 1: Journal area on-disk representation

 * Journaling API:
 The following ideas inspired from the BeFS textbook (see [2]). Although,
 there are only 3 functions for journal management, it may be enough for
 the rest part of filesystem to interact with journaling code.

 - jffs_start_transaction():
   o acquire the journal semaphore, holding it under the transactions
     completes;
   o ensure that there is enough space available in the journal to hold
     this transaction and in case there is - make some preparation
     actions and allocate the necessary transaction structures;
     otherwise, to force flushing blocks out of the cache, preferably
     those that were part of previous transactions;
   o set the state of transaction to *running* allowing the filesystem
     code to add new blocks to form the transaction structure.

 - jffs_write_blocks():
   o during a transaction any code that modifies a block of the
     filesystem metadata must call this function on the modified data;
   o for the sake of performance it may be possible to modify only the
     in-memory journal structures and later flush them to the log.

 - jffs_end_transaction():
   o at first this function turns a transaction into the *locked* state,
     meaning that no more block can be added to the transaction;
   o write all in-memory transaction blocks to their appropriate places
     into the journal area. When the last block is written to the
     journal, the transaction is considered to be *finished*;
   o set the callback function that will change the transaction state to
     *completed* as soon as the journal entry will be completely flushed
     to disk;
   o release the journal semaphore.

 * Journaling constraints on the cache subsystem:
   - journaling code must be able to lock disk blocks in the cache to
     prevent them from being flushed.
   - journaling code must know when a disk block is flushed to disk. It
     may be achived with callback functions if cache subsystem supports
     them. When the last block forming the transaction is flushed to
     disk, the transaction considered to be completed.

 IV. OPEN ISSUES

 * Journal location: The journal can reside on the same device as the
   rest part of file system does, but putting the journal on the
   different device may be a performance win. E.g., Seltzer paper [3]
   describes the approach when the journal is managed by write-ahead
   filesystem (wafs). It may require some additional time to implement
   the similar functionality.
 * The maximum size of a transaction is to be determined.
   The jffs_start_transaction() function uses this information to ensure
   that there is enough space available to hold the new transaction.
 * The functionality for recovering after the crashes (journal replaying)
   need to be clearly defined and added to this document.
 * Probably the list of transactions must be extended.

 V. ADDITIONAL INFORMATION AND REFERENCES

 [1] Marshall Kirk McKusick, Keith Bostic, Michael J. Karels,
     John S. Quatermann
     "The Design and Implementation of the 4.4 BSD Operating System".
     Addison-Wesley Professional, ISBN 0201549794. 1996.
 [2] Dominic Giampaolo.
     "Practical  File System Design with the Be File System".
     Morgan Kaufmann Publishers, ISBN 1558604979. 1999.
 [3] Seltzer et al.
     "Journaling vs. Soft Updates: Asynchronous Meta-data protection in
     File Systems".
     USENIX Technical Conference. 2000.
 [4] The NetBSD Developers. "NetBSD Internals".
     Copyright (c) 2006 The NetBSD Foundation.
 [5] NetBSD Source Code
     http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/