Subject: Re: FFS journal
To: Kirill Kuvaldin <kirill.kuvaldin@gmail.com>
From: Bill Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 07/03/2006 10:16:50
--LZvS9be/3tNcYl/X
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Sun, Jul 02, 2006 at 07:59:50PM +0400, Kirill Kuvaldin wrote:
> Hi guys,
>=20
> I'm about to add journaling mechanism to FFS as a part of my summer of co=
de
> project (http://netbsd-soc.sourceforge.net/projects/jffs/).
>=20
> I'd like you to look at the preliminary version of a design document
> and hear your feedback. ;)
>=20
> Thanks in advance.
>=20
> -------------------------------------------------------------------------=
----
>=20
> SUPPORT FOR JOURNALING FOR FFS: PROJECT PLAN
> Author: Kirill Kuvaldin (kirill.kuvaldin@gmail.com)
>=20
> I. PROJECT DESIGN
>=20
> * One of the main sources of confusion about journaling is what exactly a
>  journal contains. In the vast majority of journaling filesystems
>  journal only contains modifications made to filesystem metadata
>  (i.e. changes to directories, inodes, inode and block bitmaps). The
>  journal doesn't contain any user data stored in a file.

You missed the difference between logical and physical journals. You=20
really want to do a physical journal as recovery is MUCH simpler.

> * The key point of journaling mechanism is a notion of transaction.
>  A transaction is a complete set of modifications made to the on-disk
>  metadata during one operation (e.g. creating a file or directory). The
>  main role of a transaction from the high-level point of view is to be
>  atomic with respect to system failures -- i.e. a transaction happens
>  completely, or it doesn't happen at all.
>=20
> * The biggest benefit from journaling can be achieved when using it with
>  the asynchronous semantics. It means that data blocks forming one
>  transaction are not flushed to the disk as soon as the transaction is
>  completely written to the journal area. When the data is later flushed
>  from the cache, the cache manager can sort the list of blocks by their
>  disk addresses, which minimize the seek time when writing the blocks.
>  Asynchronous journaling filesystem can be several times faster than a
>  traditional synchronous FFS.
>=20
> * The technique of batching multiple transactions into a single
>  transaction is often known as *group commit*. Group commit can offer
>  significant speed advantage to a journaling filesystem because it
>  amortized the cost of writing to disk over many transactions.
>=20
> * The operations considered to be transactions are listed below:
>  - create a file;
>  - delete a file;
>  - rename a file (including deletion of the existing name);
>  - change the size of a file (growing or shrinking);
>  - ... (anything else?)

As Matt pointed out, writing the inode out.

Also, allocating and deallocating blocks.

> * When a journaled filesystem reboots, if there are any journal entries
>  that were not marked as completed, the system must *replay* the entries
>  to bring the metadata structures up-to-date. Replaying the journal
>  prevents partial updates because each journal entry is a complete,
>  self-contained transaction.
>=20
> * The biggest bottleneck of journaling mechanism is that all transactions
>  write to a single journal. A single journal effectively forces the
>  filesystem into a single-threaded model for updates. This is a serious
>  disadvantage when the concurrent semantics is required. Due to the time
>  constraints to a summer of code projects the support for multiple
>  journals will not be implemented under the scope of this project.

Huh? With group commit, you get a lot of changes out in one flush, so you=
=20
aren't single-threaded. Yes, the journal is still a semi-synchronous block=
=20
point (all transactions in a group don't return until the journal's=20
written), but "single-threaded" doesn't sound right.

Also, I doubt you'll ever want more than one journal, for two reasons.=20
First, if your journal isn't processing writes fast enough, the=20
easiest/best thing to do is to put it on a beefier disk. Put it on a RAID=
=20
10 with a small stripe depth. Or buy real transaction disks.

Second, if you have multiple journals, you add an extra synchronization=20
point and/or other complexity. The big problem is what happens if we=20
reboot and two or more journals for the same file system have uncompleted=
=20
writes for the same blocks. i.e. we wrote "transaction complete" to one of=
=20
them then wrote a new transaction for the same blocks to the other=20
journal, and the "transaction complete" markings didn't make it to disk.

The best I can think up is that you have a time stamp on the journal, and=
=20
you play the "older" write first. But even that has problems. You now have=
=20
to have the journal replayer sort operations.

> II. DELIVERABLES
>=20
> Mandatory:
>=20
> * Finishing design document (expected to deliver by the 7th July):
>   - resolve all open issues;
>   - probably make several modifications to desing according to mentors
>     (or community) feedback.
>=20
> * Implementing journaling core features (expected to deliver by the 31st
>   July):
>   - new source files (maybe "jffs.{c,h}"?) in sys/ufs/ffs/ directory
>     containing the implementations of journaling functions (described
>     later in this document);
>   - the set of patches providing modifications to existing FFS and UFS
>     layers to add hooks for journaling code;

Since you're making transactions, you will probably need code in the exact=
=20
same places that the snapshotting code has transactions. Right now, that's=
=20
in the layers above the VFS/VOP calls.

>   - the set of patches providing modification to the NetBSD
>     configuration to make it possible to use the journal as an option;
>   - maybe to provide single cummulative patch containing all the
>     modifications to make it easier to apply it to the NetBSD source
>     tree.
>=20
> * The set of unit tests (expected to deliver by 13th August):
>   - to ensure that code works correctly;
>   - to reveal the possible bugs;
>   - to investigate code behaviour under some sort of boundary cases;
>   - to analyze the filesystem performance.
>=20
> Optional:
>=20
> * Support for batching transactions:
>   - it may be a significant performance win.

Yes.

> * API Documentation:
>   - it may be helpful for the developers to understand what the
>     journaling code does and how to use it.

Yes.

> * Benchmarks:
>   - to compare the filesystem performance with different mount options:
>     with journal, with soft updates, without them.

Sounds good. One thing to do in testing vs. softdeps is to make sure you=20
test enough so that softdeps actually starts flushing transactions to=20
disk. It will cache quite a lot, so you need to do enough work to fill up=
=20
its cache.

>   - to compare the FFS performance against the performance of other
>     journaling filesystems (NB: I don't know whether it makes sense
>     because these benchmarks can't be performed only within NetBSD due
>     to absence of journaling filesystems).

I don't think this makes much sense.

> * Profiling:
>   - to explore the code bottlenecks, probably to optimize some parts of
>     code;
>   - to investigate code coverage.
>=20
> III. TECHNICAL DETAILS
>=20
> * Journal internals:
>   - The journal area (or log area) used to write journal entries is a
>     fixed data allocated at filesystem initialization. The filesystem
>     superblock  must maintain a reference to the journal area which also

Check with Darrin Jewell (dbj@) about how exactly to do this in a
backwards-compatible manner.

Also, it would be nice to be able to retrofit journals into existing file=
=20
systems.

>     contains its own superblock where some sort of necessary information
>     is stored. Two indices (start_index and end_index) that point to the
>     start and end of the active area of the journal that is used in
>     circular fashion, simply mark the bounds of the journal that contain
>     active transactions.
>=20
> +-------------+------------------------+------------------------+------+
> |  Journal    |                        |                        |      |
> | superblock  | t r a n s a c t i o n  | t r a n s a c t i o n  |      |
> |+-----------+|+-------+ +-------+     |+-------+ +-------+     |      |
> ||start_index|||       | |       |     ||       | |       |     |      |
> ||end_index  |||       | |       | ... ||       | |       | ... | .... |
> || ...       |||       | |       |     ||       | |       |     |      |
> |+-----------+|+-------+ +-------+     |+-------+ +-------+     |      |
> +-------------+------------------------+------------------------+------+
>  Figure 1: Journal area on-disk representation
>=20
> * Journaling API:
> The following ideas inspired from the BeFS textbook (see [2]). Although,
> there are only 3 functions for journal management, it may be enough for
> the rest part of filesystem to interact with journaling code.
>=20
> - jffs_start_transaction():
>   o acquire the journal semaphore, holding it under the transactions
>     completes;
>   o ensure that there is enough space available in the journal to hold
>     this transaction and in case there is - make some preparation
>     actions and allocate the necessary transaction structures;
>     otherwise, to force flushing blocks out of the cache, preferably
>     those that were part of previous transactions;
>   o set the state of transaction to *running* allowing the filesystem
>     code to add new blocks to form the transaction structure.
>=20
> - jffs_write_blocks():
>   o during a transaction any code that modifies a block of the
>     filesystem metadata must call this function on the modified data;
>   o for the sake of performance it may be possible to modify only the
>     in-memory journal structures and later flush them to the log.
>=20
> - jffs_end_transaction():
>   o at first this function turns a transaction into the *locked* state,
>     meaning that no more block can be added to the transaction;
>   o write all in-memory transaction blocks to their appropriate places
>     into the journal area. When the last block is written to the
>     journal, the transaction is considered to be *finished*;
>   o set the callback function that will change the transaction state to
>     *completed* as soon as the journal entry will be completely flushed
>     to disk;
>   o release the journal semaphore.
>=20
> * Journaling constraints on the cache subsystem:
>   - journaling code must be able to lock disk blocks in the cache to
>     prevent them from being flushed.
>   - journaling code must know when a disk block is flushed to disk. It
>     may be achived with callback functions if cache subsystem supports
>     them. When the last block forming the transaction is flushed to
>     disk, the transaction considered to be completed.
>=20
> IV. OPEN ISSUES
>=20
> * Journal location: The journal can reside on the same device as the
>   rest part of file system does, but putting the journal on the
>   different device may be a performance win. E.g., Seltzer paper [3]
>   describes the approach when the journal is managed by write-ahead
>   filesystem (wafs). It may require some additional time to implement
>   the similar functionality.

Leave the journal location as a changeable thing. At the end of a file=20
system is convenient, but some configurations may need it on a separate=20
device.

> * The maximum size of a transaction is to be determined.
>   The jffs_start_transaction() function uses this information to ensure
>   that there is enough space available to hold the new transaction.

You will never have enough space. So pick something sensible, like 64M or=
=20
256M, and adapt commands that would overflow the transaction to operate in=
=20
stages.

At present the only operation I know of that will have this issue is=20
truncating a large (say multi-TB) file.=20

> * The functionality for recovering after the crashes (journal replaying)
>   need to be clearly defined and added to this document.

fsck should note the info in the super block and replay the journal.

> * Probably the list of transactions must be extended.

It will. :-)

Take care,

Bill

--LZvS9be/3tNcYl/X
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (NetBSD)

iD8DBQFEqVECWz+3JHUci9cRAk3JAJ92wDFiRsTVycWCAzLoaOBAQts/VACdGeXk
EOLsNL035d5ManC68iCZAsY=
=BnCX
-----END PGP SIGNATURE-----

--LZvS9be/3tNcYl/X--