tech-kern: Porting Hammerfs (fwd)

Subject: Porting Hammerfs (fwd)
To: None <tech-kern@NetBSD.org>
From: Mark Weinem <mark.weinem@alumni.uni-due.de>
List: tech-kern
Date: 12/09/2007 20:44:01
Matthew Dillon on the state of the HAMMER filesystem and on porting it
to other systems:

---------- Forwarded message ----------
[...]
:>    I'm not on the netbsd list, someone forwarded this to me, but yes,
:>    HAMMER is still under very heavy development and will probably not
:>    be considered 'complete' until mid next year.
:>
:>    It is expected to reach alpha or early beta quality by our 2.0 release
:>    in mid-January 2008.  All the primary OS interfaces are in place at
:>    this time, but large chunks of the filesystem's backend must still
:>    be written.
[...]

    Yes, please feel free to forward that, and this.

    With regard to the OS interface, HAMMER uses 16K filesystem buffers
    and it shouldn't be too hard to port that part of it.  DragonFly uses
    64-bit byte offsets for its buffer cache buffers rather than block
    numbers.  There is one big gotcha, though, and that is I am using an
    augmented softupdates callback API to allow buffer cache buffers to be
    passively associated with HAMMER's in-memory structures.  When the OS
    wants to flush or discard/reuse a buffer, it makes a callback and
    HAMMER then has the ability to tell the OS NOT to do that by setting
    B_LOCKED in the bp.  HAMMER can also decide to disassociate the bp from
    internal HAMMER structures and let the OS proceed.  In this way all
    of HAMMER's in-memory tracking structures can passively cache pointers
    into buffer cache data buffers plus also leech off of the OS's buffer
    cache management system instead of rolling its own.
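The callback arrangement described above can be modelled in a few lines. This is only a toy sketch in Python, not kernel code: apart from the B_LOCKED flag name and the 16K buffer size, every name here (Buf, FsNode, reclaim_callback, os_try_reclaim) is hypothetical, and the flag value is arbitrary.

```python
# Toy model of the callback described above. Everything except the
# B_LOCKED flag name is hypothetical and chosen for illustration.
B_LOCKED = 0x1          # flag name from the text; the value is arbitrary
BUF_SIZE = 16 * 1024    # HAMMER's 16K filesystem buffers


class Buf:
    """Stand-in for a buffer cache buffer ("bp")."""

    def __init__(self, byte_offset):
        assert byte_offset % BUF_SIZE == 0
        self.offset = byte_offset   # byte offset, not a block number
        self.flags = 0
        self.owner = None           # filesystem structure caching a pointer


class FsNode:
    """Stand-in for an in-memory structure that passively caches a bp."""

    def __init__(self):
        self.bp = None
        self.in_use = False

    def attach(self, bp):
        self.bp = bp
        bp.owner = self

    def reclaim_callback(self, bp):
        """Called by the OS before it flushes or reuses bp."""
        if self.in_use:
            bp.flags |= B_LOCKED    # veto: tell the OS to leave it alone
        else:
            self.bp = None          # disassociate and let the OS proceed
            bp.owner = None


def os_try_reclaim(bp):
    """Model of the OS side: ask the owner first, then honor B_LOCKED."""
    bp.flags &= ~B_LOCKED
    if bp.owner is not None:
        bp.owner.reclaim_callback(bp)
    return not (bp.flags & B_LOCKED)   # True if the buffer was reclaimed
```

The point of the design is visible in os_try_reclaim: the OS keeps ownership of buffer lifetime and replacement policy, and the filesystem only gets a chance to say no or to drop its cached pointer.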

    The VOP interface is a different story.  DragonFly uses a direct
    OS-level namecache locking model to resolve path components so, for
    example, instead of having VOP_LOOKUP which passes a locked directory vp
    and a name component we have VOP_NRESOLVE which passes an unlocked
    directory vp and a locked namecache handle.  MKDIR, RENAME, REMOVE, and
    so forth work the same way.  The DFly APIs are greatly simplified over
    the traditional BSD VOP APIs, though, so porting would not be too difficult.
    We also have a VOP_NLOOKUPDOTDOT now and filesystem code is no longer
    responsible for resolving "." or ".." (beyond the addition of that VOP).
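The difference in locking contract between the two lookup styles can be illustrated with a small model. This is a hedged sketch in Python rather than the real VOP signatures: the Vnode and NamecacheHandle classes, their fields, and both function names are assumptions made for illustration, and only the "locked directory vp" versus "unlocked directory vp plus locked namecache handle" distinction comes from the text.

```python
import threading


class Vnode:
    """Hypothetical stand-in for a vnode; directories carry an entries map."""

    def __init__(self, name, entries=None):
        self.name = name
        self.entries = entries or {}   # directory contents: name -> Vnode
        self.lock = threading.Lock()


class NamecacheHandle:
    """Hypothetical stand-in for a namecache entry for one path component."""

    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()
        self.resolved = False
        self.vp = None   # None after resolution models a negative entry


def lookup_traditional(dvp, name):
    """VOP_LOOKUP style: the caller passes a locked directory vnode."""
    assert dvp.lock.locked(), "directory vnode must be locked by the caller"
    return dvp.entries.get(name)


def nresolve(dvp, nch):
    """VOP_NRESOLVE style: unlocked directory vnode, locked namecache handle."""
    assert nch.lock.locked(), "namecache handle must be locked by the caller"
    nch.vp = dvp.entries.get(nch.name)   # record the hit or a negative entry
    nch.resolved = True
    return nch.vp is not None
```

In this model the filesystem never takes the directory vnode lock during name resolution; serialization on a path component happens on the namecache handle instead.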

    --

    Current state of HAMMER:  HAMMER breaks the filesystem storage up
    into 64MB 'clusters'.  Each cluster maintains a portion of the
    filesystem-wide B-Tree.  The code has progressed to the point where
    I can do most filesystem ops, including historical as-of accesses,
    within a single cluster and I am now working on the 'spike' code which
    is basically responsible for glueing the B-Tree between clusters together.
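
As a rough illustration of the sizes involved, the arithmetic below splits a byte offset into a cluster index, a buffer index within that cluster, and an offset within the buffer. The 64MB and 16K figures come from the text; the assumption that clusters sit back to back starting at offset 0, with no headers or other metadata in between, is a simplification made for this sketch and says nothing about HAMMER's actual on-disk layout.

```python
CLUSTER_SIZE = 64 * 1024 * 1024   # 64MB clusters, from the text
BUF_SIZE = 16 * 1024              # 16K filesystem buffers, from the text
BUFS_PER_CLUSTER = CLUSTER_SIZE // BUF_SIZE


def locate(byte_offset):
    """Split a byte offset into (cluster, buffer in cluster, offset in buffer).

    Assumes clusters are laid out contiguously from offset 0, which is a
    simplification for illustration only.
    """
    cluster, in_cluster = divmod(byte_offset, CLUSTER_SIZE)
    buf, in_buf = divmod(in_cluster, BUF_SIZE)
    return cluster, buf, in_buf
```

With these sizes each cluster holds 4096 buffers.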

    Still TODO:

    * Spike code for glueing multiple 64MB clusters in the filesystem-wide
      B-Tree.

    * Balancing code.  It's easy to spike new clusters in for expansion but
      once the filesystem starts to get full, balancing code is required
      to free up whole clusters as space is recovered.

      HAMMER is designed for very large filesystems and consequently optimal
      operation will depend on there being enough free space to absorb short
      term (e.g. ~12 hour) inefficiencies in space utilization.

    * Recovery code.  B-Trees are per-cluster and can be reconstructed
      by scanning the cluster's record array, allowing all B-Tree ops to
      be asynchronous and for recovery to occur live on a cluster-by-cluster
      basis.  But the recovery code itself still needs to be written and
      there are some performance issues in the design which I have to work
      out (including possibly increasing the maximum cluster size to
      something much larger than 64MB so I can have a synchronous 'unsynced'
      flag in the cluster header).

    * Mirroring and backup streaming support.

    * Historical retention policy support and vacuuming code.  e.g. to be
      able to say 'retain history on 60 second boundaries for 60
      minutes, then 30 minute boundaries for 24 hours, then 12 hour
      boundaries for a week, then ....'.
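
The retention example in the last item can be written down as a small table-driven policy. This is a sketch of one possible reading, not the planned HAMMER implementation: the tier table mirrors the three tiers quoted above, while the function names, the choice to keep the newest version in each boundary-aligned bucket, and the decision to leave versions older than the last listed tier untouched (the text elides what comes after a week) are all assumptions.

```python
MINUTE, HOUR, DAY = 60, 3600, 86400

# (maximum age in seconds, boundary in seconds), youngest tier first.
POLICY = [
    (60 * MINUTE, 60),           # 60 second boundaries for 60 minutes
    (24 * HOUR, 30 * MINUTE),    # 30 minute boundaries for 24 hours
    (7 * DAY, 12 * HOUR),        # 12 hour boundaries for a week
]


def boundary_for(age):
    """Return the retention boundary for an age, or None past the last tier."""
    for max_age, boundary in POLICY:
        if age <= max_age:
            return boundary
    return None


def vacuum(timestamps, now):
    """Return the timestamps that survive, newest one per bucket."""
    kept = {}
    for ts in sorted(timestamps):
        boundary = boundary_for(now - ts)
        if boundary is None:
            # The text elides the remaining tiers, so this sketch
            # leaves older versions alone (an assumption).
            kept[("unbounded", ts)] = ts
        else:
            # A later timestamp in the same bucket replaces an earlier one.
            kept[(boundary, ts // boundary)] = ts
    return sorted(kept.values())
```

For example, two versions written 10 and 20 seconds ago fall into the same 60 second bucket, so only the newer one is retained.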

    So you can see, I still have a lot of work to do.
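
For the recovery item in the list above, the core idea (an index that can always be rebuilt by scanning a cluster's record array) can be sketched as follows. This is only an illustration: a sorted Python list stands in for the per-cluster B-Tree, the record layout of (key, payload) tuples with None for empty slots is hypothetical, and nothing here reflects HAMMER's real record format.

```python
import bisect


def rebuild_index(records):
    """Rebuild a sorted key index from a cluster's record array.

    `records` is a list of (key, payload) tuples in on-disk order, with
    None marking an empty slot. The result is a sorted list of
    (key, slot) pairs standing in for the per-cluster B-Tree.
    """
    index = [(rec[0], slot) for slot, rec in enumerate(records)
             if rec is not None]
    index.sort()
    return index


def lookup(index, records, key):
    """Return the payload for key via the rebuilt index, or None."""
    # Slots are never negative, so (key, -1) sorts before any real entry.
    pos = bisect.bisect_left(index, (key, -1))
    if pos < len(index) and index[pos][0] == key:
        return records[index[pos][1]][1]
    return None
```

Because the index is derived entirely from the record array, losing or corrupting it costs a rescan of one cluster rather than the whole filesystem, which is what makes asynchronous index updates tolerable in this model.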

    At the moment I do not have any endian-neutral coding in place but the
    FS is specifically designed to allow detection of the endian mode for
    future endian-neutral operation.  It's something I very much want, but
    as you probably know, endian-neutral coding takes a lot of time to do
    properly, especially when the basic functional design is still being
    coded.  So I'm holding off on that aspect of the filesystem until it's
    more complete.
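
One common way to make the endian mode detectable is to store a known magic number in a header and see which byte order reproduces it on read. The sketch below shows that general technique under stated assumptions: the magic value, the header layout, and the function names are hypothetical, and the text does not say which mechanism HAMMER actually uses.

```python
import struct

MAGIC = 0x48414D52   # hypothetical header magic, not HAMMER's real value


def detect_endian(header):
    """Return '<' or '>' for the byte order the header was written in."""
    for order in ("<", ">"):
        (value,) = struct.unpack_from(order + "I", header, 0)
        if value == MAGIC:
            return order
    raise ValueError("unrecognized magic")


def read_u64(header, offset, order):
    """Read a 64-bit unsigned field using the detected byte order."""
    (value,) = struct.unpack_from(order + "Q", header, offset)
    return value
```

The magic must not read the same in both byte orders, otherwise detection is ambiguous; every later multi-byte field is then decoded with the detected order.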

					-Matt
					Matthew Dillon
					<dillon@backplane.com>