Subject: Porting Hammerfs (fwd)
To: None <tech-kern@NetBSD.org>
From: Mark Weinem <email@example.com>
Date: 12/09/2007 20:44:01
Matthew Dillon on the state of the HAMMER filesystem and on
porting it to other systems:
---------- Forwarded message ----------
:> I'm not on the netbsd list, someone forwarded this to me, but yes,
:> HAMMER is still under very heavy development and will probably not
:> be considered 'complete' until mid next year.
:> It is expected to reach alpha or early beta quality by our 2.0 release
:> in mid-January 2008. All the primary OS interfaces are in place at
:> this time, but large chunks of the filesystem's backend must still
:> be written.
Yes, please feel free to forward that, and this.
With regards to the OS interface, HAMMER uses 16K filesystem buffers
and it shouldn't be too hard to port that part of it. DragonFly uses
64 bit byte offsets for its buffer cache buffers rather than block
numbers. There is one big gotcha, though, and that is I am using an
augmented softupdates callback API to allow buffer cache buffers to be
passively associated with HAMMER's in-memory structures. When the OS
wants to flush or discard/reuse a buffer, it makes a callback and
HAMMER then has the ability to tell the OS NOT to do that by setting
B_LOCKED in the bp. HAMMER can also decide to disassociate the bp from
internal HAMMER structures and let the OS proceed. In this way all
of HAMMER's in-memory tracking structures can passively cache pointers
into buffer cache data buffers plus also leech off of the OS's buffer
cache management system instead of rolling its own.
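
The callback arrangement described above can be modelled in user space
roughly as follows. This is a toy sketch, not DragonFly's kernel API:
struct toy_buf, TOY_B_LOCKED, toy_fs_attach(), toy_fs_reclaim() and
toy_cache_try_reclaim() are hypothetical names invented for illustration.
Only the idea comes from the text: the OS calls back before it flushes or
reuses a buffer, and the filesystem either sets a locked flag to veto the
operation or drops its cached pointer and lets the OS proceed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_B_LOCKED 0x1u   /* stands in for B_LOCKED; the value is arbitrary */

struct toy_buf;

/* Callback the "OS" invokes before it flushes or reuses a buffer. */
typedef void (*toy_reclaim_cb)(struct toy_buf *bp);

struct toy_buf {
    int64_t        byte_offset; /* 64-bit byte offset, not a block number */
    unsigned       flags;
    void          *fs_private;  /* filesystem structure passively attached */
    toy_reclaim_cb on_reclaim;
    char           data[16384]; /* 16K filesystem buffer */
};

/* Hypothetical in-memory filesystem structure caching a pointer into bp. */
struct toy_fs_node {
    struct toy_buf *bp;     /* cached pointer into the buffer cache */
    int             in_use; /* nonzero while the node still needs bp->data */
};

/* Filesystem side: veto the reclaim, or disassociate and let it proceed. */
static void toy_fs_reclaim(struct toy_buf *bp)
{
    struct toy_fs_node *node = bp->fs_private;

    if (node == NULL)
        return;
    if (node->in_use) {
        bp->flags |= TOY_B_LOCKED;  /* tell the OS NOT to reuse the buffer */
    } else {
        node->bp = NULL;            /* drop the cached pointer */
        bp->fs_private = NULL;
    }
}

/* Filesystem side: passively associate a node with a buffer. */
static void toy_fs_attach(struct toy_fs_node *node, struct toy_buf *bp)
{
    assert(node != NULL && bp != NULL);
    node->bp = bp;
    bp->fs_private = node;
    bp->on_reclaim = toy_fs_reclaim;
}

/* OS side: returns 1 if the buffer may be reused, 0 if the owner vetoed.
 * Clearing the flag before each attempt is a simplification of this model. */
static int toy_cache_try_reclaim(struct toy_buf *bp)
{
    bp->flags &= ~TOY_B_LOCKED;
    if (bp->on_reclaim != NULL)
        bp->on_reclaim(bp);
    return (bp->flags & TOY_B_LOCKED) ? 0 : 1;
}
```

In this model the filesystem never copies buffer contents or keeps its own
cache; it only holds a pointer that stays valid until it agrees to give the
buffer up.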
The VOP interface is a different story. DragonFly uses a direct
OS-level namecache locking model to resolve path components so, for
example, instead of having VOP_LOOKUP which passes a locked directory vp
and a name component we have VOP_NRESOLVE which passes an unlocked
directory vp and a locked namecache handle. MKDIR, RENAME, REMOVE, and
so forth work the same way. The DFly APIs are greatly simplified over
the traditional BSD VOP APIs, though, so porting would not be too difficult.
We also have a VOP_NLOOKUPDOTDOT now and filesystem code is no longer
responsible for resolving "." or ".." (beyond the addition of that VOP).
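
To make the difference in calling convention concrete, here is a small
user-space sketch of an NRESOLVE-style lookup. It is not the DragonFly
interface: struct toy_dir, struct toy_nchandle and toy_nresolve() are
hypothetical stand-ins, and recording a miss in the handle is an assumption
of this model. What it is meant to show is only the shape described above:
the directory is passed in unlocked, the namecache handle is already locked
by the caller, and the filesystem never sees "." or "..".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_vnode {
    int ino;
};

struct toy_dirent {
    const char       *name;
    struct toy_vnode *vp;
};

/* Directory vnode: handed to the filesystem unlocked in this model. */
struct toy_dir {
    struct toy_vnode   self;
    struct toy_dirent *entries;
    size_t             nentries;
};

/* Namecache handle: the caller (the "OS") locks it before the call. */
struct toy_nchandle {
    const char       *name;     /* one path component, never "." or ".." */
    int               locked;   /* held by the caller for the whole call */
    int               resolved; /* set once a result has been recorded */
    struct toy_vnode *vp;       /* NULL records a miss in this model */
};

/* Resolve nch->name inside dvp and record the result in the handle.
 * Returns 0 on a hit and -1 on a miss. */
static int toy_nresolve(const struct toy_dir *dvp, struct toy_nchandle *nch)
{
    size_t i;

    assert(nch->locked);        /* locking is the caller's job, not ours */
    nch->vp = NULL;
    for (i = 0; i < dvp->nentries; i++) {
        if (strcmp(dvp->entries[i].name, nch->name) == 0) {
            nch->vp = dvp->entries[i].vp;
            break;
        }
    }
    nch->resolved = 1;
    return nch->vp != NULL ? 0 : -1;
}
```

The filesystem routine takes no directory lock and does no special-casing of
dot entries, which is the simplification being claimed over a
VOP_LOOKUP-style interface.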
Current state of HAMMER: HAMMER breaks the filesystem storage up
into 64MB 'clusters'. Each cluster maintains a portion of the
filesystem-wide B-Tree. The code has progressed to the point where
I can do most filesystem ops, including historical as-of accesses,
within a single cluster and I am now working on the 'spike' code which
is basically responsible for glueing the B-Tree between clusters together.
* Spike code for glueing multiple 64MB clusters in the filesystem-wide
B-Tree.
* Balancing code. It's easy to spike new clusters in for expansion but
once the filesystem starts to get full balancing code is required
to free up whole clusters as space is recovered.
HAMMER is designed for very large filesystems and consequently optimal
operation will depend on there being enough free space to absorb short
term (e.g. ~12 hour) inefficiencies in space utilization.
* Recovery code. B-Trees are per-cluster and can be reconstructed
by scanning the cluster's record array, allowing all B-Tree ops to
be asynchronous and for recovery to occur live on a cluster-by-cluster
basis. But the recovery code itself still needs to be written and
there are some performance issues in the design which I have to work
out (including possibly increasing the maximum cluster size to
something much larger than 64MB so I can have a synchronous 'unsynced'
flag in the cluster header).
* Mirroring and backup streaming support.
* Historical retention policy support and vacuuming code. e.g. to be
able to say 'retain history on 60 second boundaries for 60
minutes, then 30 minute boundaries for 24 hours, then 12 hour
boundaries for a week, then ....'.
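
The retention example in the last item can be written down as a small
table of tiers. The sketch below is only an illustration of that example,
not HAMMER code: struct retention_tier, example_policy,
retention_boundary() and same_retention_slot() are hypothetical names, and
since the quoted policy trails off after one week, anything older is
reported as "no tier" (0) rather than guessed at.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct retention_tier {
    uint64_t max_age;   /* tier covers history younger than this (seconds) */
    uint64_t boundary;  /* keep one version per this many seconds */
};

/* 60 second boundaries for 60 minutes, 30 minute boundaries for 24 hours,
 * 12 hour boundaries for a week. */
static const struct retention_tier example_policy[] = {
    { 60u * 60u,            60u },
    { 24u * 60u * 60u,      30u * 60u },
    { 7u * 24u * 60u * 60u, 12u * 60u * 60u },
};

/* Return the boundary (in seconds) that applies to history of the given
 * age, or 0 if the age is beyond every tier in the policy. */
static uint64_t retention_boundary(const struct retention_tier *policy,
                                   size_t ntiers, uint64_t age)
{
    size_t i;

    for (i = 0; i < ntiers; i++) {
        if (age < policy[i].max_age)
            return policy[i].boundary;
    }
    return 0;
}

/* Two timestamps share a retention slot when they round down to the same
 * boundary; a vacuuming pass could then keep one version per slot. */
static int same_retention_slot(uint64_t boundary, uint64_t t1, uint64_t t2)
{
    assert(boundary != 0);
    return t1 / boundary == t2 / boundary;
}
```

For example, a version that is 30 seconds old falls in the first tier (60
second boundaries), while one that is exactly an hour old has already moved
to the 30 minute tier.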
So you can see, I still have a lot of work to do.
At the moment I do not have any endian-neutral coding in place but the
FS is specifically designed to allow detection of the endian mode for
future endian-neutral operation. It's something I very much want, but
as you probably know endian-neutral coding takes a lot of time to do
properly, especially when the basic functional design is still being
coded. So I'm holding off on that aspect of the filesystem until its