Subject: Re: FFS journal
To: Bill Studenmund <>
From: Pawel Jakub Dawidek <>
List: tech-kern
Date: 07/10/2006 22:50:17

On Mon, Jul 10, 2006 at 01:16:38PM -0700, Bill Studenmund wrote:
> On Sun, Jul 09, 2006 at 11:58:33AM +0200, Pawel Jakub Dawidek wrote:
> > On Mon, Jul 03, 2006 at 06:47:43PM +0200, Pavel Cahyna wrote:
> > > Hi,
> > >
> > > On Sun, Jul 02, 2006 at 07:59:50PM +0400, Kirill Kuvaldin wrote:
> > > If an application unlinks a file which is opened, the file is not deleted
> > > until it is closed; until then it exists as unnamed. Now if the system
> > > crashes after the unlink and before the close, the unnamed file is not
> > > deleted and remains in the filesystem, taking up space. This is not a
> > > problem in a non-journalling scenario, because after a crash fsck is run
> > > and takes care of it. But a journalling filesystem should take this into
> > > account.
> >
> > Maybe you guys will find my experience helpful. I'm working on gjournal
> > (a block level journaling) for FreeBSD and I needed to solve this
> > problem as well.
> >
> > My first solution to the problem was a magic .deleted/ directory, which
> > was created on mount time. Now, when an object (file or directory) was
> > removed, but still open, it wasn't really removed, but moved to
> > .deleted/ directory. On close the object is removed from this directory.
> > You need to ensure that such file/directory cannot be moved back to the
> > file system. On system crash or a power failure all you need to do is to
> > 'rm -rf .deleted' directory.
> > It worked without problems, but it wasn't really nice, so I implemented
> > another thing...
> I actually think it's a good way to go. Let's all agree on how to find
> this directory and just use it.
> > When an object is removed, but still open, I increase two counters:
> > 1. fs_unref - total number of unreferenced inodes in the file system
> >    (stored in file system's super-block).
> > 2. cg_unref - total number of unreferenced inodes in this cylinder
> >    group.
> > After a system crash or a power failure, I run a faster fsck version,
> > which scans only cylinder groups, looking for cg_unref > 0. If it finds
> > such a cylinder group, it scans all its inodes looking for those with
> > linkcnt == 0. Then, it just frees all their blocks and marks them as
> > unallocated. Of course, because of the global fs_unref counter we don't
> > have to scan the whole file system, but quit scanning if fs_unref goes
> > to 0.
> The concern I have with something like this is that you're adding new cg_
> and fs_ values. The problem I see with this is that AFAICT ffs doesn't
> handle versioning very well. I'd rather we not add new fields if we can't
> tell what fields are in use. :-|

That's not a problem. Those fields are always set to 0 by older
newfs(8), which also doesn't know how to set the FS_GJOURNAL flag.
On an older system, fsck will just check the entire file system (and will
ignore those new fields).

On newer systems you can also use older file systems without problems,
because the FS_GJOURNAL flag is not set on them.
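
To illustrate the compatibility argument, here is a minimal sketch in C.
The struct, the flag value and the function name are all made up for the
example (the real superblock layout and the real FS_GJOURNAL bit differ);
it only shows that the new counter is trusted when, and only when, the
flag is present:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit; the real FS_GJOURNAL value may differ. */
#define FS_GJOURNAL_SKETCH 0x40u

/* Simplified stand-in for the superblock, not the real struct fs. */
struct sb_sketch {
	uint32_t fs_flags;	/* feature flags */
	int64_t  fs_unref;	/* unreferenced inodes; 0 if made by an older newfs(8) */
};

/*
 * The fast check may rely on fs_unref only if the file system was
 * created with the flag set. Otherwise fall back to a full fsck.
 */
static bool
can_use_fast_fsck(const struct sb_sketch *sb)
{
	assert(sb != NULL);
	return (sb->fs_flags & FS_GJOURNAL_SKETCH) != 0;
}
```

A file system made by an older newfs(8) has the flag clear, so this
returns false and the full check runs, no matter what the counter holds.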

> Also, I much prefer the hidden directory idea as it directly indicates
> what needs cleaning. If I have two unlinked files, I'd rather not read
> half or 70% of the CGs to find the files to clean up. Think about life on
> a multi-TB file system, and remember that each cg read will probably
> trigger a seek, which takes time on the order of ms.
> Put another way, chances are that each cg will have no unlinked files, so
> a method that won't need us to read each cg will perform better.

If there are no orphaned files, fs_unref will be set to 0 and fsck will
finish immediately.
I know this will take more time than 'rm -rf .deleted', but .deleted is
more tricky. You need to add many special cases to be sure that an
object cannot be moved back to the file system from the .deleted
directory, to be sure that such an object cannot be opened, to be sure
that you cannot create a file in the .deleted directory, etc.
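
To give a rough idea of what those special cases look like, here is a
small sketch in C. The operation codes and the function are hypothetical
and do not correspond to real VFS operations; it covers only the three
cases listed above, and a real implementation would need more:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical operation codes; these are not real VFS operations. */
enum vop_sketch {
	VOP_SK_RENAME,
	VOP_SK_OPEN,
	VOP_SK_CREATE
};

/*
 * Returns true if a user-initiated operation may proceed.
 * src_in_deleted: the object acted on lives in .deleted/
 * dst_in_deleted: the target directory is .deleted/ (rename and create)
 */
static bool
deleted_dir_permits(enum vop_sketch op, bool src_in_deleted,
    bool dst_in_deleted)
{
	switch (op) {
	case VOP_SK_RENAME:
		/* Nothing moves out of .deleted/, and users move nothing in. */
		return !src_in_deleted && !dst_in_deleted;
	case VOP_SK_OPEN:
		/* An already removed object must not be opened again. */
		return !src_in_deleted;
	case VOP_SK_CREATE:
		/* No new files inside .deleted/. */
		return !dst_in_deleted;
	default:
		assert(0 && "unknown operation");
		return false;
	}
}
```

Every one of these checks has to be wired into a different code path,
which is what makes the directory approach more fragile.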

I did some initial tests and it takes ~10 seconds to scan all cylinder
groups on a 224GB file system.  If you increase the block size from 16kB
to 32kB, which is often the case for large file systems, it will take
~3.5 seconds.
And remember, scanning all cylinder groups is a rather rare case.
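
For reference, the scan itself can be sketched in C roughly like this.
It is a simplified in-memory model, not the real fsck code: the structs,
INODES_PER_CG and fast_fsck() are invented for the example, and I assume
here that both counters are decremented as each orphaned inode is freed:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INODES_PER_CG 4		/* tiny value, for illustration only */

struct inode_sketch {
	int allocated;		/* nonzero if the inode is in use */
	int linkcnt;		/* number of directory references */
};

struct cg_sketch {
	int32_t cg_unref;	/* unreferenced inodes in this group */
	struct inode_sketch inodes[INODES_PER_CG];
};

struct fs_sketch {
	int64_t fs_unref;	/* unreferenced inodes in the whole fs */
	size_t ncg;		/* number of cylinder groups */
	struct cg_sketch *cgs;
};

/* Returns the number of cylinder groups whose inodes were scanned. */
static size_t
fast_fsck(struct fs_sketch *fs)
{
	size_t scanned = 0;

	assert(fs != NULL);
	/* Stop as soon as the global counter reaches 0. */
	for (size_t c = 0; c < fs->ncg && fs->fs_unref > 0; c++) {
		struct cg_sketch *cg = &fs->cgs[c];

		if (cg->cg_unref == 0)
			continue;	/* nothing orphaned in this group */
		scanned++;
		for (size_t i = 0; i < INODES_PER_CG; i++) {
			struct inode_sketch *ip = &cg->inodes[i];

			if (ip->allocated && ip->linkcnt == 0) {
				/* Real code would free the blocks here too. */
				ip->allocated = 0;
				cg->cg_unref--;
				fs->fs_unref--;
			}
		}
	}
	return scanned;
}
```

With fs_unref equal to 0 the loop body never runs, which is the common
case after a crash.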

I think it is acceptable and allows to avoid all those nasty VFS/UFS

Pawel Jakub Dawidek                              
FreeBSD committer                         Am I Evil? Yes, I Am!
