tech-kern: Re: FFS journal

Subject: Re: FFS journal
To: Pawel Jakub Dawidek <pjd@FreeBSD.org>
From: Bill Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 07/10/2006 15:05:26

On Mon, Jul 10, 2006 at 10:50:17PM +0200, Pawel Jakub Dawidek wrote:
> On Mon, Jul 10, 2006 at 01:16:38PM -0700, Bill Studenmund wrote:
> > On Sun, Jul 09, 2006 at 11:58:33AM +0200, Pawel Jakub Dawidek wrote:
> > The concern I have with something like this is that you're adding new cg
> > and fs_ values. The problem I see with this is that AFAICT ffs doesn't
> > handle versioning very well. I'd rather we not add new fields if we can't
> > tell what fields are in use. :-|
>
> That's not a problem. Those fields are always set to 0 by older
> newfs(8), which also doesn't know how to set FS_GJOURNAL flag.
> On an older system, fsck will just check entire file system (and will
> ignore those new fields).
>
> On newer systems you can use older file systems without problems also,
> because there is no FS_GJOURNAL flag set.

You're only thinking of the ffs in FreeBSD. For just FreeBSD, what you
describe is fine. However it'd be nice to finally fix the versioning
issues.

> > Put another way, chances are that each cg will have no unlinked files, so
> > a method that won't need us to read each cg will perform better.
>
> If there are no orphaned files, fs_unref will be set to 0 and fsck will
> finish immediately.

True, but if there are no orphaned files, it doesn't matter what we do. :-)

> I know this will take more time than 'rm -rf .deleted', but .deleted is
> more tricky. You need to add many special cases to be sure that an
> object cannot be moved back to the file system from .deleted directory,
> to be sure that an object cannot be opened, to be sure that you cannot
> create a file in deleted directory, etc.

Is this really difficult? Once we add something so that you can't do a
path lookup to this directory, do we really have that much work left?
While I agree there are cases to look into, I think we need to catch them
for NFS (a client could have a file handle for the detached object and
could send in requests). So we need the checks regardless of using
".deleted" or the counts.

Also, since we're talking about extending the superblock, we can just
shove the inode of the directory that contains the orphans into it and
then we have no path lookup options at all.
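
As a rough illustration of the idea above, here is a toy in-memory
model in which the superblock records the inode number of a hidden
orphan directory. Everything in it (the ToyFS class, the orphan_dir_ino
field, the value 1000) is hypothetical and is not taken from any real
FFS/UFS code; it only sketches why recovery cost follows the number of
orphans rather than the size of the disk.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFS:
    # Hypothetical superblock field: inode number of the hidden orphan
    # directory. No path leads to it; it is found via the superblock only.
    orphan_dir_ino: int = 1000
    # Toy stand-in for directory contents: directory inode -> child inodes.
    dirs: dict = field(default_factory=dict)
    # Inodes currently allocated on the toy file system.
    allocated: set = field(default_factory=set)

    def unlink_while_open(self, ino):
        """Park an unlinked-but-still-open inode in the orphan directory."""
        self.dirs.setdefault(self.orphan_dir_ino, set()).add(ino)

    def recover(self):
        """After a crash, free every parked inode; work is O(# orphans)."""
        orphans = self.dirs.pop(self.orphan_dir_ino, set())
        self.allocated -= orphans
        return sorted(orphans)
```

In this sketch, recover() never looks at cylinder groups at all; it
only touches the entries parked in the one directory named by the
superblock.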

> I did some initial tests and it takes ~10 seconds to scan all cylinder
> groups on 224GB file system.  If you increase block size from 16kB to
> 32kB which is often the case for large file systems, it will take ~3.5
> seconds.
> And remember, scanning all cylinder groups is rather very rare case.

You're right, for a few-hundred GB disk, it won't matter which way we go.

Have you tried this for a 4 TB FS on a SAN? Extrapolating your numbers,
that's 70 seconds. While that's a lot better than hours for an fsck, that
can be quite a bit longer than roll-forward for the journal. That strikes
me as sub-optimal.

Think about how this will work for a 40 TB array...
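
The extrapolation above can be sketched as a simple linear scaling of
the quoted measurements (about 10 seconds with 16kB blocks and about
3.5 seconds with 32kB blocks on a 224GB file system). The function name
is made up for illustration, and the sketch assumes that scan time
grows in direct proportion to file system size and that 1 TB is
1024 GB; a real array may scale differently.

```python
def extrapolate_scan_seconds(measured_seconds, measured_gb, target_gb):
    """Scale a measured full cylinder-group scan time linearly with size.

    Assumption: scan time is proportional to file system size.
    """
    return measured_seconds * target_gb / measured_gb


GB_PER_TB = 1024  # assumption: binary terabytes

# Measurements quoted in the thread, both taken on a 224GB file system.
for label, seconds in (("16kB blocks", 10.0), ("32kB blocks", 3.5)):
    for size_tb in (4, 40):
        estimate = extrapolate_scan_seconds(seconds, 224, size_tb * GB_PER_TB)
        print(f"{label}, {size_tb} TB: about {estimate:.0f} seconds")
```

Under these assumptions the 32kB figure works out to 64 seconds for
4 TB, roughly in line with the 70 seconds mentioned above, and to about
640 seconds for 40 TB. The 16kB figure gives about 183 seconds for
4 TB.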

Seeking is the one part of disk technology that hasn't really changed, and
which really isn't improving. As disks keep getting bigger and bigger, this
will get more and more painful.

It strikes me that anything that goes as O(# files) is much better than
something that goes as O(# disk blocks), which is how #CG will scale.
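
To make the scaling comparison concrete, here is a minimal sketch that
counts reads under two simplifying assumptions: a full scan costs one
read per cylinder group, and an orphan directory costs at most one read
per orphaned file. The function names and the 128 MiB cylinder group
size are made up for illustration; real cg sizes depend on newfs
parameters.

```python
MIB = 2**20
GIB = 2**30


def cg_scan_reads(fs_bytes, cg_bytes):
    """Reads needed to visit every cylinder group once (grows with disk size)."""
    return -(-fs_bytes // cg_bytes)  # ceiling division


def orphan_dir_reads(n_orphans):
    """Reads needed to visit each orphan once (independent of disk size)."""
    return n_orphans


# Hypothetical 128 MiB cylinder groups; the sizes follow the thread.
for size_gib in (224, 4 * 1024, 40 * 1024):
    print(size_gib, "GiB:", cg_scan_reads(size_gib * GIB, 128 * MIB), "cg reads")
```

The cylinder group count grows by the same factor as the disk, while
the orphan count depends only on how many unlinked files were still
open at the time of the crash.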

> I think it is acceptable and allows to avoid all those nasty VFS/UFS
> hacks.

What nasty hacks? I've not seen anything that needs to happen at the VFS
layer, and the only thing we might escape at the UFS layer is "you can't
look up '/.deleted'". Did I miss something?

Take care,

Bill
