Subject: Re: FFS journal
To: pavel.cahyna@st.mff.cuni.cz
From: Pawel Jakub Dawidek <pjd@FreeBSD.org>
List: tech-kern
Date: 07/11/2006 16:25:17

On Mon, Jul 10, 2006 at 03:05:26PM -0700, Bill Studenmund wrote:
> On Mon, Jul 10, 2006 at 10:50:17PM +0200, Pawel Jakub Dawidek wrote:
> > On Mon, Jul 10, 2006 at 01:16:38PM -0700, Bill Studenmund wrote:
> > I did some initial tests and it takes ~10 seconds to scan all cylinder
> > groups on a 224GB file system.  If you increase the block size from
> > 16kB to 32kB, which is often the case for large file systems, it will
> > take ~3.5 seconds.
> > And remember, scanning all cylinder groups is a rather rare case.
>
> You're right, for a few-hundred GB disk, it won't matter which way we go.
>
> Have you tried this for a 4 TB FS on a SAN? Extrapolating your numbers,
> that's 70 seconds. While that's a lot better than hours for an fsck, that
> can be quite a bit longer than roll-forward for the journal. That strikes
> me as sub-optimal.
>
> Think about how this will work for a 40 TB array...
>
> Seeking is the one part of disk technology that hasn't really changed, and
> which really isn't growing. As disks keep getting bigger and bigger, this
> will get more and more painful.

There is a global summary info right after super-block (fs_csp), where
some statistics are stored for every CG. Currently those are:

	int32_t cs_ndir;	/* number of directories */
	int32_t cs_nbfree;	/* number of free blocks */
	int32_t cs_nifree;	/* number of free inodes */
	int32_t cs_nffree;	/* number of free frags */

Adding 'cs_nunref' here could speed things up quite a lot, as it would
eliminate the seeking problem entirely. Unfortunately, this changes the
on-disk UFS layout, which is why I didn't go that route.

> > I think it is acceptable and allows us to avoid all those nasty
> > VFS/UFS hacks.
>
> What nasty hacks? I've not seen anything that needs to happen at the VFS
> layer, and the only thing we might escape at the UFS layer is "you can't
> look up '/.deleted'". Did I miss something?

For example, I use a special VV_DELETED flag to mark a vnode which holds
such an orphaned inode, but this is needed in both cases.

Another thing is that operating on paths is tricky. For example, your
file system is mounted on /foo/bar/baz and you move orphaned objects to
/foo/bar/baz/.deleted/. How do you handle the situation when someone
renames 'bar' to 'bar2'?
How do you handle the situation when the process which removed the open
file is chrooted, and .deleted/ is not accessible from within its context?

I also needed to introduce a special garbage collector thread, which
performed the removals on last close, because unprivileged processes
cannot remove objects from the .deleted/ directory (they cannot move
objects there either, but that was handled in a different way).

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
