Subject: Re: SoC project proposal
To: haad <haaaad@gmail.com>
From: Bill Stouder-Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 03/17/2007 18:59:27

On Sat, Mar 17, 2007 at 12:19:39AM +0100, haad wrote:
> Hi
>
> I have written this project proposal for this year's summer of code. You can
> read it here
>
> http://wasabi.fiit.stuba.sk/~haad/netbsd/soc_pro_ext3.html
>
> Here is the short version: <<EOF
> General
>
> The Ext2 file system is the de facto standard Unix-like file system used on
> Linux installations. Ext2 does not have journaling capabilities, so Ext3 was
> built on top of it to add them without breaking compatibility with Ext2. Ext3 is
> now a stable journaled file system used on lots of Linux installations.
>
> NetBSD currently fully supports the Ext2 file system at the kernel level.
> Unfortunately there is no support for the new features included in Ext3,
> although Ext3 file systems can be mounted provided that their journal is clean.
> It would be very nice if NetBSD had Ext3 file system support because the system
> could immediately gain a journaled file system as well as compatibility with Linux.
>
> NetBSD as an operating system really needs a good, stable journaled file system.
> Today's disks and RAID arrays keep getting bigger, with sizes of 1TB or more.
> FFS was not designed for disks of this size. We have problems with file system
> sizes over 2TB (neither FFS nor FFS2 is suitable for this size); good ext3/ext4
> support will make these problems go away.

Note: this is not correct. While I do not question the idea that it could
be EXCRUCIATINGLY PAINFUL to use either an ffs1 or ffs2 file system for a
multi-TB file system, it can be done.

ffs1 supports 2^31 fs blocks. These are what everyone calls fragments. So
a 1k fragment size ffs can support 2 TB. A 4k fragment can support 8 TB,
and so on.
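
As a quick check of that arithmetic, here is a small Python sketch. It
assumes only the 2^31 fragment limit stated above; the function and constant
names are made up for illustration.

```python
# Back-of-the-envelope check of the ffs1 size limits quoted above.
# Assumption: ffs1 can address at most 2**31 fragments ("fs blocks").

MAX_FRAGMENTS = 2**31
TIB = 2**40  # one TB in the binary sense used above


def max_ffs1_bytes(fragment_size: int) -> int:
    """Largest file system size, in bytes, for a given fragment size."""
    return MAX_FRAGMENTS * fragment_size


if __name__ == "__main__":
    for frag in (1024, 2048, 4096):
        print(f"{frag:5d}-byte fragments: {max_ffs1_bytes(frag) // TIB} TB")
```

Each doubling of the fragment size doubles the ceiling, which is why the
limit is a matter of newfs parameters rather than a hard 2 TB wall.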

Changing the block pointers to 64-bit numbers was one of the main points
of ffs2/ufs2. So many-TB support was the point.

> EXT3 file system features:
>
>     * Journaling
>     * Over 16TB file system size
>
>    1. Journaling
>
>       In a nutshell, the journal in the ext3fs sense is a regular file which
> stores whole metadata (and optionally data) blocks that have been modified,
> prior to writing them into the filesystem. This means it is possible to add a
> journal to an existing ext2 file system without the need for data conversion.
>
>       When changes are made to the filesystem (e.g. a file is renamed) they are
> stored in a transaction in the journal and can either be complete or incomplete
> at the time of a crash. If a transaction is complete at the time of a crash (or
> in the normal case where the system does not crash), then any blocks in that
> transaction are guaranteed to represent a valid filesystem state, and are copied
> into the filesystem. If a transaction is incomplete at the time of the crash,
> then there is no guarantee of consistency for the blocks in that transaction so
> they are discarded.
>    2. Availability
>
>       Unlike ext2, ext3 does not require a file system check, even after an
> unclean system shutdown, except for certain rare hardware failure cases (e.g.
> hard drive failures). This is because the data is written to disk in such a way
> that the file system is always consistent. The time to recover an ext3 file
> system after an unclean system shutdown does not depend on the size of the file
> system or the number of files; rather, it depends on the size of the "journal"
> used to maintain consistency.
>    3. Data Integrity
>
>       Using the ext3 file system can provide stronger guarantees about data
> integrity in case of an unclean system shutdown. You choose the type and level
> of protection that your data receives. You can choose to keep the file system
> consistent, but allow for damage to data on the file system in the case of
> unclean system shutdown; this can give a modest speed up under some but not all
> circumstances. Alternatively, you can choose to ensure that the data is
> consistent with the state of the file system; this means that you will never see
> garbage data in recently-written files after a crash.
>
> Linux uses a journal block device to manage journals for its filesystems, such
> as ext3. I think that NetBSD needs something similar to Linux's JBD (Journal
> Block Device).
>
> Journal block device
>
> Linux uses JBD, the Journal Block Device, for journaling. JBD provides atomicity
> of operations. It was designed to add journaling capabilities to a block device.
> The ext3 filesystem code informs the JBD of the modifications it is performing
> (called a transaction). The journal supports transaction start and stop, and
> in case of a crash, the journal can replay the transactions to put the partition
> back in a consistent state quickly.
>
> A good journal API could also be used in our non-journaled filesystems, e.g. ffs.
> The main goal of my SoC project should be the design and implementation of a good
> journal API, followed by the implementation of ext3fs support.

Sounds good.
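
To make the commit-or-discard rule from the journaling description concrete,
here is a minimal Python sketch. Everything in it (the Journal and Transaction
classes, start/stop/replay) is a hypothetical illustration of the idea, not
the Linux JBD interface and not a proposed NetBSD API.

```python
# Hypothetical sketch of journal recovery: complete transactions are
# copied into the file system, incomplete ones are discarded.
from dataclasses import dataclass, field


@dataclass
class Transaction:
    # Maps block number -> new block contents.
    blocks: dict = field(default_factory=dict)
    committed: bool = False  # True once the transaction is complete


class Journal:
    def __init__(self) -> None:
        self.log = []  # transactions in the order they were started

    def start(self) -> Transaction:
        txn = Transaction()
        self.log.append(txn)
        return txn

    def stop(self, txn: Transaction) -> None:
        """Mark the transaction complete (stands in for a commit record)."""
        txn.committed = True

    def replay(self, disk: dict) -> None:
        """After a crash: apply complete transactions, drop the rest."""
        for txn in self.log:
            if txn.committed:
                disk.update(txn.blocks)
        self.log.clear()


if __name__ == "__main__":
    disk = {7: b"old dir entry"}
    journal = Journal()

    rename = journal.start()
    rename.blocks[7] = b"new dir entry"
    journal.stop(rename)

    half_done = journal.start()
    half_done.blocks[9] = b"partial update"
    # A crash here leaves half_done incomplete.

    journal.replay(disk)
    print(disk)
```

The point of the sketch is that recovery cost depends on the length of the
log, not on the size of the file system.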

> The JBD API is used to open, load, commit and administer journal transactions on
> a device. In Linux, JBD is defined in fs/jbd/ and include/linux/jbd.h.
>
> JBD uses these objects in its API: handle, transaction, journal.
>
>    1. A handle is a single atomic update on the filesystem: a group of
> writes/updates on disk that should be performed atomically.
>    2. Handles can be stored in groups called transactions. Only transactions are
> flushed to the journal. A transaction is atomic in nature because it consists
> only of atomic handles. While a transaction is being committed it can be in
> these states:
>          1. Running: the transaction is currently live and can accept new
> handles. In a system only one transaction can be in the running state.

I'll want to see how things develop, but this could be a bottleneck
eventually. If I understand you correctly, this means that only one thread
in the file system can update metadata (or real data) at once. I don't
like that idea. However, chances are that this is a fine assumption to
start with.

>          2. Locked: the transaction does not accept any new handles but existing
> handles are not complete. Once all the existing handles are completed, the
> transaction goes to the next state.
>          3. Flush: all the handles in a transaction are complete. The
> transaction is writing itself to the journal.
>          4. Commit: the entire transaction log has been written to the journal.
> The transaction is writing a commit block indicating that the transaction log in
> the journal is complete.
>          5. Finished: the transaction is written completely to the journal. It
> has to remain there until the blocks are updated to the actual locations on the
> disk.
>
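
The five states listed above can be sketched as a small state machine. This is
a minimal Python illustration of the lifecycle as described in the quoted
list; the class and method names are invented here and are not the Linux JBD
API.

```python
# Hypothetical sketch of the transaction lifecycle:
# Running -> Locked -> Flush -> Commit -> Finished.
from enum import Enum


class TxnState(Enum):
    RUNNING = "running"
    LOCKED = "locked"
    FLUSH = "flush"
    COMMIT = "commit"
    FINISHED = "finished"


class JournalTxn:
    def __init__(self) -> None:
        self.state = TxnState.RUNNING
        self.open_handles = 0

    def _expect(self, state: TxnState) -> None:
        if self.state is not state:
            raise RuntimeError(
                f"transaction is {self.state.value}, not {state.value}")

    def start_handle(self) -> None:
        # Only a running transaction accepts new handles.
        self._expect(TxnState.RUNNING)
        self.open_handles += 1

    def stop_handle(self) -> None:
        if self.open_handles == 0:
            raise RuntimeError("no open handle to stop")
        self.open_handles -= 1
        self._flush_if_idle()

    def lock(self) -> None:
        # Stop accepting new handles; existing ones may still finish.
        self._expect(TxnState.RUNNING)
        self.state = TxnState.LOCKED
        self._flush_if_idle()

    def _flush_if_idle(self) -> None:
        # A locked transaction moves on once its last handle completes.
        if self.state is TxnState.LOCKED and self.open_handles == 0:
            self.state = TxnState.FLUSH

    def log_written(self) -> None:
        # All blocks are in the journal; the commit block comes next.
        self._expect(TxnState.FLUSH)
        self.state = TxnState.COMMIT

    def commit_block_written(self) -> None:
        self._expect(TxnState.COMMIT)
        self.state = TxnState.FINISHED
```

In this sketch a new running transaction could be created as soon as the old
one is locked, which is one place where the single-running-transaction rule
might be relaxed later.
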
> Extending our ext2fs support
>
> Our ext2fs implementation is located in src/sys/ufs/ext2fs/. I will use this
> path unless I explicitly define another path. For Linux paths I implicitly mean
> the /usr/src/linux/fs/ext3/ path.
>
> Ext3fs SuperBlock
>
> I have to extend our super block structure defined in ext2fs.h to support the
> ext3fs journal options. In our superblock structure there is padding included
> which can be used for adding new features.
> Also struct m_ext2fs needs to have at least a new journal-mounted flag. If we
> want EXT3 ACL support, the structures struct ext3_acl_header and
> struct ext3_acl_entry are needed.

Just to be clear, we will be adding the exact same features that normal
ext3fs has, correct?

> Journal
>
> A journal is a log that internally manages updates for a single block device.
> The updates first are stored in the journal and then are reflected to their real
> locations on the disk. The area belonging to the journal is managed like a
> circular-linked list. That is, the journal reuses its area when the journal is full.
>
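
The circular reuse of the journal area can be sketched as a ring buffer. This
is a minimal Python illustration with invented names. It assumes, following
the "Finished" state above, that a slot may only be reused after its record
has been written to its real location on disk.

```python
# Hypothetical sketch of a journal area managed as a ring buffer.
class CircularJournalArea:
    def __init__(self, nblocks: int) -> None:
        self.slots = [None] * nblocks
        self.head = 0  # next slot to write
        self.tail = 0  # oldest record not yet written to its real location
        self.used = 0

    def append(self, record) -> None:
        """Store a record in the next free slot, wrapping at the end."""
        if self.used == len(self.slots):
            raise BufferError("journal full: checkpoint before reuse")
        self.slots[self.head] = record
        self.head = (self.head + 1) % len(self.slots)
        self.used += 1

    def checkpoint(self):
        """Release the oldest record so its slot can be reused."""
        if self.used == 0:
            raise IndexError("journal is empty")
        record = self.slots[self.tail]
        self.slots[self.tail] = None
        self.tail = (self.tail + 1) % len(self.slots)
        self.used -= 1
        return record


if __name__ == "__main__":
    area = CircularJournalArea(2)
    area.append("txn 1")
    area.append("txn 2")
    area.checkpoint()     # "txn 1" reaches its real location
    area.append("txn 3")  # reuses the slot "txn 1" occupied
    print(area.slots)
```
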
> User land part
>
> I have to write usable BSD-licensed mke2fs and e2fsck programs if we want to use
> the ext3 file system without additional packages from pkgsrc. Here I will also
> write a new mount_e2fs, or extend ours, to support journaling.

This is not correct. While we would PREFER a BSD-licensed set of tools, we
can use GPL'd tools if needed. I mainly mention this as the other aspects
of this project NEED to happen in the SoC time frame, while this can be
cleaned up later.

> Documentation
>
> Write good documentation about the development process so other developers can
> use it, and include it in the NetBSD internals book.
>
> EOF
>
> I'm working on this proposal now, so it's a work in progress, but I want to
> discuss this project here.

Take care,

Bill
