Subject: Re: SoC project proposal
To: Bill Stouder-Studenmund <wrstuden@netbsd.org>
From: haad <haaaad@gmail.com>
List: tech-kern
Date: 03/18/2007 13:10:21
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Bill Stouder-Studenmund wrote:
> On Sat, Mar 17, 2007 at 12:19:39AM +0100, haad wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> Hi
>>
>> I have written this project proposal for this year's summer of code. You can
>> read it here
>>
>> http://wasabi.fiit.stuba.sk/~haad/netbsd/soc_pro_ext3.html
>>
>>
>> Here is short version: <<EOF
>> General
>>
>> The Ext2 file system is the de-facto standard, Unix-like file system used on
>> Linux installations. Ext2 does not have journaling capabilities, so Ext3 was
>> built on top of it to add them without breaking compatibility with Ext2. Ext3 is
>> now a stable journaled file system used on lots of Linux installations.
>>
>> NetBSD currently fully supports the Ext2 file system at the kernel level.
>> Unfortunately there is no support for the new features included in Ext3,
>> although Ext3 file systems can be mounted provided that their journal is clean.
>> It would be very nice if NetBSD had Ext3 file system support because the system
>> could immediately gain a journaled file system as well as compatibility with Linux.
>>
>> NetBSD as an operating system really needs a good, stable journaling file
>> system. Today's disks and RAID arrays keep growing, with sizes of 1TB or more.
>> FFS was not designed for disks of this size. We have problems with file system
>> sizes over 2TB (neither FFS nor FFS2 is suitable for this size); good ext3/ext4
>> support would make these problems go away.
> 
> Note: this is not correct. While I do not question the idea that it could 
> be EXCRUCIATINGLY PAINFUL to use either an ffs1 or ffs2 file system for a 
> multi-TB file system, it can be done.
> 
> ffs1 supports 2^31 fs blocks. These are what everyone calls fragments. So
> a 1k fragment size ffs can support 2 TB. A 4k fragment can support 8 TB,
> and so on.
> 
> Changing the block pointers to 64-bit numbers was one of the main points 
> of ffs2/ufs2. So many-TB support was the point.
> 
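To make the ffs1 arithmetic above concrete, here is a small Python sketch. It is only an illustration, not NetBSD code: the function names are made up, and it assumes binary units (1 TB taken as 2^40 bytes) and that the 2^31 fragment limit is the only constraint.

```python
# Illustrative only: upper bound on ffs1 size from the 2^31 fragment limit.
FFS1_MAX_FRAGMENTS = 2 ** 31  # figure quoted in the reply above


def ffs1_max_bytes(frag_size):
    """Largest file system in bytes for a fragment size given in bytes."""
    return FFS1_MAX_FRAGMENTS * frag_size


def to_tib(nbytes):
    """Convert a byte count to TiB (2^40 bytes)."""
    return nbytes / 2 ** 40


if __name__ == "__main__":
    for frag in (1024, 2048, 4096):
        print(frag, to_tib(ffs1_max_bytes(frag)))
```

So the limit scales linearly with the fragment size, which matches the 2 TB and 8 TB figures given for 1k and 4k fragments.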
>> EXT3 file system features:
>>
>>     * Journaling
>>     * Over 16TB file system size
>>
>>    1.
>>
>>       Journaling
>>
>>       In a nutshell, the journal in the ext3fs sense is a regular file which
>> stores whole metadata (and optionally data) blocks that have been modified,
>> prior to writing them into the filesystem. This means it is possible to add a
>> journal to an existing ext2 file system without the need for data conversion.
>>
>>       When changes are made to the filesystem (e.g. a file is renamed), they are
>> stored in a transaction in the journal, which can be either complete or
>> incomplete at the time of a crash. If a transaction is complete at the time of
>> a crash (or in the normal case where the system does not crash), then any
>> blocks in that transaction are guaranteed to represent a valid filesystem
>> state, and are copied into the filesystem. If a transaction is incomplete at
>> the time of the crash, then there is no guarantee of consistency for the blocks
>> in that transaction, so they are discarded.
>>    2.
>>
>>       Availability
>>
>>       Unlike ext2, ext3 does not require a file system check, even after an
>> unclean system shutdown, except for certain rare hardware failure cases (e.g.
>> hard drive failures). This is because the data is written to disk in such a way
>> that the file system is always consistent. The time to recover an ext3 file
>> system after an unclean system shutdown does not depend on the size of the file
>> system or the number of files; rather, it depends on the size of the "journal"
>> used to maintain consistency.
>>    3.
>>
>>       Data Integrity
>>
>>       Using the ext3 file system can provide stronger guarantees about data
>> integrity in case of an unclean system shutdown. You choose the type and level
>> of protection that your data receives. You can choose to keep the file system
>> consistent, but allow for damage to data on the file system in the case of
>> unclean system shutdown; this can give a modest speed up under some but not all
>> circumstances. Alternatively, you can choose to ensure that the data is
>> consistent with the state of the file system; this means that you will never see
>> garbage data in recently-written files after a crash.
>>
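The recovery rule described under "Journaling" (complete transactions are copied into the filesystem, incomplete ones are discarded) can be sketched roughly like this in Python. This is a toy model, not JBD or ext3 code: the dictionary layout and the name replay are assumptions made for the example, and stopping at the first transaction without a commit record is a choice of this sketch.

```python
def replay(journal, disk):
    """Copy complete transactions into `disk`; discard incomplete ones.

    journal: list of {"blocks": {blockno: data}, "committed": bool},
             oldest transaction first (hypothetical layout).
    disk:    dict mapping blockno -> data, modified in place.
    Returns the number of transactions applied.
    """
    applied = 0
    for txn in journal:
        if not txn["committed"]:
            # No commit record: these blocks carry no consistency
            # guarantee, so nothing from here on is applied.
            break
        disk.update(txn["blocks"])
        applied += 1
    return applied


if __name__ == "__main__":
    disk = {1: "old", 2: "old"}
    journal = [
        {"blocks": {1: "new"}, "committed": True},
        {"blocks": {2: "new"}, "committed": False},
    ]
    print(replay(journal, disk), disk)
```

The cost of this loop depends on the number of journalled blocks rather than on the size of the file system, which is the point made under "Availability".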
>> Linux uses a journal block device to manage journals for filesystems such as
>> ext3. I think that NetBSD needs something similar to Linux's JBD (Journal
>> Block Device).
>>
>> Journal block device
>>
>> For journaling, Linux uses JBD, the Journal Block Device. JBD provides atomicity
>> of operations. It was designed to add journaling capabilities to a block device.
>> The ext3 filesystem code informs the JBD of the modifications it is performing
>> (called a transaction). The journal supports transaction start and stop, and
>> in case of a crash, the journal can replay the transactions to quickly put the
>> partition back in a consistent state.
>>
>> A good journal API could also be used by our non-journaled filesystems, e.g.
>> ffs. The main goal of my SoC project should be the design and implementation of
>> a good journal API, followed by implementing ext3fs support.
> 
> Sounds good.

Great :).
> 
>> The JBD API is used to open, load, commit and administer journal transactions
>> on a device. In Linux, JBD is defined in fs/jbd/ and include/linux/jbd.h.
>>
>> JBD uses these objects in its API: handle, transaction, journal.
>>
>>    1.
>>
>>       A handle is a single atomic update to the filesystem, i.e. a group of
>> writes/updates on disk that should be performed atomically.
>>    2.
>>
>>       Handles are stored in groups called transactions. Only transactions are
>> flushed to the journal. A transaction is atomic in nature because it consists
>> only of atomic handles. While it is being committed, a transaction can be in
>> these states:
>>          1.
>>
>>             Running: the transaction currently is live and can accept new
>> handles. In a system only one transaction can be in the running state.
> 
> I'll want to see how things develop, but this could be a bottleneck 
> eventually. If I understand you correctly, this means that only one thread 
> in the file system can update metadata (or real data) at once. I don't 
> like that idea. However, chances are that this is a fine assumption to 
> start with.
I can look at this later, once it works with one thread. I will keep the
multi-threaded option in mind when I design and code this API.

> 
>>          2.
>>
>>             Locked: the transaction does not accept any new handles but existing
>> handles are not complete. Once all the existing handles are completed, the
>> transaction goes to the next state.
>>          3.
>>
>>             Flush: all the handles in a transaction are complete. The
>> transaction is writing itself to the journal.
>>          4.
>>
>>             Commit: the entire transaction log has been written to the journal.
>> The transaction is writing a commit block indicating that the transaction log in
>> the journal is complete.
>>          5.
>>
>>             Finished: the transaction is written completely to the journal. It
>> has to remain there until the blocks are updated to the actual locations on the
>> disk.
>>
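The five transaction states listed above can be modelled as a small state machine. The Python below is only a sketch of that description: the class and method names are invented for the example and do not correspond to the Linux JBD API or to any planned NetBSD interface.

```python
from enum import Enum


class TxnState(Enum):
    RUNNING = 1   # live, accepts new handles
    LOCKED = 2    # no new handles; existing ones may still be open
    FLUSH = 3     # all handles complete; writing itself to the journal
    COMMIT = 4    # log written; writing the commit block
    FINISHED = 5  # fully in the journal, waiting for final disk update


class Transaction:
    def __init__(self):
        self.state = TxnState.RUNNING
        self.open_handles = 0

    def start_handle(self):
        if self.state is not TxnState.RUNNING:
            raise RuntimeError("only a running transaction accepts handles")
        self.open_handles += 1

    def stop_handle(self):
        if self.open_handles == 0:
            raise RuntimeError("no open handle to stop")
        self.open_handles -= 1

    def advance(self):
        """Move to the next state in the order listed above."""
        if self.state is TxnState.FINISHED:
            raise RuntimeError("transaction already finished")
        if self.state is TxnState.LOCKED and self.open_handles > 0:
            raise RuntimeError("existing handles are not complete yet")
        self.state = TxnState(self.state.value + 1)
        return self.state
```

A locked transaction refuses to move on to Flush until every open handle has been stopped, which is the waiting step described for the Locked state.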
>>
>> Extending our ext2fs support
>>
>> Our ext2fs implementation is located in src/sys/ufs/ext2fs/. I will assume this
>> path unless I explicitly give another one. For Linux paths I implicitly mean the
>> /usr/src/linux/fs/ext3/ path.
>>
>> Ext3fs SuperBlock
>>
>> I have to extend our super block structure defined in ext2fs.h to support the
>> ext3fs journal options. Our superblock structure includes padding which can be
>> used for adding new features.
>> Also, struct m_ext2fs needs at least a new journal-mounted flag. If we want
>> EXT3 ACL support, struct ext3_acl_header and struct ext3_acl_entry are also
>> needed.
> 
> Just to be clear, we will be adding the exact same features that normal 
> ext3fs has, correct?

Yes. AFAIK ACLs are not essential for using ext3fs.


>> Journal
>>
>> A journal is a log that internally manages updates for a single block device.
>> The updates first are stored in the journal and then are reflected to their real
>> locations on the disk. The area belonging to the journal is managed like a
>> circular-linked list. That is, the journal reuses its area when the journal is full.
>>
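The circular use of the journal area described above could look roughly like the following Python model. It is a simplified sketch under assumptions: the class JournalArea and its methods are hypothetical, one slot stands for one journal block, and a slot is only reused after its block has been written to its real location.

```python
class JournalArea:
    """Fixed-size journal area whose write position wraps around."""

    def __init__(self, nblocks):
        self.slots = [None] * nblocks
        self.head = 0  # next slot to write
        self.tail = 0  # oldest block not yet at its real location
        self.used = 0

    def append(self, block):
        """Store an update in the journal before it reaches the disk."""
        if self.used == len(self.slots):
            raise RuntimeError("journal full: checkpoint before reuse")
        self.slots[self.head] = block
        self.head = (self.head + 1) % len(self.slots)
        self.used += 1

    def checkpoint(self):
        """Release the oldest block once it is at its real location."""
        if self.used == 0:
            raise RuntimeError("journal empty")
        block = self.slots[self.tail]
        self.slots[self.tail] = None
        self.tail = (self.tail + 1) % len(self.slots)
        self.used -= 1
        return block
```

After a checkpoint frees the oldest slot, the next append wraps around and lands in it, so the same area is reused indefinitely.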
>> User land part
>>
>> I have to write usable BSD-licensed mke2fs and e2fsck programs if we want to use
>> the ext3 file system without additional packages from pkgsrc. I will also write
>> a new mount program or extend our mount_e2fs to support journaling.
> 
> This is not correct. While we would PREFER a BSD-licensed set of tools, we 
> can use GPL'd tools if needed. I mainly mention this as the other aspects 
> of this project NEED to happen in the SoC time frame, while this can be 
> cleaned up later.

Sorry for this, I will remove this part from my project proposal.

>> Documentation
>>
>> Write good documentation about the development process so that other developers
>> can use it, and include it in the NetBSD internals book.
>>
>> EOF
>>
>> I'm still working on this proposal, so it is a work in progress, but I want to
>> discuss this project here.
> 
> Take care,
> 
> Bill

Regards
- ---------------------------------------------------------------
Adam Hamsik
ICQ 249727910
jabber haad@jabber.org
- ---------------------------------------------------------------
There are 10 kinds of people in the world. Those who understand
binary numbers, and those who don't.
				
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (NetBSD)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFF/Swt9Wt2FT7y228RAgUYAKCFD94pT6UVn2kCdR0DPpuOKKpKvQCgkZRU
O9dhuOrIZ0XZQm0pF1aIJww=
=qZ8I
-----END PGP SIGNATURE-----