tech-kern archive


Re: ffs snapshots patch

On Sat, Apr 16, 2011 at 09:29:26PM +0200, Manuel Bouyer wrote:
> Hello,
> Attached is a work-in-progress patch for ffs snapshots (as it's a work in
> progress, some debug and instrumentation code is still present in the
> patch; no need to comment on that part :).
> This work started when, while working on quotas, I noticed that taking a
> snapshot of a 500GB filesystem takes several minutes, and that the time
> is O(n) in the number of persistent snapshots.
> Here are some timings on an otherwise idle 500GB filesystem (it's some brand
> of SATA2 3.5" drive attached to an AHCI controller, so it's a reasonable
> test bed for today):
> java# /usr/bin/time fssconfig fss0 /home /home/snaps/snap0
>       260.53 real         0.00 user         1.15 sys
> /home: suspended 77.873 sec, redo 1184 of 2556
> java# /usr/bin/time fssconfig fss1 /home /home/snaps/snap1
>       377.87 real         0.00 user         2.53 sys
> /home: suspended 206.078 sec, redo 1184 of 2556
> java# /usr/bin/time fssconfig fss2 /home /home/snaps/snap2
>       508.23 real         0.00 user         4.28 sys
> /home: suspended 338.534 sec, redo 1184 of 2556
> java# /usr/bin/time fssconfig fss3 /home /home/snaps/snap3
>       621.40 real         0.00 user         5.50 sys
> /home: suspended 431.154 sec, redo 1183 of 2556
> Suspending a filesystem for more than 7 minutes to take a snapshot makes
> persistent snapshots quite useless to me. I wonder how it would behave
> on a multi-terabyte filesystem.
> I looked at where the time is spent and found 2 major issues:
> 1) cgaccount() works in 2 passes: first it copies the cgs before suspending
>    the filesystem; then it is called again to copy only the cgs that have
>    been modified between the first copy and the filesystem suspension.
>    The problem is that to copy a cg we need to allocate blocks for the
>    snapshot file, which may land in a cg we have already copied. This is
>    the cause of the high number of cg copies (almost half of them) done
>    with the filesystem suspended.
> 2) While the filesystem is suspended, we want to expunge the snapshot files
>    from the new snapshot's view (make them appear as 0-length files).
>    With ~500GB sparse files this is a lot of work.
> I fixed 1) by preallocating the needed blocks in snapshot_setup().

Good catch.  Committed.

> Fixing 2) is trickier. To avoid the heavy writes to the snapshot file
> with the fs suspended, the snapshot appears with its real length and
> blocks at the time of creation, but is marked invalid (only the
> inode block needs to be copied, and this can be done before suspending
> the fs). Now BLK_SNAP should never be seen as a block number, and we skip
> ffs_copyonwrite() if the write is to a snapshot inode.

I strongly object here.  There are good reasons to expunge old snapshots.

Even if it were done right, without deadlocks and locking-against-self,
the resulting snapshot loses at least two properties:

- A snapshot is considered stable.  Whenever you read a block you get
  the same contents.  Allowing old snapshots to exist but not running
  copy-on-write means these blocks will change their contents.

- A snapshot will fsck clean.  It is impossible to change fsck_ffs
  to check a snapshot, as the old snapshots' indirect blocks will now
  contain garbage.

You cannot copy blocks before suspension without rewriting them once
the file system is suspended.

The check in ffs_copyonwrite() will only work as long as the old
snapshot exists.  As soon as it gets removed we will run COW
on the blocks used by the old snapshot.

> With these changes the times are much more reasonable:
> /usr/bin/time fssconfig fss0 /home /home/snaps/snap0
>       299.68 real         0.00 user         1.10 sys
> /home: suspended 0.310 sec, redo 0 of 2556
> /usr/bin/time fssconfig fss1 /home /home/snaps/snap1
>       188.10 real         0.00 user         0.86 sys
> /home: suspended 0.270 sec, redo 0 of 2556
> /usr/bin/time fssconfig fss2 /home /home/snaps/snap2
>       169.78 real         0.00 user         0.95 sys
> /home: suspended 0.450 sec, redo 0 of 2556
> /usr/bin/time fssconfig fss3 /home /home/snaps/snap3
>       172.39 real         0.00 user         0.99 sys
> /home: suspended 0.300 sec, redo 0 of 2556
> This seems to work; one issue with this patch is that the block
> count for the snapshot inode and the block summary information (the
> latter probably being a consequence of the former) appear wrong when
> running fsck against a snapshot.  I believe this is fixable, but
> I haven't yet found where the information mismatch is coming from.
> Comments?
> PS: I'm away from computers for one week, so don't expect replies to
> your comments before next Sunday.
> -- 
> Manuel Bouyer <>
>      NetBSD: 26 years of experience will always make the difference
> --

Juergen Hannken-Illjes - - TU Braunschweig 
