Subject: Re: Which snapshot strategy to use? was: How to capture all file system writes (fwd)
To: NetBSD Kernel Technical Discussion <tech-kern@netbsd.org>
From: gabriel rosenkoetter <gr@eclipsed.net>
List: tech-kern
Date: 10/23/2003 18:50:23

On Thu, Oct 23, 2003 at 02:03:03PM -0700, Jason Thorpe wrote:
> You want a snapshot?  Suspend the cleaner, and then provide a view of
> the older data (there are obviously other details to be worked out
> here, but you get the idea.)

Natch. Thanks. I do.

So, in fact, snapshots are *way* cheaper on LFS than FFS. Cool.

On Thu, Oct 23, 2003 at 03:02:27PM -0700, Greywolf wrote:
> Thus spake gabriel rosenkoetter ("gr> ") sometime Today...
> gr> Don't think log. Think copy-on-write into a buffer that you have
> Begging pardon, but isn't that a matter of semantics?

No, it really isn't. Copy-on-write doesn't imply anything
resembling the same semantics as logging a series of operations.
COW means you can go read the whole block from the location you
copied it to rather than reading the original block and then
reperforming a bunch of operations on the fly before your
(blocking!) read completes.

When you're done, the copied blocks are overlaid directly over
blocks in the underlying file system. That's just a couple of
(massive) write operations. Much cheaper than unrolling a log.
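Here's a toy sketch of that dispatch in C (names and structure are
mine, purely illustrative; nothing to do with the actual Veritas or
FFS code). A per-block flag says whether a scratch copy exists, so a
live read is always exactly one whole-block read, with no log to
unroll first:

#include <string.h>

#define NBLK	8		/* blocks in our toy device */
#define BLKSZ	512		/* bytes per block */

struct cowdev {
	char	orig[NBLK][BLKSZ];	/* the underlying (frozen) disk */
	char	scratch[NBLK][BLKSZ];	/* snapshot scratch space */
	char	copied[NBLK];		/* 1 if block has a scratch copy */
	int	dead;			/* scratch overflowed (see below) */
};

/*
 * Live read: one lookup, one whole-block copy.  A read through the
 * snapshot itself just hits orig[] directly instead.
 */
void
live_read(struct cowdev *d, int blk, char *buf)
{
	memcpy(buf, d->copied[blk] ? d->scratch[blk] : d->orig[blk], BLKSZ);
}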

> And wouldn't the buffer actually have to be a working copy of
> the FS, in a way?

Yes, it's the working copy of the file system that everyone except
for the sysadmin is using as if it were the regular FS.

But it's not a full copy; initially, it's a bunch of empty space. As
write operations are performed, the blocks they touch are copied
into the snapshot's scratch space, which is typically far smaller
than the underlying FS. (You have to make a really concerted effort
to write even 10% of the size of a file system within the time it
takes to back the file system up, even to network-attached tape.)

> How would it handle the scenario of:
>
> 	# freeze writes to /fs
> 	# mkdir -p /fs/tmp/foo
> 	# cp /etc/passwd /fs/tmp/foo/bar
> 	# mv /fs/tmp/foo/bar /fs/tmp/foo/grill
> 	# rm /fs/tmp/foo/grill

Probably by allocating a single COW block in the snapshot on the
mkdir call, and then only editing that block from there on out.
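In terms of the toy struct cowdev above (still purely illustrative),
the write path would copy the original block into scratch on the
first touch only; everything after that edits the copy in place, so
a run of operations that all land in the same directory block costs
a single block copy:

void
live_write(struct cowdev *d, int blk, size_t off, const char *buf,
    size_t len)
{
	if (!d->copied[blk]) {
		/*
		 * First touch: seed the scratch copy with the whole
		 * original block, so the part this write doesn't
		 * cover survives.
		 */
		memcpy(d->scratch[blk], d->orig[blk], BLKSZ);
		d->copied[blk] = 1;
	}
	/* Every later write just edits the copy in place. */
	memcpy(d->scratch[blk] + off, buf, len);
}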

> Whether you're buffering fs ops, or you're keeping a log of said fs ops,
> isn't it the same?

But a snapshot doesn't imply doing either of those things; it
implies using copy-on-write semantics at the block level with disk
storage.

> If you don't have a full backing store of the fs, what do you do when
> your suspended write-through (for lack of a more available term) buffer
> fills before your snapshot or desired operations on the filesystem
> are finished?

Any operations reading from the snapshot (the underlying, untouched
file system) instantly fail. I've had Veritas snapshots drop out on
me while I was doing a vxdump. It's as if you'd disconnected the
physical disk.

> Do you block writes at that point and leave the behaviour
> up to whether the writes are blocking/nonblocking?
>
> I have to wonder about that, since holding full backing store of a
> near-full 120G fs is not exactly your average cup of coffee.

If I ever dealt with anything so small as a 120 GB file system
(that's not even a single shot of espresso, dude ;^>), I would
allocate at most 20 GB in a Veritas snapshot to back it up. (Most
recently, I used 90 GB for a 2 TB file system. That was fine for
twelve hours of time to spin tape on a Sun E450 running an Oracle
database. I'd rather have about 200 GB of snapshot space there,
because a join() in Oracle across significant bits of the 1.5 TB
tablespace can easily blow out 90 GB of writes; that's fairly
usage-dependent.)

Within Veritas, I've got this file system (df -k):

/dev/vx/dsk/larrydg1/F 786432000 519428696 264917376    67%    /xy/F

That's a member of a disk group that has some spare space in it:

rain:/# vxassist -g larrydg1 maxsize
Maximum volume size: 76414976 (37312Mb)

So if I want to take a snapshot of this FS, I create some scratch
space, and mount it as a snapshot of /xy/F:

rain:/# vxassist -g larrydg1 make F_snap 36g
rain:/# mount -F vxfs -o snapof=/xy/F /dev/vx/dsk/larrydg1/F_snap /vrt/snap

Now, /vrt/snap points at the underlying disk, on which only read
operations are possible, and /xy/F points at the scratch space:

rain:/# mount | grep F
/xy/F on /dev/vx/dsk/larrydg1/F read/write/setuid/mincache=closesync/delaylog/largefiles/noatime/ioerror=mwdisable/dev=3b01f40 on Sun Oct 19 08:13:37 2003
/vrt/snap on /dev/vx/dsk/larrydg1/F_snap read only/setuid/snapof=/xy/F/largefiles/ioerror=mwdisable/dev=3b01f41 on Thu Oct 23 18:36:16 2003

Initially, they appear to be identical:

rain:/# df -k | grep F
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/vx/dsk/larrydg1/F 786432000 519428696 264917376    67%    /xy/F
/dev/vx/dsk/larrydg1/F_snap 786432000 519428696 264917344    67% /vrt/snap

As I write to /xy/F, /vrt/snap will remain unchanged. What's going
on behind the scenes, though, is that blocks are being copied off of
the space originally assigned to /xy/F into the 36 GB I just
allocated when I created the snapshot, and modified there. The write
that pushes total copied-block usage past 36 GB (total usage, not
linear offset, of course) makes all the COW blocks drop through to
the underlying 750 GB disk, and any read operation coming off
/vrt/snap instantly gets an I/O error.
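Bolting the overflow behavior onto the toy sketch from earlier
(hypothetical, as before; I don't know Veritas's real mechanics),
you'd fold the copies back over the original blocks, mark the
snapshot dead, and fail every later snapshot read with EIO:

#include <errno.h>

void
scratch_overflow(struct cowdev *d)
{
	int blk;

	/* Drop the copies through onto the underlying disk... */
	for (blk = 0; blk < NBLK; blk++) {
		if (d->copied[blk]) {
			memcpy(d->orig[blk], d->scratch[blk], BLKSZ);
			d->copied[blk] = 0;
		}
	}
	/* ...which destroys the frozen view for good. */
	d->dead = 1;
}

int
snap_read(struct cowdev *d, int blk, char *buf)
{
	if (d->dead)
		return (EIO);	/* as if the disk were disconnected */
	memcpy(buf, d->orig[blk], BLKSZ);
	return (0);
}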

Obviously, I do my backup off of /vrt/snap, and all my users just
bebop along using /xy/F none the wiser. (Unless they're clever;
snapshots take a very noticeable toll on I/O-bound applications,
which is really easy to see with iostat(1).)

On Fri, Oct 24, 2003 at 12:03:51AM +0200, Juergen Hannken-Illjes wrote:
> FreeBSD's snapshot for FFS gives 20 persistent snapshots but to my
> knowledge it is not possible to "go back" to one of these snapshots.
> You may only read or unlink them.

Okay, that smells a whole lot like Veritas's checkpoints, except
that rather than taking up disk space outside the file system (as
with the snapshot I describe above), checkpoints take up space
inside the allocated disk, and are automatically discarded (oldest
to newest) as the file system grows to capacity. Sounds like Kirk's
snapshots are somewhere in between.

I don't think anyone would ever suggest that a snapshot should be
writeable. You can *use* them to create a detached mirror if doing
so by way of hardware or software RAID configuration isn't feasible,
but even when you're using a snapshot for testing and planning
purposes, you're writing to the scratch space, never to the
underlying file system.

I asked about there being a full-fledged volume-manager plan because
being able to allocate disk space that's part of a disk group but
not part of an extant file system is what makes snapshots such a
convenience with Veritas. It seems like NetBSD's got most of the
bits of a volume manager (ccd, raidframe), but the last step is a
non-trivial amount of code.

Would snapshots have to be taken using some other, unallocated
physical disk for scratch space in the interim? (I can see this
working well with FireWire or USB 2.0 mass storage devices.)

-- 
gabriel rosenkoetter
gr@eclipsed.net
