tech-userlevel archive


Re: getting creative with cp (was Re: cp -n diff)



On Friday, I wrote

> I would prefer to have the filesystem autodetect whenever a block is
> written that is identical in content to a block already present and
> share the data.  (Yes, what I'm talking about here borders on a
> content-addressed disk layer.)

Various people made various comments, but I got curious.  So I picked
a handy filesystem (a desktop machine with everything on /, configured
that way for a while, and thus holding a reasonably random mix of
current content and deleted old content on the disk).  I computed the
SHA1 checksum of each 512-byte disk block and used the resulting data
to work out how much duplication there was.
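
Roughly, the analysis amounts to something like the following minimal
Python sketch (not the tool actually used; the device path and block
size constant are placeholders):

    import hashlib
    from collections import Counter

    BLOCK_SIZE = 512
    DEVICE = "/dev/rwd0a"  # hypothetical path; a disk image works too

    counts = Counter()  # SHA1 digest -> number of blocks with that content
    with open(DEVICE, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if len(block) < BLOCK_SIZE:
                break
            counts[hashlib.sha1(block).digest()] += 1

    total = sum(counts.values())   # blocks examined
    distinct = len(counts)         # distinct block contents
    print(f"collapsing duplicates would save {total - distinct} "
          f"of {total} blocks")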

Assuming no collisions (i.e., that any two blocks that hash
identically are indeed identical - probably true, and certainly close
enough to true for a preliminary investigation like this one),
collapsing all copies of identical blocks would collapse away 1000474
of the 6291456 blocks, or about 16% of the disk.  This strikes me as
enough to be significant.

As I'm sure nobody is surprised to hear, the savings split between a
few block contents that appear very many times and a great many
contents that appear only a few times.  Here are the head and tail of
the distribution (a sketch of how to derive it from the per-content
counts follows the table).  The second column is the number of copies;
the first column is the number of distinct block contents of which
that many copies exist (the last line is, thus, blocks which aren't
duplicated at all).  All the lines I've cut had both numbers below
1000.

      1  329499
      1   45733
      1   20654
      1    4500
      1    1014
   ....
   2441       6
   4158       5
  13675       4
  52445       3
 361878       2
4852624       1
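
Deriving that distribution from the per-content copy counts is nearly
a one-liner; continuing the sketch above (the toy counts here are made
up purely for illustration):

    from collections import Counter

    def copy_distribution(counts: Counter) -> Counter:
        # copies -> number of distinct block contents with that many copies
        return Counter(counts.values())

    # Toy example: three contents, appearing 3, 2, and 2 times.
    counts = Counter({b"a": 3, b"b": 2, b"c": 2})
    dist = copy_distribution(counts)
    for copies in sorted(dist, reverse=True):
        print(f"{dist[copies]:7d} {copies:7d}")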

I had a quick look at the most-repeated contents.  The most repeated
block is, to (I hope) nobody's surprise, the all-zero block.  The
other four of the top five quoted above are each a single octet
repeated 512 times; in the order given above, the octet values are
0xff, 0x80, 0x2e, and 0x01.  This suggests that some savings could be
achieved just by optimizing blocks that consist of a single repeated
octet - for this kind of workload, at least.
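
Testing for such blocks is cheap.  A sketch (hypothetical helper, not
part of the analysis above):

    def is_repeated_octet(block: bytes) -> bool:
        # True iff the block is one octet value repeated throughout.
        return len(block) > 0 and block == block[:1] * len(block)

    assert is_repeated_octet(b"\x00" * 512)   # the all-zero block
    assert is_repeated_octet(b"\xff" * 512)
    assert not is_repeated_octet(b"\x00" * 511 + b"\x01")

One plausible optimization would be to store such a block as just the
octet value plus a flag, rather than as a full data block.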

Now I'm curious.  I'll repeat this analysis on other filesystems I have
access to.  If anyone else cares to do the same elsewhere, I'd be
interested to hear.

But not now - I have to run.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

