Subject: svn disk usage: where we are and where we're going
To: None <tech-repository@NetBSD.org>
From: Eric Gillespie <epg@NetBSD.org>
List: tech-repository
Date: 09/13/2007 10:45:39

We've had a lot of chatter elsewhere about svn disk usage and its
impact on a potential NetBSD switch to svn.  For those who have
no idea what I'm talking about, I'll briefly summarize the
problem.  Then I'll discuss ways to mitigate it in 1.4 and the
upcoming 1.5, and how the problem will actually be solved in the
next "major" release (likely 1.7; I'm arguing in favor of a
smaller, incremental 1.6).

The problem
===========

For every file in your working copy, Subversion stores a second,
pristine copy of this file, called the text-base.  The text-base
is uncompressed (unless you checked in a compressed file, of
course) and untransformed: no keyword expansion, no newline
translation, and none of your local edits.  A working copy
therefore takes roughly twice the disk space of the files it
contains.  The text-base allows off-line svn diff and lets svn
transmit deltas on commit, whereas CVS sends full texts on commit
and deltas only on update.  I won't go into why an rsync-type
algorithm is not really a good fit for svn.
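
To see what the text-bases cost in a particular working copy,
something like the following should do.  It is only a sketch: it
assumes the current working copy layout, where every versioned
directory carries its own .svn/text-base directory, and the
function name textbase_kb is made up for illustration.

```shell
# Print the total size, in kilobytes, of all text-base directories
# under the given working copy (default: current directory).
# Assumes the layout described above: one .svn/text-base per
# versioned directory.
textbase_kb() {
    find "${1:-.}" -type d -path '*/.svn/text-base' -prune \
        -exec du -sk {} + |
        awk '{ total += $1 } END { print total + 0 }'
}
```

Run 'textbase_kb .' at the top of a working copy and compare the
result against 'du -sk .' for the whole tree.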

Coping
======

NetBSD developers seem to be concerned about this space penalty
from two angles; correct me if I have missed something.  First is
the impact on day-to-day development work.  And second is the
impact on keeping old and slow ports going.

This is controversial in NetBSD circles, but I fall firmly
into the "for the love of God don't use that ancient SPARC as
your daily dev box" camp.  I agree that keeping NetBSD running,
self-hosting, and self-testing on old and slow architectures is
valuable.  But I have no sympathy for people actually trying to
develop on these systems.  The tax is just not worth paying.

That said, there is a trick you can use to make portions of the
tree you don't care about go away.  You need an empty directory
somewhere in the repository (most projects keep one specially for
this, so no one will put things into it).  You can then run 'svn
switch ${repo}/empty gnu/dist' to dump gnu/dist.  It won't come
back on subsequent updates; if you want it back, svn switch it
back to the URL for gnu/dist.
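
Wrapped up as shell functions, the trick might look like the
sketch below.  The function names and the SVN override variable
are my own invention; the only thing assumed about the repository
is the empty directory at ${repo}/empty mentioned above.

```shell
# Set SVN=echo to preview the commands instead of running them.
SVN=${SVN:-svn}

# Replace a subtree of the working copy with the repository's
# empty directory.
# usage: drop_subtree REPO_URL PATH
drop_subtree() {
    "$SVN" switch "$1/empty" "$2"
}

# Bring a dropped subtree back by switching it to its real URL.
# usage: restore_subtree SUBTREE_URL PATH
restore_subtree() {
    "$SVN" switch "$1" "$2"
}
```

For example, 'drop_subtree "$repo" gnu/dist' gets rid of gnu/dist,
and restore_subtree with the real URL for gnu/dist brings it back.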

Subversion 1.5 (in the home stretch now; look for it in the next
couple months) includes a new --depth option to commands such as
checkout and update.  It lets you exclude trees from your working
copy without such silly hacks as the above.  In a future release,
svn may grow something like Perforce client specs to manage this.
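
A rough idea of how this could be used, assuming the --depth and
--set-depth spellings that 1.5 is expected to ship with (details
may still change before the release).  The function names are
hypothetical.

```shell
# Set SVN=echo to preview the commands instead of running them.
SVN=${SVN:-svn}

# Check out only the top directory: its files, plus its immediate
# subdirectories as empty placeholders.
# usage: sparse_checkout URL PATH
sparse_checkout() {
    "$SVN" checkout --depth immediates "$1" "$2"
}

# Fully populate one subtree of a sparse working copy.
# usage: pull_subtree PATH
pull_subtree() {
    "$SVN" update --set-depth infinity "$1"
}
```

After a sparse_checkout, you would pull in only the subtrees you
actually work on; everything else stays an empty directory and
costs no text-base space.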

Finally, if we are using old and slow platforms only to build and
test themselves, we don't need working copies at all.  The svn
export command is just like checkout, except it never creates the
meta-data and text-bases.  Of course, as Perry points out, these
ports can use NFS-exported source trees.
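
For a build-only machine, that amounts to something like the
following sketch.  The function name is hypothetical, and which
URL to export depends on how the repository ends up being laid
out.

```shell
# Set SVN=echo to preview the command instead of running it.
SVN=${SVN:-svn}

# Fetch a plain source tree with no .svn meta-data or text-bases.
# usage: export_tree URL DEST [REVISION]   (REVISION defaults to HEAD)
export_tree() {
    "$SVN" export -r "${3:-HEAD}" "$1" "$2"
}
```

The resulting tree is not a working copy, so it cannot be updated
or committed from; to refresh it, export again.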

The future
==========

NetBSD is not the first project that doesn't want to pay the
text-base tax, not by far.  The working copy library is widely
regarded as crap.  It's the oldest part of Subversion, and
suffers dreadfully from organic growth.  It also suffers from
design flaws such as scattering the meta-data throughout the
working copy (a model that was copied from CVS without thinking
through the implications; oops) and forcing the additional
text-base on the user.

So, basically, this is not going to be fixed except by rewriting
the working copy.  The issue has been simmering for a while now,
and we seem to have broad consensus that this should be the next
big feature, after merge-tracking is solid (1.5 and 1.6).  My
team at Google will be working on this, and I think a few other
committers will be working on it as well.  I wouldn't expect a
release with this rewrite before the end of 2008, though.

The UI for checking out without text-bases is obviously quite a
ways off from being decided, but it will probably be something
like a --no-text-bases option to checkout and a new svn edit
command to create the text-base.  It would be premature to go
into any real detail now.

-- 
Eric Gillespie <*> epg@NetBSD.org