tech-repository archive


Repository conversions



I've taken some time to try and convert pkgsrc and src into mercurial,
and this is a kind of braindump of what happened, lessons learned,
etc.

The goal was to produce a mercurial mirror of src and pkgsrc which I
could use personally. The challenge was to get there, armed with only
a copy of the CVS repo, and to exercise various conversion tools in
the process.

The current status is this - as of March 28th, 2014, I have mercurial,
git, fossil (and cvs) copies of the original NetBSD cvs repos for src
and pkgsrc.  The reasons for having so many different types of repos
will become apparent later on.  The sizes for the src repos are as
follows:

        26G     src.hg2
        15G     src.hg1
        2.1G    src.hg
        6.8G    src.git
        5.8G    src-20140328.fossil
        9.8G    repo/src (cvs)

The reason for the three mercurial repos is covered in section 3.


1. Straight "hg convert" from cvs to mercurial

Having taken a copy of the ,v files in the pkgsrc cvs repo, I started
out with a straight "hg convert" of the cvs repo.  This turned up
several interesting problems:

+ the conversion process uses cvsps, and then parses the commit logs.
Unfortunately, there are 2 commits to dcraw which quote the upstream
cvs commit log verbatim; "hg convert" takes the quoted log to be the
start of a new change, gets weirded out by the unfamiliar revision
number, and aborts.

+ I also had to take the RCS files for openjade and crimson out of my
conversion process, or hg convert aborted there too

+ pkgsrc/sysutils/user/files/Attic/md5,v is a zombie that won't die,
and also aborted the conversion.  This is my own fault, as I wrote the
code :-) This one was also fixed manually.

After more than 3 days, I finally had a converted hg repo.
Unfortunately, only about half of the files were present, whether in
a cloned repo or in a working copy.  This wasn't usable in any way,
so it was time to try another approach to conversion.
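
A check along the following lines is one way to spot files that went
missing in a conversion.  It is only a sketch in Python and was not
part of the conversion described here: the function names are made up
for illustration, and it assumes that every ,v file outside an Attic
directory should show up in a trunk working copy, which ignores files
that only live on branches.

```python
import os


def cvs_tracked_files(cvs_root):
    """Yield relative paths of files that look live on the CVS trunk.

    Assumption: a ',v' file outside an 'Attic' directory is live.
    'CVSROOT' is skipped because it only holds administrative files.
    """
    for dirpath, dirnames, filenames in os.walk(cvs_root):
        # Prune directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in ("Attic", "CVSROOT")]
        for name in filenames:
            if name.endswith(",v"):
                full = os.path.join(dirpath, name[:-2])  # strip the ',v'
                yield os.path.relpath(full, cvs_root)


def missing_from_checkout(cvs_root, checkout):
    """Return the sorted list of live CVS files absent from the checkout."""
    return sorted(
        path
        for path in cvs_tracked_files(cvs_root)
        if not os.path.exists(os.path.join(checkout, path))
    )
```

Calling missing_from_checkout("repo/pkgsrc", "pkgsrc-working-copy")
(both paths hypothetical) returns the files to investigate; an empty
list means every live ,v file has a counterpart in the working copy.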


2. reposurgeon

I set up reposurgeon after the previous failure.  This was more
promising, but fairly heavy on resources - the process would eat up
12-16 MB of memory every five seconds.  By adding swapfiles with
swapctl I got up to 20GB of swap, and reposurgeon was still eating it
at the same rate.  After the process died overnight when I wasn't
around to add even more swap, I decided to look for a different
approach. In retrospect, the memory appetite may be an artefact of
the size of the converted repo, which I only discovered later on, but
without any information from reposurgeon, it's difficult to tell.
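
For a sense of scale, the growth rate quoted above can be turned into
a rough time budget.  The Python sketch below assumes the growth is
linear and that 1 GB is 1024 MB; neither is confirmed by the original
measurements, and the function name is made up for illustration.

```python
def hours_until_full(capacity_gb, mb_per_step, step_seconds=5.0):
    """Hours before `capacity_gb` is used up at a constant growth rate.

    Assumes linear growth of `mb_per_step` megabytes every
    `step_seconds` seconds, and 1 GB = 1024 MB.
    """
    steps = capacity_gb * 1024 / mb_per_step
    return steps * step_seconds / 3600


if __name__ == "__main__":
    for rate in (12, 16):
        hours = hours_until_full(20, rate)
        print(f"{rate} MB per 5 s: 20 GB lasts about {hours:.1f} hours")
```

Under those assumptions 20GB of swap lasts somewhere between about
1.8 and 2.4 hours, which fits with the process dying overnight.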


3. cvs to fossil to git to hg

The method that worked for me in the end is thanks to Joerg, and
basically goes through every conceivable DVCS on the way to a usable
mercurial repo (oh, bazaar, damn).  Joerg has written up all the ins
and outs of his conversion process, and makes his git mirrors
available for everyone to use:

        https://blog.netbsd.org/tnf/entry/fossil_and_git_mirrors_of
        https://github.com/jsonn/pkgsrc
        https://github.com/jsonn/src

and the fossil files:

        http://ftp.netbsd.org/pub/NetBSD/misc/repositories/fossil/

and his sources for the cvs2fossil conversion are discussed here:

        http://www.sonnenberger.org/2011/05/12/may-update-cvs2fossil/

So I used the same method that Joerg used (not optimised to miss out
the top-level directory), and after a day of fossil and git
fast-import, I too had git repos I could use.  The theory was that the
path from cvs to git is far better trodden than the one from cvs to
mercurial, and so bugs are more likely to have been zapped by others
along the way.  And I thank them for that.  On the bright side, all of
the converted files seemed to be in the resulting git repo.
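
The fossil-to-git leg of that chain can be sketched as a pipe from
"fossil export --git" into "git fast-import".  The Python below is an
illustration under assumptions, not Joerg's actual scripts: the
function names and the file names in the usage note are hypothetical,
and the earlier cvs-to-fossil step is not covered.

```python
import subprocess


def export_commands(fossil_repo, git_dir):
    """Return the two argv lists for the export | import pipe."""
    exporter = ["fossil", "export", "--git", fossil_repo]
    importer = ["git", "--git-dir", git_dir, "fast-import"]
    return exporter, importer


def fossil_to_git(fossil_repo, git_dir):
    """Create a bare git repo and stream the fossil history into it."""
    subprocess.run(["git", "init", "--bare", git_dir], check=True)
    export_cmd, import_cmd = export_commands(fossil_repo, git_dir)
    exporter = subprocess.Popen(export_cmd, stdout=subprocess.PIPE)
    importer = subprocess.Popen(import_cmd, stdin=exporter.stdout)
    # Close our copy of the pipe so the exporter sees a broken pipe
    # if the importer exits early.
    exporter.stdout.close()
    import_status = importer.wait()
    export_status = exporter.wait()
    if import_status != 0 or export_status != 0:
        raise RuntimeError(
            f"export exited {export_status}, import exited {import_status}"
        )
```

A call such as fossil_to_git("src.fossil", "src.git") would then do
the whole leg in one go, with no intermediate dump file on disk.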

The "hg convert" from git to mercurial went much quicker than the one
from cvs to hg, at a rough estimate around 10000 changesets per hour.
About a day later I had a pkgsrc.hg repo.  It was 26GB in size.
David Holland kindly pointed me towards mpm, who gave me some
guidance on ways to reduce the size of the mercurial repo, straight
from the horse's mouth. The way we've grown the cvs repo over the
years brings out the worst in mercurial's space management, and the
--datesort switch I used when converting from git to hg also seemed
suboptimal. mpm also went and created the

        http://mercurial.selenic.com/wiki/GeneralDelta

page, which says:

        "The original Mercurial compression format has a particular
        weakness in storing and transmitting deltas for branches that
        are heavily interleaved.  In some instances, this can make the
        size of the manifest data (stored in 00manifest.d) balloon by
        10x or more.  The generaldelta option is an effort to mitigate
        that, while still maintaining Mercurial's O(1)-bounded
        performance."

Armed with that information, I then did two further clones of the src
repo (see sizes above) using generaldelta, and brought the size down
from 26GB to 2.1GB.
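
The exact commands behind those clones are not recorded here, so the
following Python sketch only shows the general shape: an "hg convert"
from git, and a re-clone with format.generaldelta switched on.  The
--pull option forces mercurial to re-encode the revlogs instead of
copying them.  The function names are made up for illustration, and
whether this matches the options actually used for src.hg1 and src.hg
is an assumption.

```python
import subprocess


def convert_command(git_repo, hg_repo, datesort=False):
    """argv for converting a git repo to mercurial with hg convert.

    The convert extension is enabled on the command line so that no
    hgrc change is needed.
    """
    cmd = ["hg", "--config", "extensions.convert=", "convert"]
    if datesort:
        cmd.append("--datesort")
    return cmd + [git_repo, hg_repo]


def generaldelta_clone_command(src, dst):
    """argv for re-cloning a repo into generaldelta storage.

    -U skips the working copy; --pull makes hg rewrite the store
    rather than hardlink or copy the existing revlogs.
    """
    return [
        "hg", "clone", "-U", "--pull",
        "--config", "format.generaldelta=1",
        src, dst,
    ]


def run(cmd):
    """Run one of the commands above, raising if it fails."""
    subprocess.run(cmd, check=True)
```

For example, run(generaldelta_clone_command("src.hg2", "src.hg")) would
produce a new repo next to the old one, so the two sizes can be
compared before anything is deleted.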

The saving on the pkgsrc repo was even more spectacular:

        2.7G    pkgsrc-20140325.fossil
        2.6G    pkgsrc.git
        22G     pkgsrc.hg2
        1.1G    pkgsrc.hg1
        638M    pkgsrc.hg
        1.6G    repo/pkgsrc
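
To put numbers on "even more spectacular", the largest and smallest
mercurial sizes from the two listings can be compared directly.  This
Python sketch takes the rounded du figures at face value and assumes
1G = 1024M, so the factors are approximate.

```python
def shrink_factor(before_mb, after_mb):
    """How many times smaller the repo became."""
    return before_mb / after_mb


# (largest hg repo, smallest hg repo) in MB, from the listings above
SIZES_MB = {
    "src": (26 * 1024, 2.1 * 1024),   # src.hg2 -> src.hg
    "pkgsrc": (22 * 1024, 638),       # pkgsrc.hg2 -> pkgsrc.hg
}

if __name__ == "__main__":
    for name, (before, after) in SIZES_MB.items():
        print(f"{name}: {shrink_factor(before, after):.1f}x smaller")
    # prints:
    # src: 12.4x smaller
    # pkgsrc: 35.3x smaller
```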

Thanks to mpm and the folks who helped out on #mercurial at freenode,
and a huge raspberry to the pastebin fanatics there.

The upshot of this is that I now have mercurial (and git and fossil)
mirrors, if anyone wants to try it/them out as an alternative to cvs.

I'm currently negotiating with the system administrators at NetBSD.org
about getting some resources to house this on a more formal basis.

