Reply to David Holland's notes and comments

To: tech-repository%netbsd.org@localhost
Subject: Reply to David Holland's notes and comments
From: esr%snark.thyrsus.com@localhost (Eric S. Raymond)
Date: Wed, 7 Jan 2015 05:50:50 -0500 (EST)

Apologies for the slightly belated reply; I'm not subscribed to this
list yet and found David Holland's comments when checking the list
archives to make sure my technical proposal had come through.

(Alan Barrett: I haven't seen a reply from you. If you sent one,
please resend.)

Please copy replies to esr%thyrsus.com@localhost

David Holland:
>While in general I agree, you do realize we already have one family of
>incremental conversions running, right?

Yes.  And knowing what I know about CVS malformations, that makes me a
little nervous about the output.  No matter; we can clear all that up.
I have good tools for checking conversion quality.

Specifically, I have a script wrapper that, after conversion to git,
checks for a content match at every tag and branch head. Later today
I'll run it on src; I haven't had time to since the Great Beast
arrived.
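
A check of this kind can be sketched roughly as follows. This is not
the actual wrapper, only a minimal illustration. It assumes that a CVS
checkout and a git checkout of the same tag or branch head already
exist on disk; the function names and the list of skipped directories
are assumptions made for this sketch.

```python
import hashlib
import os

# Bookkeeping directories that are not part of the tracked content.
# (An assumption for this sketch; adjust for the trees being compared.)
SKIP_DIRS = {"CVS", ".git"}


def tree_manifest(root):
    """Map each file's path relative to root to the SHA-1 of its content."""
    manifest = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune bookkeeping directories in place so os.walk skips them.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha1(f.read()).hexdigest()
            manifest[os.path.relpath(path, root)] = digest
    return manifest


def compare_trees(cvs_root, git_root):
    """Return (only_in_cvs, only_in_git, differing) as sorted path lists."""
    cvs = tree_manifest(cvs_root)
    git = tree_manifest(git_root)
    only_in_cvs = sorted(set(cvs) - set(git))
    only_in_git = sorted(set(git) - set(cvs))
    differing = sorted(p for p in set(cvs) & set(git) if cvs[p] != git[p])
    return only_in_cvs, only_in_git, differing
```

Three empty lists mean the two trees match at that tag or branch head.
Keyword expansion in the CVS checkout would have to be handled
consistently on both sides, or it will show up as spurious differences.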

>We're more or less aware of that - the possible choices are git, hg,
>maybe Fossil, and "write something", where that last isn't very
>realistic.

No, it isn't. :-)

Your conversion target is for you to decide.  As I've noted, I don't
think Fossil would scale up well enough to be used here, but since
you've already got a Fossil conversion process in place you can run
your own performance tests to check that.

>So, because git doesn't have real branches (only git-branches) the
>current conversion loses branch information. Is this limitation also
>present in the git-fast-export format? If so, is there a way to avoid
>throwing away branch information when converting to hg?

I don't understand what is "real" about CVS branches that isn't "real"
about git branches.  Both are simply labels pointing to tip revisions
in a tree.  Can you clarify what "branch information" you believe is
being lost?

>That's not what "low or moderate" means in these parts.

Right.  I see from later in the archives that good performance results
are being obtained on small systems from git shallow clones, which is
another argument in favor of git.

>There are enough references to CVS version numbers outside the
>repository (mail archives, published and signed security advisories,
>bug reports) that we need to preserve the CVS version numbers either
>as searchable metadata in the VCS or in some external searchable table
>of equivalences.

This requirement is pretty standard.  You'll get an equivalence table
as a byproduct of the conversion.
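
Such a table could be queried along these lines. The tab-separated
layout (path, CVS revision, commit ID) and the sample entries are made
up for this sketch; the table a real conversion produces may well look
different.

```python
import csv
import io


def load_equivalences(stream):
    """Read path<TAB>cvs_revision<TAB>commit_id rows into a lookup dict."""
    table = {}
    for row in csv.reader(stream, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comments
        path, cvs_rev, commit_id = row
        table[(path, cvs_rev)] = commit_id
    return table


def lookup(table, path, cvs_rev):
    """Return the commit ID for a CVS file revision, or None if unknown."""
    return table.get((path, cvs_rev))


# Hypothetical sample data, not taken from any real conversion.
SAMPLE = "src/foo.c\t1.1\tcommit-a\nsrc/foo.c\t1.2\tcommit-b\n"
table = load_equivalences(io.StringIO(SAMPLE))
```

A reference such as "foo.c 1.2" in an old advisory can then be resolved
with lookup(table, "src/foo.c", "1.2").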

>My opinion on this is that all or nearly all of these more or less
>bogus branches should just be eliminated and the import turned into a
>regular add and commit. It might take hand review to identify which
>branches need this treatment; but a good approximation (once one has
>changesets) is any vendor branch import changeset in pkgsrc where the
>same files have never had another version imported on that or any
>other vendor branch.

From your description it probably is going to take hand review.  I'll
need to look at some concrete examples to be sure I understand all the
ramifications.

>This does not address the other (real) vendor branches in src; I think
>it's clear what the proper semantics are there though.

Agreed.  I don't anticipate any real problems there.

>I... had thought it deduced and stored the information at commit time.

Nope.

>This may be slightly off topic in this thread, but: how does this
>work, and how can it possibly both scale and work reliably? Does it
>check every other file in the repository for similarity (and in every
>previous version) every time you do git log?

I don't know how it works internally.  I believe part of the answer is
that they got acceptable scaling of rename and copy detection at the 
cost of not having it work reliably - that is, it can occasionally throw
false negatives.

What I know is that rename and copy matches are detected by the
porcelain, not natively represented in plumbing (git's filesystem-like
storage engine).  You can explicitly tell the exporter to *generate* R
and C ops (which you want to do if you're shipping to a
container-tracking VCS like hg under which the importer will interpret
them) but the exporter uses a heuristic (probably based on SHA1
matching) to generate them.
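
To illustrate why after-the-fact detection is a heuristic, here is a
minimal sketch of the simplest possible variant: pairing a deleted path
with an added path when their content hashes are identical. This is a
toy under stated assumptions, not git's implementation (git itself also
considers similar but non-identical files). Snapshots are modeled as
plain dicts from path to bytes, and the function name is hypothetical.

```python
import hashlib


def detect_exact_renames(old, new):
    """Pair deleted paths with added paths whose content is byte-identical.

    old and new map path -> file content (bytes) for two successive
    snapshots. Returns a list of (old_path, new_path) pairs.
    """
    deleted = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))

    # Index added files by content hash.
    by_hash = {}
    for path in added:
        key = hashlib.sha1(new[path]).hexdigest()
        by_hash.setdefault(key, []).append(path)

    renames = []
    for path in deleted:
        candidates = by_hash.get(hashlib.sha1(old[path]).hexdigest())
        if candidates:
            # Take the first unclaimed candidate; each added path is used once.
            renames.append((path, candidates.pop(0)))
    return renames
```

A file that is renamed and edited in the same commit no longer hashes
to the same value, so this variant misses it: a false negative of the
kind described above.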

>To what extent do your tools allow importing external annotations
>about renames?

Not at all.  The reason should be clear from the foregoing.

>...as above, what about branch metadata?

Again, I don't know what branch metadata you intend.  What is there in
CVS beyond the branch name itself?

>Given that we've had conversions running for some time, which required
>doing a lot of cleanup and turned up some fascinatingly broken things,
>it seems likely to me that we've already stepped on most of these
>problems.

Let us devoutly hope so.

Vendor branches are a defect attractor. The remaining trouble spots 
likely cluster around those.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>
