Re: is the proof in the pudding?

To: David Holland <dholland-tech%netbsd.org@localhost>
Subject: Re: is the proof in the pudding?
From: "Perry E. Metzger" <perry%piermont.com@localhost>
Date: Tue, 29 Jul 2008 13:47:53 -0400

David Holland <dholland-tech%netbsd.org@localhost> writes:
> On Tue, Jul 29, 2008 at 07:54:40AM -0400, Perry E. Metzger wrote:
>  > > The short answer is that currently none of them is suitable. 
>  > >
>  > > svn has scalability/resource usage problems,
>  > 
>  > FreeBSD has successfully managed to import their entire tree and
>  > people are using it, so it clearly will scale to the appropriate size
>  > of repository.
>
> Nonetheless, many people report that svn uses unreasonable amounts of
> memory and/or disk space, especially when compared to other tools.

That may in fact be the case. I'd prefer for us to make the decision
based on experience by a number of NetBSD testers rather than based on
third party statements. In particular, SVN has gotten much better with
time on memory usage, and it may be that the consumption levels are
not an issue for current developers in any case.

>  > > weird branch/tag semantics,
>  > 
>  > I see nothing at all weird about the semantics.
>
> Perhaps then you should explain how tagging works and let everyone
> decide. Based on how it's been explained to me, I would describe it as
> weird.

It isn't very weird.

In SVN, the "filesystem namespace" makes branches and tags manifest as
subdirectories. The main trunk is one subdirectory, branches are
another subdirectory, etc.

So, that means that the main trunk version of src/bin/ls/ls.c might be
seen in the SVN repo as trunk/src/bin/ls/ls.c, and the NetBSD 7 branch
version might be branch/NETBSD-7/src/bin/ls/ls.c.

If you want the whole trunk, you check out trunk/src; if you want the
branch, you check out branch/NAME/src.

To branch a subtree, you just "copy" the subtree within the SVN
repository, which is an effectively instant operation.  If you wanted
to, say, branch NetBSD 7 you would just
  svn cp trunk/ branch/NETBSD-7
(I may have the syntax slightly off, it has been a few months.)

A tag is the same sort of operation, and is (again) instant. Tags are
not used to indicate "branch points" in SVN, they're just a way of
snapshotting the repository. There is no need in SVN to mark the
revisions of all files where a branch happened -- it knows on its own,
and in any case all updates are repository atomic, history is
preserved, etc. All the problems tags help you get around in CVS don't
exist in SVN. People usually only use a tag in SVN to memorialize
things (like "these are the released netbsd-7.3 sources, never to be
touched for all time".)

The particular naming scheme used is a convention, btw.  Generally,
people will set up a portion of an SVN repository so that the main
sources are under something like trunk/ and branches are under
something like branch/NAME and tags under tag/NAME etc.

You may protest that this idea of doing a "copy" to branch a repo is
bizarre and alien, but that's just because you're not used to it. It
is no stranger than CVS's mechanisms. It isn't the way CVS does
things, to be sure, but that's a good thing. It is actually much less
error prone, and much easier to use.

The copied files all "know" where they came from, history is
preserved, and you can merge back and forth pretty easily between the
subtrees.
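
To make the "copy" idea concrete, here is a toy model in Python. It is
not how SVN is implemented, and none of the names in it (ToyRepo, Node,
origin_chain) come from SVN; they are made up for illustration. The only
point is that a branch is a copy inside the repository that remembers its
source path and revision, so history can be followed back across the copy.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """One path in one revision: content plus an optional copy source."""
    content: str
    copied_from: Optional[Tuple[str, int]] = None  # (source path, source revision)


@dataclass
class ToyRepo:
    """Toy model only: every revision is a dict mapping path -> Node."""
    revisions: List[Dict[str, Node]] = field(default_factory=lambda: [{}])

    def commit(self, changes: Dict[str, str]) -> int:
        """Record new content for some paths; returns the new revision number."""
        tree = dict(self.revisions[-1])  # shallow copy, unchanged nodes are shared
        for path, content in changes.items():
            prev = tree.get(path)
            # Keep the copy source so later edits do not lose the ancestry.
            tree[path] = Node(content, prev.copied_from if prev else None)
        self.revisions.append(tree)
        return len(self.revisions) - 1

    def copy(self, src: str, dst: str) -> int:
        """Branch or tag: make dst a copy of src that remembers where it came from.

        A real system would copy a single directory reference; this toy
        loops over the paths for clarity.
        """
        head = len(self.revisions) - 1
        tree = dict(self.revisions[head])
        for path, node in self.revisions[head].items():
            if path.startswith(src + "/"):
                tree[dst + path[len(src):]] = Node(node.content, (path, head))
        self.revisions.append(tree)
        return head + 1

    def origin_chain(self, path: str, rev: int) -> List[Tuple[str, int]]:
        """Follow copy sources back from (path, rev) to the original file."""
        chain = [(path, rev)]
        node = self.revisions[rev][path]
        while node.copied_from is not None:
            path, rev = node.copied_from
            chain.append((path, rev))
            node = self.revisions[rev][path]
        return chain


repo = ToyRepo()
repo.commit({"trunk/src/bin/ls/ls.c": "v1"})
repo.copy("trunk", "branch/NETBSD-7")
fix = repo.commit({"branch/NETBSD-7/src/bin/ls/ls.c": "v1 plus a fix"})
print(repo.origin_chain("branch/NETBSD-7/src/bin/ls/ls.c", fix))
```

The branch file can change without touching the trunk file, and the
chain printed at the end still leads from the branch back to the trunk
path it was copied from.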

Sure, this might seem "bizarre" to you, just as a Windows head might
find it "bizarre" that Unix doesn't need to shove .EXE into the name
of executable files, but Unix has no need for such fluff, and SVN has
no need for more complicated mechanisms for dealing with branches and
tags.

The manual explains it in great detail. I've used it and have never
found it confusing.

>
>  > > and has in the past earned a bad reputation for reliability.
>  > 
>  > I don't think that's true.
>
> That reputation is real and it earned it.

No, it is not. I don't know how to say this politely, so I'll say it
frankly. You've clearly never used SVN -- your questions reveal
that. You have no basis on which to make this claim.

I've used SVN in very high reliability corporate environments --
things like managing all the trading code for a hedge fund. I've never
seen data loss, I've never gotten a complaint about data loss. I've
read the SVN mailing lists for years. I've never seen anyone complain
about data loss. Your claims about data loss are false. It doesn't
happen.

> The question is whether it's still relevant, both as a matter of
> whether the problems have been fixed

There has never been a release version of SVN where I'm aware of data
loss problems.

There *have* been problems with BDB repositories getting jammed up
which prevents people from doing commits and checkouts and requires
that you run a recovery program to play the log, but no one I'm aware
of has lost data, and no one uses BDB any more anyway.

As I said, and as I will repeat, SVN also permits (and even
encourages) that you dump the database into a backup format it
generates -- a backup format that is pure ASCII, human readable, and
hand editable. You can do that 20 times a day, without significant
performance hit, if you feel that will make you more secure. If you
had ever administered an SVN repo, you would know about this.

>  > Based on your comments, I think you haven't actually administered an
>  > SVN system in real use. I've run a couple, with dozens of developers,
>  > and they're fine.
>
> No.[...]

Then I suggest you try it. You have no idea what it will actually be
like. It is unreasonable of you to judge without actually using it.

>  > > Also it doesn't really offer a whole lot over CVS.
>  > 
>  > It offers a great deal. Commits are atomic across the entire
>  > system. Branches and tags are almost instant. It is possible to move
>  > files while retaining history.
>
> But every system other than CVS offers these features.

Certainly. No one denied that. That's very different from saying it
offers nothing over CVS.

> (Also, according to my notes, rename is implemented as delete and
> add, so while tree history is maintained, file history doesn't cross
> the rename, so that's only sort of "retaining history".)

No, that's not true.

Rename is "copy and delete", and copies retain history. (Given the
architecture, "copy and delete" is the sane way to do things.)

Again, you should stop discussing SVN if you've never used it and have
no idea how it works. It is utterly unreasonable to pass judgment when
a fact as basic as whether renames preserve history is something you
don't know.

>  > > git doesn't handle subtrees, uses hash codes instead of version
>  > > numbers,
>  > 
>  > The hash codes are not the same thing as version numbers in other
>  > systems. [...]
>
> Yes, they are. They play the same semantic role that version numbers
> do in the non-distributed systems.

No, not really.

> Would you please stop assuming I have no idea what I'm talking about?

You keep making it clear you haven't used SVN or git, and you keep
making mistakes when describing their properties.

> I've been using Mercurial for a number of projects for some time,

Mercurial is not SVN and is not git.

> They are required for a distributed version control system, yes.
> However, we don't need a distributed version control system for
> NetBSD.

I'm not sure whether we do or don't. My own favorite is SVN, which is
not distributed, but my friends who use distributed VCSes make very
strong arguments for their benefits, and the small amount of work
I've done with distributed systems makes me think they may in fact be
correct.

Like it or not, we don't trust everyone with a good idea to commit to
the main repository, and a distributed VCS makes it easy for people to
work on large projects without our having to trust them up front. Like
it or not, we don't all have the ability to get to the main repository
at all times, and it is exceptionally convenient to be able to commit
intermediate stages of your work. It is nice to have cheap totally
private branches and then have them vanish into a single public
commit.

I'd like to see the distributed experiment tried, if only so that we
know what we're missing before we say, in advance, that it is
worthless.

> Meanwhile, they create a number of usability problems, some of
> which have already been touched on: one can't remember them,

One usually doesn't *need* to remember them.

> they are a hassle to type in when you don't have cut and paste,

That's false. Git, for example, accepts any unique prefix of the hash.
You can usually just type five characters and it will deal fine. If you
had used git (I have, though not extensively), you would know this.
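
The abbreviation lookup described above amounts to matching on a unique
leading fragment of the hash. A minimal sketch in Python, assuming the
set of known IDs is just a list of hex strings: resolve_prefix is a
made-up helper, not git's code, and the IDs in the usage example are
invented.

```python
from typing import Iterable


def resolve_prefix(prefix: str, object_ids: Iterable[str]) -> str:
    """Return the single ID that starts with prefix, or fail loudly."""
    matches = [oid for oid in object_ids if oid.startswith(prefix)]
    if not matches:
        raise KeyError(f"no object matches {prefix!r}")
    if len(matches) > 1:
        raise KeyError(f"prefix {prefix!r} is ambiguous ({len(matches)} matches)")
    return matches[0]


# Invented IDs, shortened for readability.
known = ["3f2a9c71d0", "3f2b004e1a", "a81c55f09b"]
print(resolve_prefix("a8", known))    # only one ID starts with "a8"
print(resolve_prefix("3f2a", known))  # "3f2" alone would be ambiguous
```

The more objects a repository holds, the longer the prefix has to be
before it is unique, but a handful of characters still goes a long way.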

> they're a (smaller) hassle to paste in even when you do, you can't
> determine ordering without digging in the SCM database,

But people generally don't use the hash numbers the way they seem to
use version numbers, and again, if we really wanted "56" "57" "58"
exposed so people could refer to versions on the head that way, the
git plumbing would make that possible.

> There are also some technical considerations, like tying yourself to a
> particular hash function,

That's not a big deal. You could (in theory) upgrade a giant repo to a
new hash function pretty quickly. It would require anyone regularly
updating to do the upgrade as well, of course.

> or what happens if a collision occurs.

You would know *very very fast* because operations would fail loudly.

In any case, by the birthday bound we can expect a collision only after
roughly 2^80 commits, since SHA1 is a 160 bit hash function. If you do
100 commits per second, you can expect a collision with 50% probability
after about 5e14 years, approximately 40,000 times the lifetime to date
of the universe.
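
For anyone who wants to check the arithmetic, here is a small Python
sketch using the standard birthday approximation
n = sqrt(2 * N * ln(1 / (1 - p))) with N = 2^160. The age of the universe
(about 1.38e10 years) is a figure I am assuming here, and the calculation
treats SHA1 as an ideal random function and counts only commits.

```python
import math

HASH_BITS = 160                     # SHA1 output size
COMMITS_PER_SECOND = 100            # rate used in the text
SECONDS_PER_YEAR = 365.25 * 24 * 3600
UNIVERSE_AGE_YEARS = 1.38e10        # assumed figure, not from the thread


def commits_for_collision_probability(p: float, bits: int = HASH_BITS) -> float:
    """Birthday approximation: hashes needed to see a collision with probability p."""
    return math.sqrt(2.0 * (2.0 ** bits) * math.log(1.0 / (1.0 - p)))


commits = commits_for_collision_probability(0.5)
years = commits / COMMITS_PER_SECOND / SECONDS_PER_YEAR

print(f"commits for a 50% chance: {commits:.3e} (2^80 is {2.0 ** 80:.3e})")
print(f"years at {COMMITS_PER_SECOND} commits/s: {years:.2e}")
print(f"multiples of the universe's age: {years / UNIVERSE_AGE_YEARS:,.0f}")
```

This comes out a little under the rounded figures quoted above, but in
the same order of magnitude, which is all the argument needs.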

> If you need a distributed SCM system, these costs are worth paying;
> but if you don't, and we don't,

That's not clear. There is a reason, I think, that all new VCSes are
distributed, and it isn't just because it is trendy. People find
they're really very convenient, and that they make programmers more
productive.

> As for an integrity check - they're hardly the only way to do that,
> and arguably not a good way either.

They seem like a fine way to do it, and they're built in. I see no
arguments against them.

>  > > can't check out a tree without cloning the full history,
>  > 
>  > Actually, you can check out a tree without cloning the full
>  > history.
>
> Can you?

Yes, you can.

>  > > and has kind of a messy install with tons of executables.
>  > 
>  > "Messy?" -- that's a totally unreasonable objection.
>
> It is not. It's a real problem.

This is silly.

For those not in the know, git provides two kinds of ways of typing in
commands.

You can type, for any command:

git-foo

or

git foo

If it bugs you that all the git-foo commands are lying about, then
shove them in a subdir in /usr/libexec and you'll never see them. Again,
if you don't object to postfix for having 26 installed executables,
then why is this a problem?

> *None* of the problems any of the systems have are non-negotiable.  We
> can always make our mind up to import perl into base, or cope with
> /usr/src taking a couple gigabytes, or do without being able to diff
> between release trees,

Which system would prevent us from being able to diff release trees?
I'm unaware of *any* VCS with that property.

BTW, our /usr/src takes 1.5G right now. Just the sources, not
including anything else. If you are angry about having a couple of gig
in /usr/src, the problem is not the VCS.

> However, making such decisions requires an informed analysis of the
> tradeoffs,

I've often seen this sort of discussion in bureaucracies. It is
usually a way of deliberately killing a project.

"Naturally, we would like to do X, but we need to have properly
studied the matter in a formal way before making a decision."

And who can argue with that sort of thing? You sound like a jerk for
suggesting that things be done informally or "off the cuff". Wouldn't
any sane person prefer "formal process"?

However, the truth is, decisions like this are rarely well made by
producing long requirements documents and having extensive meetings
and formal process. (That's the way you end up deciding that the only
real solution to your OS problem is Microsoft Windows because it has
all the check boxes in place.)

The best way to make a decision like this is by having people try out
several solutions and decide, based on quality and usability, what
seems best. The committee bureaucracy route is a near universal
mistake, unless your goal is to nip an idea in the bud.

>  > If you worry, just throw all the executables into a directory under
>  > /usr/libexec or something. The only thing you need in the path is the
>  > front end "git" executable.
>
> Is that true now, finally? If so that would be a significant
> improvement.

If it isn't true (I didn't check), the changes required to make it
work are maybe ten lines of code. After all, execing files from a
given directory is a well understood technology. This isn't a rational
basis on which to make a decision. If having too many executables in
the path is a serious problem for people, that can be fixed.
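
To show how little is involved, here is a rough sketch in Python of a
front end that execs helpers out of a private directory. This is not
git's actual code; the function names and the /usr/libexec/git directory
are assumptions for illustration only.

```python
import os
from typing import List


def subcommand_path(libexec_dir: str, name: str) -> str:
    """Map `git foo` to the helper file <libexec_dir>/git-foo."""
    return os.path.join(libexec_dir, f"git-{name}")


def dispatch(argv: List[str], libexec_dir: str) -> None:
    """Exec the helper for argv[1], passing the remaining arguments through."""
    if len(argv) < 2:
        raise SystemExit("usage: git <command> [args...]")
    helper = subcommand_path(libexec_dir, argv[1])
    if not os.access(helper, os.X_OK):
        raise SystemExit(f"git: '{argv[1]}' is not a git command")
    # On success this replaces the current process and never returns.
    os.execv(helper, [helper] + argv[2:])


# A front end would call something like (hypothetical directory):
#     dispatch(sys.argv, "/usr/libexec/git")
```

Only the front end needs to be in the path; everything else lives in
whatever directory is passed to dispatch().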

>  > I don't know enough about Mercurial, or the others, but based on the
>  > dismissal out of hand of quite reasonable systems I don't think I
>  > trust the dismissal of the others or DARCS either. I want to see the
>  > things installed and try out what it is like to work with them.
>
> darcs would require importing ghc into base. That is a complete
> nonstarter.

Why? If it provided very serious benefits, perhaps it would be worth
it. I, too, doubt it would be the winner, but why are we making
decisions like this in advance of trying? Maybe no one wants darcs,
but if someone wants to champion it, goes through the trouble of
importing NetBSD's CVS into a darcs repository and makes it work, the
least we can do is hear them out.

> In the meantime, please quit bikeshedding.

BIKESHEDDING?

Bikeshedding is when you say "I won't accept that solution because the
program arguments offend me, and I hate the font" and the moral
equivalent -- you're the one doing that, by saying things like "git
offends me because it has too many executables".

I'm suggesting we look at everything, without prejudice and without
saying in advance that we insist everything work the way "we're used
to". That's the opposite of bikeshedding, if anything.

I've proposed a pretty simple way to deal with this.

1) Everyone who likes a particular VCS sets up a demonstration,
   including demonstrating that they can import the repo successfully.
2) We try them out and get a feel for what we like.

Who besides you opposes this?

-- 
Perry E. Metzger                perry%piermont.com@localhost