Re: Slightly off topic, question about git

To: tech-kern%netbsd.org@localhost
Subject: Re: Slightly off topic, question about git
From: Mouse <mouse%Rodents-Montreal.ORG@localhost>
Date: Mon, 6 Jun 2022 08:23:34 -0400 (EDT)

> [...], I have a question about git, [...]

I'm not an _expert_ on git, but I have been using it for close on a
decade now and have developed a certain amount of expertise.

> 1.  In CVS, I can do something like:
> cvs log sys/dev/pci/if_bge.c
> and be given a complete history of the changes to that file, as well
> as a list of all the branches that file participates in and which
> versions apply to each branch.

git log -- sys/dev/pci/if_bge.c

> And, I can do this without having to download all of the history of
> that file onto my local storage.

That, you will not find with git.  git does, somewhat, support what is
called a shallow clone of a repo, but a shallow clone limits how far
back the history goes, not which portions of the tree it covers.  I'm
not aware of any way to do the latter.

> It seems like the only way to do this with a git repository is to
> download the entire source tree, along with its history and branches,
> using git clone with an infinite depth.  Is this correct?

Close.

What you want here is not well-supported by git; it is antithetical to
what, as I understand it, is one of the underlying tenets of git: its
distributed nature.  (See below for a little more on this.)

I would say that the best way to set something like that up with a DVCS
would be to provide ssh logins on a central repo-holding machine; if
you want to lock it down further, restrict what those logins can run.

> 2.  Also, in my exploration of git, it seems like the git log command
> shows all the commits for each tag, rather than the comments for a
> specific file or object in the repository.  Again, is this correct?

Well, I'm not sure what you mean here by "all the commits for each
tag".  In git, a tag is attached to a single commit (which can affect
multiple files, but it's still a single commit).  That is, "all the
commits for [a] tag" is always a set of size one (or size zero, if no
tag with that name exists).

I'm guessing here, but my guess is that you are coming from a CVS
mindset, in which a changeset affecting multiple files is considered
one commit per file.  That's not how git works.  In git, a commit
consists, conceptually, of a tree (a packaging-up of all the file
contents and directory structure) plus some overhead, such as a commit
message, author name, and a few other small things.  There is
cleverness under the hood to optimize away, in most cases, most of the
storage that this appears to imply, but that's the concept.
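
As a rough illustration of that concept, here is a toy model in Python.
It is a sketch only, not git's actual object format: the names store,
put, put_tree and put_commit are made up for this example, the tree is
flat rather than nested per directory, and the hashing is simplified.
It shows a commit as "tree plus metadata", and why a second commit that
changes one file does not store the unchanged file again.

```python
import hashlib

# Toy stand-in for an object database: object id -> stored bytes.
store = {}


def put(kind, payload):
    """Store payload under a hash of its kind and content; return the id."""
    data = kind.encode() + b"\0" + payload
    oid = hashlib.sha1(data).hexdigest()
    store[oid] = data  # identical content always lands on the same key
    return oid


def put_tree(files):
    """files maps path -> bytes. Returns the id of one flat tree object."""
    lines = []
    for path in sorted(files):
        blob_id = put("blob", files[path])
        lines.append(f"{blob_id} {path}")
    return put("tree", "\n".join(lines).encode())


def put_commit(tree_id, parent_id, author, message):
    """A commit is a pointer to a tree plus a little metadata."""
    header = [f"tree {tree_id}"]
    if parent_id is not None:
        header.append(f"parent {parent_id}")
    header.append(f"author {author}")
    return put("commit", ("\n".join(header) + "\n\n" + message).encode())


# Two snapshots of a two-file tree; only one file differs between them.
v1 = {"sys/dev/pci/if_bge.c": b"old driver\n", "README": b"hello\n"}
v2 = {**v1, "sys/dev/pci/if_bge.c": b"new driver\n"}

c1 = put_commit(put_tree(v1), None, "A. Developer", "import")
n_after_first = len(store)   # 2 blobs + 1 tree + 1 commit
c2 = put_commit(put_tree(v2), c1, "A. Developer", "fix if_bge")
n_after_second = len(store)  # adds 1 blob + 1 tree + 1 commit; README is reused
```

Each commit still describes the whole tree, but the README content is
stored once because both trees refer to it by the same id.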

As for restricting git log output to a single file or directory
subtree, you can do that with something like

git log tagname -- file file file...
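
If you want to script this, a minimal sketch in Python might look like
the following.  The helper names git_log_argv and git_log are made up
for this example; the only git features relied on are "git log", an
optional revision such as a tag name, the "--" separator, and the
--oneline flag.

```python
import subprocess


def git_log_argv(paths, rev=None, extra=()):
    """Build the argument list for: git log [extra] [rev] -- path..."""
    argv = ["git", "log", *extra]
    if rev is not None:
        argv.append(rev)  # a tag, branch, or other revision to start from
    argv.append("--")     # everything after this is a path, not a revision
    argv.extend(paths)
    return argv


def git_log(repo_dir, paths, rev=None):
    """Run git log in repo_dir (needs git installed and a real repo)."""
    result = subprocess.run(
        git_log_argv(paths, rev, extra=["--oneline"]),
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# The command line from the text above, as an argument list.
example = git_log_argv(["sys/dev/pci/if_bge.c"], rev="tagname")
```

The "--" matters when a path could also be read as a revision name; with
it in place, git treats everything that follows as paths.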

> If I am correct in my guesses about how git works, it seems like I
> would have to download the entire history of the NetBSD source tree
> if I want to browse its branches, or the commit history for any given
> file.

Close, yes.

> This is a lot of overhead to examine tiny portions of the tree,
> relatively speaking, assuming we move to git for our version control
> system.

It is.  That's why there are various tools out there that let you look
at only part of a tree, kind of like cvsweb.  I've written one myself,
which uses puffs to present a filesystem view of a git repo.  You can
find a live example of it in my anonymous FTP space (also available
over HTTP), ftp.rodents-montreal.org:/mouse/git-unpacked; this includes
the history of my semi-private forks of three NetBSD versions, which
admittedly is far less than full NetBSD history.  (The version in my
FTP space also includes a lot of other repos; the NetBSD ones are under
Mouse/netbsd-fork/.)

> It strikes me that requiring this much storage space from developers,
> would be a regression from what we currently do.

Yes, it would be.  Personally, I think the benefits it brings would be
worth it.

I have access to a copy, on a work machine, of the Linux kernel git
repo as of sometime on 2020-10-15.  I don't know how it would compare
to a repo with full NetBSD history, but it's the closest thing I have
access to.  The checked-out tree size is close to that for NetBSD 5.2
/usr/src (based on du -s output: Linux kernel, 1149672k; NetBSD src,
947992k).

The .git directory, holding all the overhead, is 1800764k.  (That's for
the Linux repo; for my NetBSD fork, 214196k, but I have comparatively
few commits - I didn't import full NetBSD history, instead just
starting from NetBSD 5.2 source as released.  The size of the overhead
is, in most cases, more dependent on the size of the commit tree than
on the size of the checked-out tree - though that's true only when the
tree is mostly changes to existing files; if you're constantly
introducing new files, it becomes less so.)
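
To make the same kind of comparison on another repo, something like the
following Python sketch would do.  The function name tree_sizes is made
up for this example, and it sums apparent file sizes (st_size), so its
numbers will not exactly match du -s, which reports allocated blocks.
The demo at the end builds a throwaway directory rather than a real
repo.

```python
import os
import tempfile


def tree_sizes(root):
    """Return (worktree_bytes, gitdir_bytes) for the tree under root.

    Sizes are apparent sizes from lstat(), split by whether the file
    lives under root/.git or elsewhere.
    """
    git_dir = os.path.join(root, ".git")
    work = meta = 0
    for dirpath, dirnames, filenames in os.walk(root):
        in_git = dirpath == git_dir or dirpath.startswith(git_dir + os.sep)
        for name in filenames:
            size = os.lstat(os.path.join(dirpath, name)).st_size
            if in_git:
                meta += size
            else:
                work += size
    return work, meta


# Demo on a fake tree: 1000 bytes of "source", 250 bytes under .git.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, ".git", "objects"))
    with open(os.path.join(root, "a.c"), "wb") as f:
        f.write(b"x" * 1000)
    with open(os.path.join(root, ".git", "objects", "pack"), "wb") as f:
        f.write(b"y" * 250)
    demo_work, demo_meta = tree_sizes(root)
```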

Personally, not even I, retrocomputing geek that I am, find two gigs of
overhead onerous for the benefits it brings.  Significantly more
onerous is that git really really wants you to have enough RAM to keep
stat() results for the whole working tree in core; various common
operations become painfully slow if that's not true.  On my smallest
machines this means that making commits can be a multi-minute process,
but only because I insist on self-hosting.
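
For a feel of how much stat work that is on a given machine, here is a
rough Python sketch.  It is not what git itself does internally; it
just walks a working tree, skips .git, and calls lstat() on every file,
timing the pass.  The function name stat_working_tree is made up for
this example.  Running it twice on a real tree and comparing the cold
and warm timings gives a hint of whether the stat data stays cached.

```python
import os
import tempfile
import time


def stat_working_tree(root):
    """lstat() every file under root except .git; return (count, seconds)."""
    count = 0
    start = time.perf_counter()
    for dirpath, dirnames, filenames in os.walk(root):
        if ".git" in dirnames:
            dirnames.remove(".git")  # prune: do not descend into the repo data
        for name in filenames:
            os.lstat(os.path.join(dirpath, name))
            count += 1
    return count, time.perf_counter() - start


# Demo on a tiny fake tree: two working files plus one file under .git.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sys", "dev"))
    os.makedirs(os.path.join(root, ".git"))
    for rel in ("README",
                os.path.join("sys", "dev", "if_bge.c"),
                os.path.join(".git", "HEAD")):
        with open(os.path.join(root, rel), "w") as f:
            f.write("x\n")
    demo_count, demo_seconds = stat_working_tree(root)
```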

> Since I think we're smarter than that and since we have very smart
> people on our development team, I want to understand what it is that
> I don't get about git that precludes me from having to download the
> entire history of the source tree from day one while still retaining
> access to that history over time.

In two words, "design philosophy".  git was/is designed around a
distributed model, one in which there does not have to be any central
master repo (though it certainly can be, and often is, used that way).

Is that better than other designs?  For a lot of purposes, it is; for a
bunch more purposes, it's an entirely tolerable price.  For yet others,
of course, it's not, and if NetBSD decides it's one of that last class,
it would be stupid for NetBSD to switch to git.  In my own opinion, for
NetBSD, it would be one of the other two classes (which class depends
on the use case in question).

/~\ The ASCII				  Mouse
\ / Ribbon Campaign
 X  Against HTML		mouse%rodents-montreal.org@localhost
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
