Subject: Smarter make update / pkg_chk algo
To: None <tech-pkg@netbsd.org>
From: Martin S. Weber <Ephaeton@gmx.net>
List: tech-pkg
Date: 05/04/2005 10:11:29
Hoi pkg-wizards,

Warning: This begins with a rant/introduction. Scroll to the figures (2+) 
if you want to read the suggestion without the rationale.

pkgsrc

pkgsrc is a great system to build software from source (and obviously also
for managing binary packages). Given that all third party software available
and easily installable to NetBSD is packaged in pkgsrc (or -wip), it is the
first address for users who want to install third party software.

There does exist a "stable" branch, which only gets the most important fixes
and updates (i.e. the "Q" branches - picard would love that), but given that
release cycles of different software projects vary greatly and users often
want and/or need specific version of software, they have to track 
p-c(pkgsrc-current).

pkgsrc-current

These users track p-c, most often not to get the latest and greatest fixes, 
improvements to the infrastructure or new make targets; usually they are
"only" interested in new versions of specific packages (usually only a
couple, or maybe even only one). To install these newer packages, the users
have in general two options: Getting a fresh binary package from somewhere
and install that, or build from source. The former usually includes (for
p-c at least) building these packages on one's own, so they end up with
"having" to build from source. 

(Note: I leave pkg_comp out of this rant, because it makes the former
option available without assaulting the running system. Also note that
I leave "make replace" out because of havoc awaiting on the horizon with 
libraries and ABIs[1], [2])

So the users decide they want to build from source, and as pkgsrc is a great
environment to do that, so be it.

Pitfalls

The main pitfall usually awaiting someone tracking p-c & building from
source is that a package doesn't build, and "make update" breaks somewhere
in the middle. You're left with a system which lacks packages deinstalled
during the progress of the upgrade, have some fresh ones installed, and
in general don't have a way to get the ones back you've lost. This is where
pkg_chk comes into play; As one can create a config file with packages of
interest (and given you can easily create it with pkg_chk, too ...) there
is an option left so that after (reporting the bug &/ sending fixes &)
upgrading p-c there is an easy option to install the lacking packages.

pkg_chk

pkg_chk makes the process of upgrading packages a piece of cake - call the
script, lean back. While Error, Upgrade, Repeat. So having the packages one
started out with is easy (even if it might take several invocations of
pkg_chk, spanning hours, days, weeks ...), yet the time taken to upgrade is
clearly not minimal. Even if no error occurs, a common flaw of pkg_chk is that
it sequentially updates outdated packages found, and in the progress often
compiles packages more often than needed -- i.e. more than *EXACTLY* once.

but the time(1) ...

This may sound harmless, yet consider an example: I was upgrading packages
on my machine, and it was working (hard!) for several days (eep!). I've
taken especially care of two packages, firefox and gimp, in the progress.
Firefox was built four (!) times, The Gimp was built eight or nine times. The
build time for these two (firefox was out of date, so substract one build)
adds to something in the range of a day (when days are measured in int's).
Those two packages take eons of time to build, so I watched them; I bet there's
another day or so lost unnecessarily rebuilding other packages.

wtf

So why the f is this happening ? The reason is twofold: the first part is
pkgsrc's "update" target, the second pkg_chk usage pattern of it.

"make update" records the currently installed packages which require the
package in question to function properly; deinstalls these (recursively!),
builds a fresh version of the package in question, installs it and then
builds the recorded dependant packages (recursively, again).

Now this is fine if you only have one package to update (although it seems
brute to do, but considering libraries and [1, 2], I'm fine with it), the 
problems start when one is updating several packages in one dependancy tree,
and exactly this problem is what pkg_chk is facing.

Forests, paths and happy figures

+-----------+ Consider in the figure here that packages B and C are out of
|   D       | date. No matter where one starts (I'll do for now at C), make
|  /        | update updates the *whole tree* with more work than is needed.
|   C - E   | 1. Deinstall E, F, C.
|  /  \     | 2. Update C
| A     F   | 3. Rebuild E, F
|  \  /     | Next step: Repeat for B.
|   B - G   | 1. Deinstall F, G, B.
+- - - - - -+ 2. Update B
| fig 1     | 3. Rebuild F, G
+-----------+ Now imagine 'F' is 'Firefox' and you've just wasted hours doing
nothing sensible at all (it could be worse; imagine openoffice :). In reality
of course, the trees are much bigger for those packages whose multiple rebuilds
hurt. This is not yet the worst case for this tiny tree though, consider that
F needs C & B, but the (old) versions installed already suffice; newer versions
are in p-c though. Same for C -> A, C -> B. In this case (and if you do it this
order: "F, B, C, A") you build F three times too often.

I'm going to play some with the figures now. At first, I'm tagging the
nodes which need rebuilding (still assuming 'B' and 'C' are out of date).

+-----------+       +-----------+      +-----------+
|   D       |       |   D       |      |   D       |       
|  /        |       |  /        |      |  /        |       
|   C*- E   |       |   C - E   |      |   C+- E+  |
|  /  \     |       |  /  \     |      |  /  \     |
| A     F*  | "B"=> | A     F+  | "C"=>| A     F+  |
|  \  /     |       |  \  /     |      |  \  /     |
|   B*- G   |       |   B+- G+  |      |   B - G   |
+- - - - - -+       +- - - - - -+      +- - - - - -+
| fig 2     |       | fig 2."B" |      | fig 2."C" |
+-----------+       +-----------+      +-----------+
       tag "B"-dependant nodes ..."C"-dep. nodes

'*'  := out of date. Needs update.
'+'  := will be rebuilt during this update
'()' := package is not installed (below)

You can see the first example in the figures above. The problem basically
stems from the fact that only one package is considered at once; yet the
dependancy tree is global. Fine then, treat it as a global:

+-----------+       +-----------+      +-----------+
|   D       |       |   D       |      |   D       |       
|  /        |       |  /        |      |  /        |       
|   C*- E   |       |   C - E   |      |   C+- E+  |
|  /  \     |       |  /  \     |      |  /  \     |
| A     F*  | "B"=> | A     F+  | "C"=>| A     F++ |
|  \  /     |       |  \  /     |      |  \  /     |
|   B*- G   |       |   B+- G+  |      |   B+- G+  |
+- - - - - -+       +- - - - - -+      +- - - - - -+
| fig 3     |       | fig 3."B" |      | fig 3."C" |
+-----------+       +-----------+      +-----------+

The tags of the nodes in the dependancy tree just get added for all
outdated packages. Multiple rebuilds coalesce into one (i.e F++ -> F+)
and you end up with the "big picture" of what to do. If this "global"
tree is updated by the "make update" algorithm, things become easy.
Remember the tree, and prune the whole subtree of dependancies when
rebuilding a node (make update will rebuild the dependancies), but DON'T
dive into the tree afterwards, instead stay on the same level (i.e.
level-order traversal)

+-----------+       +-----------+      +-----------+
|   D       |       |   D       |      |   D       |       
|  /        |       |  /        |      |  /        |       
|   C*- E   |       |   C - E   |      |   C -(E+) |
|  /  \     |       |  /  \     |      |  /  \     |
| A*    F*  | "B"=> | A    (F+) | "C"=>| A    (F+) |
|  \  /     |       |  \  /     |      |  \  /     |
|   B*- G   |       |   B -(G+) |      |   B -(G+) |
+- - - - - -+       +- - - - - -+      +- - - - - -+
| fig 4     |       | fig 4."B" |      | fig 4."C" |
+-----------+       +-----------+      +-----------+
           rebuild "B"    ...and now "C"

After "B" and "C" are rebuilt, E, F and G still need to be built. *ONCE*.

Now ... neither pkg_chk nor pkgsrc implement this scheme. As one can rant
a lot yet without being heard without offering solutions, here's an algorithm:

olist <- Each outdated package.
dlist <- Empty list

For each p in olist, do:
	For each dependant package of p, do:
		olist <- dep pkgs
		dlist <- "p dep-pkg"
	end For
end For

For each p in reverse(olist), do:
	deinstall p
end For

For each p in tsort dlist, do:
	build p
	install p
end For

Or the same thing in (crude, I admit it :) sh(1):

# e.g.
olist="tcl glu png firefox"
dlist=`mktemp bla.XXXXXX`

while :; do
	set -- $olist
	if [ $# -gt 0 ]; then
		p=$1; shift; olist=$@
		for j in `pkg_info -qR $p`; do
			echo "$p $j" >> ${dlist}
			olist="${olist} $j"
		done
	else
		break
	fi
done

# tsort does the topological sort needed. There's no
# cyclic dependancies in pkgsrc.
ordered=`tsort ${dlist}` ; rm -f ${dlist}

# deinstall reversed($ordered), build & install $ordered.


Well then...

I think I made it obvious that the current choice of algorithm for updating
several packages at once (i.e. pkg_chk + make update) is sub-optimal, and that
one can easily plug in something more suitable to the task of upgrading
several packages at once. I didn't cook up a patch against pkg_chk (I once did,
years ago, but didn't record it in gnats :( ) because I think this is so
trivial that it might even end up in pkgsrc's .mks themselves. (top level "update"
target ?). Furthermore I don't really know how the algo should be used; one
could pkg_comp $ordered (avoiding multiple attempts to install a single pkg),
one could directly build the packages...

Comments ?

I feel this is just too easy to be true, especially given how long people complain
about that already and given how easy the fix seems to me. Can you tell me what
I've overseen ?

Regards,

-Martin

[1]: http://mail-index.netbsd.org/tech-pkg/2005/01/26/0002.html
[2]: http://mail-index.netbsd.org/tech-pkg/2004/10/10/0020.html