tech-pkg: Re: SoC Part I: pbulk

Subject: Re: SoC Part I: pbulk
To: Joerg Sonnenberger <joerg@britannica.bec.de>
From: Hubert Feyrer <hubert@feyrer.de>
List: tech-pkg
Date: 05/16/2007 20:44:32

Hi,

On Wed, 16 May 2007, Joerg Sonnenberger wrote:
> attached is a summary of the parallel bulk build system. Feel free to
> ask for clarifications or enhancements.

The proposal doesn't outline the design goals and constraints, so it's a
bit tedious to digest and leaves bits to guesswork. A "1000 miles above"
view (well, maybe only 1 mile ;) would have been nice as an intro.

Anyways, a few comments:


> The parallel bulk build system
> ==============================
> 
> Overview
> --------
> 
> For pbulk, three different phases are run. The phases are
> tree-scanning/prebuild, build and post-build.
> 
> The pbulk system is modular and allows customisation of each phase. This
> is used to handle full vs. limited bulk builds, but also to handle
> environmental differences for parallel builds.

Will/can any of the existing code be reused?


> Tree-scanning and prebuild phase
> --------------------------------
> 
> The heart of the tree-scanning code consists of the pbulk-index and
> pbulk-index-item make targets. For full bulk builds, a list of all
> directories is compiled and the pbulk-index target called in each.
> Optional parallel scanning can be done using a client/master mode where
> the output is forwarded to the master over a socket.
> 
> The entries are sorted by global scanning order, aka SUBDIR list in the
> main Makefile and the category SUBDIRs. Duplicate entries for a PKGNAME
> are ignored. The output of the build specifies the variables used by
> pbulk for building, filtering the upload and creating the reports.
> 
> After all packages and dependencies have been extracted, the global
> dependency tree is built by resolving the listed dependencies. Packages
> with missing dependencies are marked as broken. The directories in the

Why would a package have missing dependencies?
(Guesswork: is this to work around broken/inconsistent pkgsrc, or does one 
have to list all the dependencies for a partial build?)
(What problem are you trying to solve here?)
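To make one possible reading of the resolution step concrete, here is a
minimal Python sketch. Everything in it is an assumption for illustration:
the index layout (PKGNAME mapped to a list of dependency patterns), the
function name resolve, the use of plain glob patterns only (version
comparisons such as "foo>=1.0" are ignored), and the rule that a package
depending on a broken package is itself broken.

```python
from fnmatch import fnmatchcase


def resolve(index):
    """Resolve dependency patterns against the scanned package list.

    index: dict mapping PKGNAME -> list of glob patterns, in scanning order
           (hypothetical layout, not the actual pbulk-index output).
    Returns (tree, broken): tree maps each package to the packages chosen
    for its dependencies, broken is the set of packages that cannot be built.
    """
    tree = {}
    broken = set()
    for pkg, patterns in index.items():
        deps = []
        for pattern in patterns:
            # Take the first match in scanning order; later duplicates lose.
            match = next((p for p in index if fnmatchcase(p, pattern)), None)
            if match is None:
                broken.add(pkg)  # nothing in the scan satisfies this pattern
            else:
                deps.append(match)
        tree[pkg] = deps

    # Assumption: brokenness is inherited, so repeat until nothing changes.
    changed = True
    while changed:
        changed = False
        for pkg, deps in tree.items():
            if pkg not in broken and any(d in broken for d in deps):
                broken.add(pkg)
                changed = True
    return tree, broken
```

With such a scheme a dependency would only be "missing" if the scan never
produced a package matching the pattern, which is exactly the case the
questions above are asking about.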


> _ALL_DEPENDS are used as hints, but can be overridden.
> 
> For partial builds, two different mechanisms could be used. I'm not sure
> which is the better. For both a list of directories to scan is given.
> For the first idea, pbulk-index is called that gives all possible
> packages to create. Those are filtered by a pattern. The second approach
> is to list the options directly and call pbulk-index-item instead.

(pbulk-index?)

What is that filtering pattern - an addition to the list of directories in 
pkgsrc for pkgs to build, defined by the pkg builder?



> Dependencies are resolved for partial builds as well, but missing
> dependencies are searched by calling pbulk-index in the given
> directories. Those that fulfill the patterns are added to the list and
> the process is repeated.

I'm not 100% sure which dependencies you mean here - if it's in pkgsrc it
was either already built and is available as a binary pkg that can be
pkg_add'ed, or it can be built. What is that pattern for, and is it
something different from the one mentioned above?
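The "scan, filter, repeat" loop as I understand it could be sketched
roughly like this in Python. This is only a guess at the intended
behaviour: scan_dir is a hypothetical stand-in for running the pbulk-index
target in one directory, and the (pattern, directory) pairs for
dependencies are an assumed data layout, not something the proposal
specifies.

```python
from fnmatch import fnmatchcase


def expand(initial_dirs, scan_dir, keep="*"):
    """Collect the packages for a partial build, pulling in dependencies.

    initial_dirs: directories the builder asked for.
    scan_dir:     callable, directory -> {PKGNAME: [(pattern, directory)]}
                  (hypothetical wrapper around the pbulk-index target).
    keep:         glob pattern filtering packages from the initial directories.
    """
    selected = {}   # PKGNAME -> list of (pattern, directory) dependencies
    scanned = {}    # cache so each directory is scanned only once
    pending = [(d, keep) for d in initial_dirs]
    while pending:
        directory, pattern = pending.pop()
        if directory not in scanned:
            scanned[directory] = scan_dir(directory)
        for pkg, deps in scanned[directory].items():
            if pkg in selected or not fnmatchcase(pkg, pattern):
                continue
            selected[pkg] = deps
            # Queue the dependencies; the loop repeats until none are new.
            pending.extend((dep_dir, dep_pat) for dep_pat, dep_dir in deps)
    return selected
```

In this reading the pattern used for the initial filter and the patterns
used for dependencies are the same kind of thing, applied at different
points, which may or may not be what is meant.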


> In preparation for the tree-scanning, ${PREFIX} will be removed and
> recreated from the bootstrap kit. Alternatively, pkg_delete -r \* followed
> by checking for unlisted files and directories could be used, which is a
> lot slower though. The bootstrap kit is preferred for the build phase
> anyway.

Nuking $PREFIX is fine & fast, but please also consider update builds, e.g.
from a pkgsrc stable branch. No need to rebuild everything there (which was
about THE design criterion for the current bulk build code :).

BTW, do you take any precaution to make sure the build is done on a
"clean" system, e.g. one with a freshly installed $OS release in a chroot?

Also: will the bootstrap kit be mandatory on NetBSD systems, too?
NetBSD should have all this in base, and while the kit is probably fast to
build compared to the bulk build's time, for a selective build it seems
like overkill to me.


> Build phase
> -----------
> 
> Based on the dependency tree, the master process creates an internal job
> list and hands out the information from the tree-scanning (e.g. which
> parameters to set) to clients on request. The client is pulling for jobs
> and reports back when the build is done (or failed).
> 
> For normal bulk builds, the ${PREFIX} will be removed before the build,
> all dependencies added via pkg_add and the package removed after it was
> successfully packaged. Depending on the environment, pkg_add can use FTP,
> or PACKAGES can be resynced beforehand.

How does this FTP/rsyncing before (yumm) play with the distributed 
environment (NFS) mentioned in answers to the proposal? Or is this for a 
setup with no common file system? (guessing)
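For the job list itself, this is roughly how I picture the master side, as
a small Python sketch. The class and method names are made up, the network
part (clients polling over a socket) is left out entirely, and failing all
dependents of a failed package is my assumption; the proposal only says
that clients report "done (or failed)".

```python
class Master:
    """Hands out packages whose dependencies have all been built."""

    def __init__(self, tree):
        # tree: PKGNAME -> list of PKGNAMEs it depends on; every dependency
        # is assumed to be a key of tree (i.e. already resolved).
        self.deps = {pkg: set(deps) for pkg, deps in tree.items()}
        self.state = {pkg: "waiting" for pkg in tree}

    def next_job(self):
        """Called when a client asks for work; None means nothing is ready."""
        for pkg, deps in self.deps.items():
            if self.state[pkg] == "waiting" and all(
                self.state[d] == "done" for d in deps
            ):
                self.state[pkg] = "building"
                return pkg
        return None

    def report(self, pkg, ok):
        """Called when a client reports the result of a build."""
        self.state[pkg] = "done" if ok else "failed"
        # Assumption: anything depending on a failed package fails as well.
        changed = not ok
        while changed:
            changed = False
            for other, deps in self.deps.items():
                if self.state[other] == "waiting" and any(
                    self.state[d] == "failed" for d in deps
                ):
                    self.state[other] = "failed"
                    changed = True
```

Since clients only pull jobs, nothing here cares whether two clients run on
two hosts or on two CPUs of the same box, as long as each has its own
${PREFIX}.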


> Post-build phase
> ----------------
> 
> Once the build is done, four different tasks are left:
> - gather all failed packages, upload the build logs and send mail to the
> admin
> - create the pkg_summary file for all packages
> - create signatures for the index/all packages
> - upload all (unrestricted) packages
> 
> This might need human intervention.


The first and last parts of your proposal look pretty much the same as the
current system, and the "job scheduling" in the middle seems to be the
interesting part. Guessing that you already have this running, I wonder
what the reasons are for not reusing the current code. (Just curious).

BTW, did you take SMP machines into account in any way?

Please also don't forget to put "enduser documentation" on your agenda.
:-)


  - Hubert