Re: pkgsrc scanning performance benchmarks

To: Pkgsrc NetBSD <pkgsrc-users%netbsd.org@localhost>
Subject: Re: pkgsrc scanning performance benchmarks
From: Joerg Sonnenberger <joerg%bec.de@localhost>
Date: Sun, 4 Dec 2016 18:37:10 +0100

On Sun, Dec 04, 2016 at 04:57:34PM +0100, Benny Siegert wrote:
> Most variable checks do not involve any processing by make. I grepped
> the Makefile (only) for a declaration of the variable and checked if it
> was a simple „FOO=bar“ line, without metacharacters such as $. In this
> case, I used the value directly, in all other cases, I forked make.
> This simple heuristic was used in about 3/4 of all lookups and much faster.
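
For illustration, the heuristic quoted above could look roughly like the following. This is only a sketch under assumptions, not the code that was actually used: the helper names (static_value, lookup, run_make_show_var) are hypothetical, and it conservatively gives up on anything other than a single plain "FOO=bar" assignment. The fallback relies on pkgsrc's "make show-var VARNAME=..." target.

```python
import re
import subprocess


def static_value(makefile_text, var):
    """Return the value of `var` if the Makefile assigns it exactly once
    as a plain VAR=value line; return None if make has to be consulted."""
    pattern = re.compile(r"^\s*" + re.escape(var) + r"\s*([+?:!]?=)\s*(.*)$")
    matches = []
    for line in makefile_text.splitlines():
        m = pattern.match(line)
        if m:
            matches.append((m.group(1), m.group(2).strip()))
    if len(matches) != 1:
        return None  # not assigned here, or assigned more than once
    op, value = matches[0]
    # Anything that needs expansion, continuation or comment stripping
    # is left to make (assumption: being conservative is acceptable).
    if op != "=" or "$" in value or "#" in value or value.endswith("\\"):
        return None
    return value


def run_make_show_var(var, pkgdir="."):
    """Slow path: fork make and let it evaluate the variable."""
    result = subprocess.run(
        ["make", "show-var", "VARNAME=" + var],
        cwd=pkgdir, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def lookup(makefile_text, var, run_make=run_make_show_var):
    """Use the static value when it is safe, otherwise call make."""
    value = static_value(makefile_text, var)
    if value is not None:
        return value
    return run_make(var)
```

Note that this only answers single-variable questions; as the reply below points out, it does not help once the dependency list is needed.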

All nice and good, but it doesn't really matter for tree scanning. At
the very least, the dependency list is almost nowhere statically
computable, since at least checkperms and digest are conditional
dependencies. As soon as you are at the point of "I need to call make
anyway", it doesn't make a difference whether you are extracting one
variable or a hundred. I.e. the only time "make pbulk-index" is
significantly different from a non-recursive "make clean" is for
multi-packages, and in that case it should pretty much scale linearly in
the number of combinations.

> >> I do not believe the pkgsrc framework is 28 times more complex than the
> >> Ports Collection framework.  It's just much more inefficient.  I know such
> >> statements rankle some pkgsrc devs, but numbers don't lie.
> > 
> > If you compare Apples and Oranges, numbers do lie. It might surprise
> > you, but it is a well known fact that the tree scanning i.e. as part of
> > the bulk build is a very time consuming component.
> 
> It surprises no one.
> 
> Jonathan Perkin has done some work eliminating the lowest hanging fruit
> in scanning. I suspect that more gains can be had by looking carefully
> at how buildlink files are evaluated.
> 
> FWIW, scan performance improvements would be very welcome. Current scan times are bordering on ridiculous.

There are two parts here. First of all, scan performance is only a very
small part of the total build time. While any improvement here counts
roughly twice, it is only between 2% and 5% of the total time of a bulk
build. Especially on release branches the incremental scanner can avoid
most of the work most of the time as well. The second part is that it is
very easy to compute a ballpark number of where the scan time would be
in an ideal world. I.e. on my build machine, I need around 0.15s for
"make clean" or "make pbulk-index" in pkgtools/digest. It is not
entirely trivial like a meta package, but it doesn't include much
either. When we use that as the desired baseline, we arrive at a total scan
time for this hardware of 0.15s * 17355 / 16 ~= 163s. Reality out of the
box is around 39min, so more than a factor of 10 off.
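
The ballpark estimate can be reproduced with a few lines of arithmetic. The figures are the ones given above; reading the divisor 16 as the number of parallel scan jobs is an assumption, since the mail does not say what it stands for.

```python
# Figures from the paragraph above.
per_package_s = 0.15   # "make pbulk-index" in pkgtools/digest
packages = 17355       # number of package directories scanned
parallel_jobs = 16     # assumption: the "/ 16" is the scan parallelism

ideal_s = per_package_s * packages / parallel_jobs
observed_s = 39 * 60   # "around 39min" out of the box
ratio = observed_s / ideal_s

print(f"ideal scan time: {ideal_s:.0f}s")
print(f"observed / ideal: {ratio:.1f}")
```

With these inputs the observed scan comes out at roughly 14 times the ideal figure.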

The difficult part is not knowing what the ideal world should be; it
is analyzing where the time is spent and how that can be improved.
Comparing to a totally different system doesn't help with that.

> > There have been hacks proposed in the past to replace the make extraction,
> > but none of the proposals actually work properly, because they disable
> > important functional parts. This *is* a case where pkgsrc is actually
> > significantly more complex than ports.
> 
> The trick is recognizing when to use the full make invocation.

As I said before, no. At least for the purpose of bulk build scanning,
it is practically impossible to get correct results without doing the
make invocation. Implementing enough of make to emulate all the slightly
smart conditionals doesn't count. It is certainly possible to skip the
make processing for a large part of the tree if you only want to answer
the question of "What PKGNAME is found in x/y?" but that helps you
little for a bulk build.

> > Architecturally, there are three bigger parts that slow things down as
> > far as the scan phase is concerned:
> > (1) Finding the builtins and computing the resulting versions.
> 
> If this is a significant time of the scan (I have not checked), one way to fix this would be:
> 
> After bootstrapping, evaluate all the builtin.mk files (there are a
> hundred or so) once and write the resulting variables into mk.conf.

Not maintainable. There is a reason why incremental scanning is not
enabled by default and building builtin.mk files at bootstrap (which
bootstrap asks the NetBSD user?) is even more fragile.

> > The second part is done with the help of some external scripts because
> > doing it in make internally is pretty much impossible. A single
> > monolithic program would be faster than the repeated pkg_admin pmatch
> > calls, but I don't think the total time spent on this justifies the
> > cost.
> 
> Again, we need to measure first, then fix things.

Most of the existing optimisations have been measured. For others, I
have tried to get a good idea on where the time is spent, but it is not
easy. If you can provide appropriate analysis, it would certainly be
appreciated. Not all issues can be fixed and some of the potential fixes
are intrusive or require a lot of reviewing or changing lots of files,
making them potentially not worth the trouble either. But without doing
an actual analysis first, it is just talk wasting time.
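
As a starting point for such an analysis, one could time the scan target per package and look at the slowest entries first. The following is a rough sketch under assumptions: the function names are hypothetical, it only measures wall-clock time of "make pbulk-index" per directory, and it says nothing yet about where inside make the time goes.

```python
import os
import statistics
import subprocess
import sys
import time


def time_command(cmd, cwd=".", repeats=5):
    """Return the median wall-clock seconds over `repeats` runs of cmd."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, cwd=cwd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def rank_by_scan_time(pkgsrc_root, pkgpaths, target="pbulk-index"):
    """Time `make <target>` in each package directory, slowest first."""
    timings = {
        path: time_command(["make", target],
                           cwd=os.path.join(pkgsrc_root, path))
        for path in pkgpaths
    }
    return sorted(timings.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage (hypothetical): script.py /usr/pkgsrc pkgtools/digest x11/gtk3
    for path, seconds in rank_by_scan_time(sys.argv[1], sys.argv[2:]):
        print(f"{seconds:8.3f}s  {path}")
```

Comparing the slowest packages against a simple one such as pkgtools/digest gives a first hint at which included files are expensive.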

Joerg

References:
- pkgsrc scanning performance benchmarks
  - From: John Marino
- Re: pkgsrc scanning performance benchmarks
  - From: Joerg Sonnenberger
- Re: pkgsrc scanning performance benchmarks
  - From: Benny Siegert
