tech-pkg: Re: packages hierarchy and README.html

Subject: Re: packages hierarchy and README.html
To: None <tech-pkg@NetBSD.org>
From: Dan McMahill <dmcmahill@NetBSD.org>
List: tech-pkg
Date: 04/27/2005 19:28:31

On Tue, Apr 26, 2005 at 06:07:00PM -0400, Dan McMahill wrote:

[long text about README.html generation removed]

> Clearly this test is broken.  For example you may mount pkgsrc 
> read only or you may have your binary packages in some other
> location which has extra storage.

Well, I suppose the read-only pkgsrc case has other issues, like
not being able to write out the README.html files.  But the problem
is still there if you put the packages outside of pkgsrc.
 
> Now, how to make things better?  One possibility is use
> 
> pkgs=`find ${PACKAGES} -name foo-\* -type f`
> 
> where the -type f skips the soft links from the categories to 
> All.  Now for each $pkgs, use pkg_info -B to extract
> OPSYS=NetBSD
> OS_VERSION=2.0
> MACHINE_ARCH=alpha
> 
> The advantage is we can lose the MULTIARCH test which is
> broken anyway and we should always get the right operating
> system, version, and MACHINE_ARCH for each binary package.
> We get support for multiple operating systems for free.
> 
> The drawback is with 5,000 packages, you're running
> find 5,000 times and pkg_info potentially many more times.
> For example, ftp.netbsd.org has right at about 100,000
> binary packages (!).
> 
> At 0.125 seconds per pkg_info (taken from pkg_info -B
> on about 300 pkgs on a lightly loaded alpha PC164)
> this is 4 hours of pkg_info.  For a more reasonable
> 5,000 binary packages it's only 10 minutes.
 
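To make the quoted proposal concrete, the per-package extraction
might look roughly like the sketch below.  The function name
build_info_fields is made up for illustration, and the sketch
assumes pkg_info -B prints one VARIABLE=value pair per line, as in
the three lines quoted above.

```shell
#!/bin/sh
# Hypothetical helper (not part of pkgsrc): read `pkg_info -B` style
# output (VARIABLE=value lines) on stdin and print
# "OPSYS OS_VERSION MACHINE_ARCH" on a single line.
# Any other variables in the input are ignored.
build_info_fields() {
    awk -F= '
        $1 == "OPSYS"        { opsys = $2 }
        $1 == "OS_VERSION"   { osver = $2 }
        $1 == "MACHINE_ARCH" { arch  = $2 }
        END { print opsys, osver, arch }
    '
}

# Intended use, following the find command quoted above:
#   find "${PACKAGES}" -name 'foo-*' -type f | while read -r pkg; do
#       printf '%s ' "$pkg"
#       pkg_info -B "$pkg" | build_info_fields
#   done
```
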
I've played around some with a fairly simple shell script
which runs pkg_info and creates a cache file for each directory.
The script is smart enough to figure out whether it needs to
regenerate the cache file.  I've verified on my alpha
that the script runs 60x faster the second time (when it
is just checking whether the cache is still valid).  The one
case not yet handled is a binary package being removed after
it was cached.  To handle that, I should not assume that
presence in the cache means the package still exists.
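
One way the validity check, including the removed-package case,
could be written is sketched here.  This is not the actual script:
the cache file name .pkgcache and its one-line-per-package layout
are assumptions made only for this example.

```shell
#!/bin/sh
# Assumed cache layout: $dir/.pkgcache, one line per package:
#   "file OPSYS OS_VERSION MACHINE_ARCH"
# Return 0 if the cache for directory $1 can be reused, 1 if it
# must be regenerated.  Variables are global (plain POSIX sh).
cache_is_valid() {
    dir=$1
    cache=$dir/.pkgcache
    [ -f "$cache" ] || return 1

    # Stale if any regular (non-symlink) *.tgz directly in $dir is
    # newer than the cache.  "! -name . -prune" keeps find from
    # descending into subdirectories.
    newer=$(find "$dir/." ! -name . -prune -name '*.tgz' -type f \
        -newer "$cache" | head -n 1)
    [ -z "$newer" ] || return 1

    # Stale if a package listed in the cache has since been removed.
    while read -r file _rest; do
        [ -f "$dir/$file" ] || return 1
    done < "$cache"

    return 0
}
```

Treating a missing file as "cache invalid" is the simplest choice;
dropping just the stale entry would avoid a full regeneration.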

The script works by looking into each subdirectory of ${PACKAGES}.
There it looks for *.tgz files (not symbolic links) and runs
pkg_info to see whether the .tgz file is in fact a binary package.
If so, it decides that the directory is one which holds binary
packages and proceeds to run pkg_info on all *.tgz files in
that directory.
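
The directory walk described above could be sketched roughly as
follows.  The function names and the PKG_INFO override are
hypothetical and exist only so the sketch can be tried with a
stub; the real script also writes a cache file, which is left
out here.

```shell
#!/bin/sh
# PKG_INFO is an assumed override hook; it defaults to the real tool.
: "${PKG_INFO:=pkg_info}"

# Succeed if directory $1 contains at least one non-symlink *.tgz
# file that pkg_info accepts as a binary package.
holds_binary_packages() {
    for f in "$1"/*.tgz; do
        [ -f "$f" ] || continue     # unmatched glob or not a file
        [ -L "$f" ] && continue     # skip category -> All symlinks
        if $PKG_INFO -B "$f" >/dev/null 2>&1; then
            return 0
        fi
    done
    return 1
}

# For each subdirectory of $1 that holds binary packages, run
# pkg_info on every *.tgz file in it and print the path of each
# file that pkg_info accepts.
scan_packages() {
    for dir in "$1"/*; do
        [ -d "$dir" ] || continue
        holds_binary_packages "$dir" || continue
        for f in "$dir"/*.tgz; do
            [ -f "$f" ] && [ ! -L "$f" ] || continue
            if $PKG_INFO -B "$f" >/dev/null 2>&1; then
                echo "$f"
            fi
        done
    done
    return 0
}

# Intended use:  scan_packages "${PACKAGES}"
```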

I have not yet integrated this into the README.html generation
for a trial.

Comments?

-Dan
