Subject: packages hierarchy and README.html
To: None <tech-pkg@NetBSD.org>
From: Dan McMahill <dmcmahill@NetBSD.org>
List: tech-pkg
Date: 04/26/2005 18:07:00
In investigating a PR, I've had cause to look again at the code
in bsd.pkg.mk and mk/scripts/genreadme.awk which generates the
README.html files for pkgsrc.  There are some bad assumptions
in both.  I'm soliciting opinions here and hopefully I won't
replace some broken code with some different broken code.


Currently, bsd.pkg.mk contains a test which compares
${PACKAGES} to ${PKGSRCDIR}/packages.  In slightly
simplified form, the test is:

cd ${PACKAGES}
case `pwd` in
	${PKGSRCDIR}/packages)
		MULTIARCH=no
		;;

	*)
		MULTIARCH=yes
		;;
esac

Basically this is trying to figure out whether you have a tree like

/usr/pkgsrc/packages/{All,archivers,audio,cad,...} with binary
packages living under /usr/pkgsrc/packages/All, or a tree like
what's found at ftp://ftp.netbsd.org/pub/NetBSD/packages/,
which looks like

.../packages/{1.6.2/{alpha,i386,sparc},2.0/{amd64,i386,sparc64}}

and under each of those directories you have
{All,archivers,audio,cad,...} again, with the actual binary pkgs
living under All.  So the question is whether you have something like

/usr/pkgsrc/packages/All/foo-2.0.tar.gz

or

/some/other/place/packages/2.0/vax/All/foo-2.0.tar.gz

Clearly this test is broken.  For example, you may mount pkgsrc
read-only, or you may keep your binary packages in some other
location which has extra storage.  In both of these cases you'll
end up with MULTIARCH=yes even though that's not right.
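
To see the failure concretely, here is a minimal sketch that wraps the
simplified test above in a shell function.  The function name
guess_multiarch and its two arguments are made up for illustration;
the real logic lives in bsd.pkg.mk and may differ in detail.

```sh
# Hypothetical helper mirroring the simplified test above.
# $1 = PKGSRCDIR, $2 = PACKAGES (must be an existing directory)
guess_multiarch() {
    case "$(cd "$2" && pwd)" in
        "$1"/packages)
            # PACKAGES is exactly ${PKGSRCDIR}/packages
            echo no
            ;;
        *)
            # anything else is assumed to be a multi-arch tree,
            # whatever its actual layout
            echo yes
            ;;
    esac
}
```

Called as "guess_multiarch /usr/pkgsrc /bigdisk/packages" (assuming
that directory exists), this answers yes purely because of where the
directory lives, even if it only contains All and the category
directories.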

Now in mk/scripts/genreadme.awk, an attempt is made to find
all the binary packages for a particular package.  In the
MULTIARCH=no case, this is done by doing something like

pkgs=`ls -1 /usr/pkgsrc/packages/All/foo-*`

and then for each file in $pkgs, you get a line in your README.html
listing package version and providing a link.  The machine
arch field in the output line, the operating system name, and
operating system version are determined by the machine that generated
the README.html files.
See ftp://ftp.netbsd.org/pub/NetBSD/packages/pkgsrc/devel/gmake/README.html
for an example of what these lines look like.

This is mostly OK as implemented, except that you might be
generating README.html files for some other operating system,
architecture, or version, in which case those fields are wrong.

In the MULTIARCH=yes case, the list of packages is found with
something like

pkgs=`ls -1 /some/other/place/packages/[0-9].*/*/All/foo-*`

The "[0-9].*" is there because at
ftp://ftp.netbsd.org/pub/NetBSD/packages/ there are some
other things besides OS version directories.

Then genreadme.awk extracts the operating system version and the
machine architecture from the path to the binary package
file.  The operating system name is taken from the system
which generates the README.html file.
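
As a rough illustration of that path parsing (genreadme.awk does this
in awk; the sketch below is plain sh, and the function name
path_to_fields is hypothetical), the OS version and architecture are
simply the first two path components below the packages root:

```sh
# Hypothetical helper, not the genreadme.awk code.
# $1 = packages root, $2 = full path to a binary package under it
# Prints "<os_version> <machine_arch>".
path_to_fields() {
    rel=${2#"$1"/}            # e.g. 2.0/vax/All/foo-2.0.tar.gz
    os_version=${rel%%/*}     # first component
    rest=${rel#*/}            # e.g. vax/All/foo-2.0.tar.gz
    machine_arch=${rest%%/*}  # second component
    printf '%s %s\n' "$os_version" "$machine_arch"
}
```

Note that nothing in the path says which operating system the package
was built for, which is why that field has to come from somewhere else.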


----------------------------------


Now, how to make things better?  One possibility is to use

pkgs=`find ${PACKAGES} -name foo-\* -type f`

where -type f skips the symbolic links from the category
directories to All.  Then, for each file in $pkgs, use
pkg_info -B to extract
OPSYS=NetBSD
OS_VERSION=2.0
MACHINE_ARCH=alpha

The advantage is that we can drop the MULTIARCH test, which is
broken anyway, and we should always get the right operating
system, version, and MACHINE_ARCH for each binary package.
We get support for multiple operating systems for free.
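
A minimal sketch of what this could look like in sh follows.  The
function names are invented for illustration, and it assumes that
pkg_info -B can be pointed at a binary package file and prints
VAR=value lines like the three shown above; it is not the proposed
genreadme.awk change itself.

```sh
# Print the value of one VAR=value line from build-info text on stdin.
# $1 = variable name, e.g. OPSYS
build_info_field() {
    sed -n "s/^$1=//p" | head -n 1
}

# Hypothetical driver: $1 = PACKAGES root, $2 = package base name.
# Assumption: "pkg_info -B <file>" prints VAR=value build information.
# Prints "<file> <opsys> <os_version> <machine_arch>" per binary package.
list_pkg_platforms() {
    # -type f skips the category symlinks that point into All
    find "$1" -name "$2-*" -type f | while IFS= read -r pkg; do
        info=$(pkg_info -B "$pkg") || continue
        opsys=$(printf '%s\n' "$info" | build_info_field OPSYS)
        osver=$(printf '%s\n' "$info" | build_info_field OS_VERSION)
        arch=$(printf '%s\n' "$info" | build_info_field MACHINE_ARCH)
        printf '%s %s %s %s\n' "$pkg" "$opsys" "$osver" "$arch"
    done
}
```

One thing the sketch makes visible is that pkg_info runs once per
matching file.  The "foo-*" pattern will also match a package named
foo-bar, so a real version would need a stricter match.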

The drawback is that with 5,000 packages, you're running
find 5,000 times and pkg_info potentially many more times.
For example, ftp.netbsd.org has right around 100,000
binary packages (!).

At 0.125 seconds per pkg_info (taken from pkg_info -B
on about 300 pkgs on a lightly loaded alpha PC164),
this is about 3.5 hours of pkg_info.  For a more reasonable
5,000 binary packages, it's only about 10 minutes.
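
For anyone who wants to plug in their own numbers, the arithmetic
behind these estimates is just package count times per-call cost.
The helper name below is made up, and the cost is given in
milliseconds only because sh has no floating point:

```sh
# $1 = number of binary packages
# $2 = cost of one pkg_info -B call, in milliseconds
# Prints the total pkg_info time in whole seconds.
pkg_info_seconds() {
    echo $(( $1 * $2 / 1000 ))
}

pkg_info_seconds 100000 125   # prints 12500 (seconds)
pkg_info_seconds 5000 125     # prints 625 (seconds)
```

This counts only pkg_info; the per-package find runs come on top.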


Comments on this approach?  Suggestions for a better approach?

Thanks
-Dan

