Re: The history of pkgsrc and its categories

To: Alistair Crooks <alistaircrooks%gmail.com@localhost>
Subject: Re: The history of pkgsrc and its categories
From: "Dr. Thomas Orgis" <thomas.orgis%uni-hamburg.de@localhost>
Date: Wed, 8 Oct 2025 23:16:40 +0200
TL;DR: May we consolidate categories, and populate science, after the
repo move?

On Tue, 7 Oct 2025 18:06:45 -0700,
Alistair Crooks <alistaircrooks%gmail.com@localhost> wrote:

> I've just seen that we have a science category with 1 pkgsrc entry in it.

I'll chime in as I think I am somewhat involved in that category;-)

> the stunning proliferation of
> pkgsrc entries in the science category would seem to suggest otherwise.

There would have been more entries had I (and others) been able to
devote more time to actually adding packages. That's not a great
excuse. I still hold that this will change in the future once I work
again on packaging new stuff. Too damn busy on multiple fronts. I do
have local build scripts that could/should be part of pkgsrc. And I do
have problems to solve before I get there. And I should stop whining.

> Just to clarify what we've done in the past - we try to limit the number of
> categories that we have in pkgsrc. 
> […] repository move them all […]
> Once the new all-singing, all-dancing version control software is
> available, the equivalent of repository moves should be back […]

My idea with the science category was exactly to prevent a
proliferation of categories that continue the series of

biology
geography
math

to also have

anthropology
chemistry
history
law
*_studies
physics
psychology
theology

… or whatever you can think of as an academic field. My suggestion is to
merge biology and geography into the more generic 'science' category to
cover at least the natural sciences, and maybe anything relating to
computational or data analysis methods belonging to some scientific
field, be it social science, linguistics, whatnot. I'm not sure if
people would like to separate 'science' from the 'humanities' or
religious studies. We could call it academics, but that is awkward and
maybe also wrong, as people can use scientific methods outside of
academic contexts.

Keeping math as a separate category would be fine, I think, as
mathematical methods are used _anywhere_ in computing, also outside
research. Numerical algorithms can land in either category, depending
on how abstracted they are from a particular application. Take
discretization methods for differential equations: they could be
considered part of fluid dynamics (physics, science) or of generic
math/numerics.

The current 'devel' category is not that useful, IMHO. In general, I
must admit that locating a package in pkgsrc is mildly annoying
because the matching category is not that obvious or is chosen
inconsistently. I totally dig 'editors'. That's an established term for
a class of programs, where the package name itself might not give away
what it is for, but editors/joe provides meaning. I can browse into the
category and see if there's some editor I did not know yet. Though
devel/waf would perhaps make more sense as build_systems/waf and
devel/git as revision_control/git. Or we could at least separate
development tools from arbitrary libraries. Why is llvm in lang, even?
It does not provide a (human-readable) programming language. Clang or
flang do. Rust does. Should there be a compiler or runtime category?

Then … when is something TeX-related part of textproc, when of print
(and not part of math or graphics, if it is not a font)?

Right now, the pkgsrc categories are of limited usefulness. They mix
multiple semantic dimensions, inconsistently. If we had just
categories 'a' to 'z' based on package name (perhaps also py-a to py-z
…), that at least would be systematic.
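
To make the name-based idea concrete, here is a minimal sketch of how
such a purely systematic bucket could be derived from a package name.
The function name bucket_for and the exact handling of the py- prefix
are my assumptions for illustration, not anything pkgsrc provides.

```shell
#!/bin/sh
# Hypothetical helper: map a package name to a purely name-based bucket.
# Names starting with "py-" get a "py-<letter>" bucket; everything else
# gets the lowercased first character of the name.
bucket_for() {
	name=$1
	case $name in
	py-?*)
		rest=${name#py-}
		first=$(printf '%.1s' "$rest" | tr '[:upper:]' '[:lower:]')
		printf 'py-%s\n' "$first"
		;;
	?*)
		first=$(printf '%.1s' "$name" | tr '[:upper:]' '[:lower:]')
		printf '%s\n' "$first"
		;;
	*)
		# Empty name: nothing sensible to return.
		return 1
		;;
	esac
}
```

With this, `bucket_for joe` gives `j` and `bucket_for py-numpy` gives
`py-n`. Systematic, but it tells you nothing about what the package does.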

> (FWIW, I don't really like the categories distinction per se, but it came
> over when pkgsrc was started. It could better be done by keywords in the
> Makefiles, but that would leave one main directory with n-thousand entries,
> which is completely unworkable. Multi-level categories or
> sub-categories also quickly become unworkable. So a single layer of
> categories it is :) )

I do see this one practical point of categories: dividing the set of
packages into manageable sets as they appear on disk. Debian sorts its
binary packages by name for the same effect, without semantics
(everything starting with letter 'g' goes into category 'g'). The
sources for building the packages do not live in a common
repository, as I recently learned. If they did, they would likely have
a repo structure with subdirectories sorted by the teams who manage a
set of related packages. This is one aspect to consider: the
subdirectories simply as a way to keep people from stepping on each
other's toes, as there is a good chance that they don't work on the
same category/Makefile and thus avoid annoying merge conflicts.

In Source Mage GNU/Linux (still alive for some definition of alive),
which is somewhat similar to BSD ports or pkgsrc, but based on bash
scripts instead of Makefiles, the repo of all packages sorts things
into sections (subdirectories) based on topic but also on practicality
where a set of packages shares a build method or common
options/dependencies. Common build functions (in the sense of shell
functions, as all builds are written as bash scripts) are shared
between packages in some sections. You can refer to these functions
also from another section, but it's more convenient to have things
grouped. Pkgsrc does similar things by providing mk files that can live
anywhere and are included from anywhere, without a convention that
packages doing

	.include "../../lang/python/wheel.mk"

would all live in a category 'python-wheel' (where the wheel.mk might be
called python-wheel/common.mk or such). Of course, with the explosion
of software packages nowadays, trying to catch all of pypi, for
example, in a single category, results in a rather impractical number
of directory entries again … but then, it's impractical to package it
all for other reasons before that.
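
To see how far the users of one mk file are actually spread over
categories, something like the following could be run from the top of a
pkgsrc-style tree. It is only a sketch: the function name
includers_by_category is made up, and it assumes the usual
category/package/Makefile layout and an .include line written exactly
as in the example above.

```shell
#!/bin/sh
# Hypothetical helper: count, per category, the packages whose Makefile
# includes a given mk fragment. Run from the top of a tree laid out as
# category/package/Makefile.
# Note: -mindepth/-maxdepth are not POSIX, but GNU and BSD find have them.
includers_by_category() {
	fragment=$1	# e.g. ../../lang/python/wheel.mk
	find . -mindepth 3 -maxdepth 3 -name Makefile \
		-exec grep -lF -e ".include \"$fragment\"" {} + |
		cut -d/ -f2 | sort | uniq -c
}
```

Calling `includers_by_category ../../lang/python/wheel.mk` prints one
line per category with the number of matching packages in front.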

The python-pypi (or perl-cpan) section in Source Mage also carries only
a very modest subset of the upstream set (around 400 to 500). The
packages therein mostly (all?) inherit the build functions from the
section or overall defaults and just consist of a dependency list and
package description. Explicit or implicit code reuse is a major driver
for categories/sections: basically, not having to write the .include
above explicitly saves time and space. In pkgsrc, this would be
achieved by something like


	.include "../section.mk"

as part of the minimal boilerplate, I presume, and just by convention
it being rare that a package does

	.include "../../other/section.mk"


The Spack packaging system for scientific software (mainly) seems to
avoid levels of subdirectories in its structure and just keeps all
packages (which are mostly a subdir containing a lonely package.py) in
a 'packages' directory, prompting the GitHub web interface to limit
the listing to 1000 entries. 1000 is a rather low number, nowadays … but
for sure, a repo with not only 8549 entries like the spack-packages
one, but 100000 of them, would be annoying to browse with a web
interface. Obvious categories as subdirectories would be best. But you
won't guess obvious categories for everything. So a search function it
will be. Something like

	echo */*xindy*

in the shell gives me a start to check where I'll get the TeX package
xindy from. You could spice it up with keywords and a specific package
search tool (that may already exist hiding behind my ignorance).
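
A slightly spiced-up variant of that glob could look like the sketch
below. It matches the keyword against the package directory name and
against the one-line COMMENT= in each package Makefile, ignoring case.
The name pkg_search is hypothetical, and the keyword is used as-is in a
glob and a regular expression, so special characters are not escaped.

```shell
#!/bin/sh
# Hypothetical helper: list category/package directories whose name or
# COMMENT= line in the Makefile matches a keyword, case-insensitively.
# Run from the top of a tree laid out as category/package/Makefile.
# Note: -iname, -mindepth and -maxdepth are GNU/BSD find extensions.
pkg_search() {
	keyword=$1
	{
		# Match on the package directory name, like `echo */*xindy*`.
		find . -mindepth 2 -maxdepth 2 -type d -iname "*$keyword*"
		# Match on the COMMENT line of each package Makefile.
		find . -mindepth 3 -maxdepth 3 -name Makefile \
			-exec grep -liE -e "^COMMENT[[:space:]]*=.*$keyword" {} + |
			sed 's|/Makefile$||'
	} | sed 's|^\./||' | sort -u
}
```

Then `pkg_search xindy` prints every category/package path that matches,
one per line, without listing a package twice if both name and comment
match.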


Alrighty then,

Thomas

PS: As for discoverability, adding multiple keywords per package and
having each as a category reflected as a directory could work nicely by
placing

cat_b/foo → ../cat_a/foo

symlinks … ?
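
A minimal sketch of how such links could be placed, assuming it is run
from the top of the tree. The function name link_into_categories is
made up for illustration. Note that a relative symlink target is
resolved against the directory holding the link, so from inside cat_b/
a single ".." reaches the tree root.

```shell
#!/bin/sh
# Hypothetical helper: expose category/package under extra categories
# via relative symlinks. Run from the top of a pkgsrc-style tree.
link_into_categories() {
	primary=$1	# e.g. cat_a/foo
	shift
	pkg=${primary##*/}
	for extra in "$@"; do
		mkdir -p "$extra" || return 1
		# Target is relative to the directory that holds the link.
		ln -s "../$primary" "$extra/$pkg" || return 1
	done
}
```

For example, `link_into_categories cat_a/foo cat_b cat_c` makes
cat_b/foo and cat_c/foo point at the one real directory cat_a/foo.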

-- 
Dr. Thomas Orgis
HPC @ Universität Hamburg