Re: Distributed bulk building for slow machines
- Subject: Re: Distributed bulk building for slow machines
- From: John Klos <john%ziaspace.com@localhost>
- Date: Thu, 6 Jan 2011 20:51:36 +0000 (UTC)
Essentially, a sandbox would be created on worker machines where sshd
is run inside of the sandbox with ssh keys which give the master the
ability to send commands and rsync files. The master would iterate
through the package list and remotely run a "make package" for each,
noting any failures, then possibly sync files afterwards.
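A minimal sketch of that master loop, assuming a plain shell driver (the worker name, file names, and layout here are placeholders, not an existing tool):

```shell
#!/bin/sh
# Sketch of the master's iteration: run "make package" for each
# pkgsrc path via a remote-run command prefix, recording failures.
# $1 = command prefix used to run a build (e.g. "ssh builder@worker1")
# $2 = file listing one pkgsrc path (category/package) per line
# $3 = file that collects the paths of failed builds
run_build_loop() {
    remote=$1 pkglist=$2 failed=$3
    : > "$failed"
    while read -r pkgpath; do
        if ! $remote "cd /usr/pkgsrc/$pkgpath && make package"; then
            echo "$pkgpath" >> "$failed"
        fi
    done < "$pkglist"
}

# The master might then invoke it as (worker name is hypothetical):
#   run_build_loop "ssh builder@worker1.example.org" pkglist.txt failed.txt
# and afterwards pull the binary packages back with rsync:
#   rsync -av builder@worker1.example.org:/usr/pkgsrc/packages/ packages/
```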
This one I'm not so sure of, as this would prevent users sitting behind
NAT boxes from participating as clients or slaves. It could perhaps be
better if the clients periodically reported in their status, and if a
client isn't heard from in a long while, the corresponding work could be
released and farmed out to another client.
Between port forwarding and IPv6, I don't think this is a problem. If
someone is stuck behind NAT and can't get a port forwarded, perhaps we can
figure something out. But because we're talking about building binaries
which will be officially offered on NetBSD servers, it'd have to be ssh
(or otherwise encrypted).
I had a separate smtp-like client/server protocol in mind which
could be used for reporting liveness/status and also to pull new
work from the master.
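One way to picture that protocol's server side — the verbs and numeric reply codes below are invented for illustration, not an existing implementation:

```shell
#!/bin/sh
# Sketch of the master's dispatcher for an SMTP-like line protocol.
# Verbs and reply codes are assumptions:
#   STATUS <pkgpath>  -- worker reports what it is currently building
#   NEXT              -- worker asks the master for more work
#   DONE <pkgpath>    -- worker reports a finished package
handle_command() {
    cmd=$1; arg=$2
    case "$cmd" in
        STATUS) echo "250 noted $arg" ;;
        NEXT)   echo "251 ${NEXT_PKG:-devel/gmake}" ;;  # hand out next package
        DONE)   echo "250 recorded $arg" ;;
        *)      echo "500 unrecognized command" ;;
    esac
}
```

A worker behind NAT needs only outbound connections here: it polls with STATUS and NEXT, and a worker that falls silent simply stops sending STATUS lines, letting the master release its work after a timeout.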
So long as file transfer is always done via ssh... I can't think of any
way someone could compromise security by telling a worker to build some
other package than what it's supposed to.
Similar to the other comment in this thread, this sounds to me like a
bad idea. If we're using slow machines, there's no gain to be had by
duplicating work. Instead, keep one machine building a given package
until it's finished or not heard from in a while.
I don't want to duplicate any work. Perhaps I was misunderstood. What I
was getting at is if there are 1,000 packages which are dependencies for
other packages, we'd need 1,001 machines before we'd have to worry about
that 1,001st machine building a package for which a dependency doesn't
already exist, and therefore it tries to build that dependency itself.
But it sounds like this isn't anything we need to worry about because
distbb takes care of all of that.
I think there are enough "independent" or "initial dependency" packages that you would need a large number of clients before running out of initial work to farm out, leaving clients idle while waiting for results.
One thing to keep in mind, though, is this: all the machines
participating in such a distributed effort need to run the same
base OS release, and also run only the official release for the
duration of their contribution.
That's why I think a sandbox would be preferred. It might also be nice to
trick the building tools into reporting whatever version of the kernel we
want, based on the OS version in the sandbox, not based on the booted
kernel. (If I remember correctly, this is possible, but I forget how)
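As a hedged sketch of the idea (the sandbox path and version string are placeholders; pkgsrc's sysutils/libkver does something similar at the libc level via preloading), one could shadow uname(1) inside the sandbox:

```shell
#!/bin/sh
# Sketch: make uname(1) inside the sandbox report the release version
# rather than the booted kernel.  Path and version are placeholders.
SANDBOX="${SANDBOX:-/tmp/sandbox.$$}"
mkdir -p "$SANDBOX/overrides"
cat > "$SANDBOX/overrides/uname" <<'EOF'
#!/bin/sh
case "$1" in
    -r) echo "5.1" ;;                     # pretend release kernel
    -v) echo "NetBSD 5.1 (GENERIC)" ;;    # pretend GENERIC build
    *)  exec /usr/bin/uname "$@" ;;       # pass everything else through
esac
EOF
chmod +x "$SANDBOX/overrides/uname"
# Inside the sandbox, put the override first in the path:
#   PATH=/overrides:$PATH
```

This only fools tools that shell out to uname(1); anything calling uname(3) directly would need the library-level approach instead.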
Thought also needs to be given to how the (presumed local) pkgsrc
repository is to be kept in sync, and what it would mean that there
might be slight version skews even if all were updated to e.g. 2010Q3,
since the update may have been done at different times. In my thoughts
this gives rise to the possibility of a request from the master saying
"build package so-and-so version x.y.z" and the client ending up saying
"sorry, my pkgsrc doesn't have version x.y.z of package so-and-so."
In my imagined world, the master would tell the slave to cd to
/usr/pkgsrc/lang/perl5 and make package, not make perl-5.12.2nb1. Not sure
how distbb does it.
Part of the rationale behind this entire thing is that packages will be
built slowly, so the tree would get updated often relative to the number
of packages built. It'd be almost assumed that some archs would never
finish. Therefore, there'd be times when builds would be stopped, the
pkgsrc tree updated, and building would be restarted.
Imagine, for instance, what might happen if there's a security issue in
perl. Will we wait for the entire 10,000 packages to be finished, then
restart building? Heck no. We'd update the pkgsrc tree, make sure all of
the "priority" packages which depend on perl are rebuilt, then either
restart or resume the bulk build.
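The bookkeeping behind "rebuild everything that depends on perl" is a transitive reverse-dependency walk. A rough sketch, assuming a plain text file of "package dependency" pairs (the file format is an assumption, not how distbb stores it):

```shell
#!/bin/sh
# Sketch: given a file of "package dependency" pairs, print the given
# package plus everything that transitively depends on it -- i.e. the
# priority rebuild set after a security fix.  Fixed-point iteration.
reverse_deps() {
    target=$1; depfile=$2
    seen=$target; todo=$target
    while [ -n "$todo" ]; do
        next=""
        for t in $todo; do
            # every package whose dependency column matches $t
            for p in $(awk -v d="$t" '$2 == d { print $1 }' "$depfile"); do
                case " $seen " in
                    *" $p "*) ;;                           # already queued
                    *) seen="$seen $p"; next="$next $p" ;;
                esac
            done
        done
        todo=$next
    done
    echo "$seen"
}
```

Running reverse_deps lang/perl5 deps.txt would then emit the rebuild list in breadth-first order.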
Then there's of course also the issue of... um... security; up till now
packages on ftp.netbsd.org have been built by developers, and a change
to allow anonymous contributions (not a given...) could be considered
problematic. Therefore, it's a question of whether we need some
administration of who contributes cycles, and whether there needs to be
some token-based authentication scheme for a client to be able to
participate.
I'd assume that all machines would be in the control of NetBSD developers
and we'd exchange ssh keys on one of the NetBSD project servers.
If someone has a good amount of usable system resources, we'd probably
have to deputize them and have them agree to the same kinds of things we
developers agree to. We can figure that out when we get there.
Also... note that some packages are prone to causing panics on at
least m68k hosts; the message seen on my hp300 systems is (from
memory) "out of address space", and I believe this has to do with the
pmap implementation. So when doing distributed bulk builds for m68k,
some packages need to be marked "no way" (and/or someone needs to take
a hard look at redoing the m68k pmap implementation...). I seem to
recall that some of the clisp implementations would cause this, as
will trying to build icu, which is needed for lang/parrot, so no
parrot (or parrot-based perl6) for m68k unless this problem is fixed...
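pkgsrc already has a convention for such "no way" markings: NOT_FOR_PLATFORM in the affected package's Makefile. A hypothetical fragment for icu (the comment text is illustrative, not an actual entry):

```make
# In e.g. textproc/icu/Makefile -- exclude platforms where the build
# is known to take the host down rather than merely fail:
NOT_FOR_PLATFORM+=	NetBSD-*-m68k	# build panics hp300 (pmap exhaustion)
```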
This is more justification for sandboxes. I've updated my m68k kernels
based on Michael Hitch's recommendations to mitigate pmap issues for the
moment, but the builds should reflect a generic OS version number and the
userland in which they're built should be the same as what people download
from ftp.NetBSD.org. Yes, we don't want to panic machines. More
importantly, though, the underlying problem should get fixed at some point.