Re: Distributed bulk building for slow machines

Subject: Re: Distributed bulk building for slow machines
From: John Klos <john%ziaspace.com@localhost>
Date: Thu, 6 Jan 2011 20:51:36 +0000 (UTC)

Essentially, a sandbox would be created on worker machines where sshd is run inside of the sandbox with ssh keys which give the master the ability to send commands and rsync files. The master would iterate through the package list and remotely run a "make package" for each, noting any failures, then possibly sync files afterwards.
This one I'm not so sure of, as this would prevent users sitting behind NAT boxes from participating as clients or slaves. It could perhaps be better if the clients periodically reported in their status, and if a client isn't heard from in a long while, the corresponding work could be released and farmed out to another client.
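The report-in-and-release idea above could be sketched roughly as follows. This is only an illustration, not part of any existing tool: the `LeaseTable` class, its method names, the one-hour timeout and the host names are all hypothetical.

```python
import time


class LeaseTable:
    """Tracks which client holds which package and when it last reported in.

    Hypothetical helper for illustration; not taken from distbb or pkgsrc.
    """

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.leases = {}  # package path -> (client name, last-seen timestamp)

    def assign(self, package, client, now=None):
        """Record that `client` has been handed `package`."""
        now = time.time() if now is None else now
        self.leases[package] = (client, now)

    def heartbeat(self, client, now=None):
        """Refresh the last-seen time of every lease held by `client`."""
        now = time.time() if now is None else now
        for package, (holder, _) in list(self.leases.items()):
            if holder == client:
                self.leases[package] = (holder, now)

    def release_stale(self, now=None):
        """Drop leases whose client has been silent too long.

        Returns the released package paths so they can be farmed out again.
        """
        now = time.time() if now is None else now
        stale = sorted(
            package
            for package, (_, seen) in self.leases.items()
            if now - seen > self.timeout
        )
        for package in stale:
            del self.leases[package]
        return stale


table = LeaseTable(timeout_seconds=3600)
table.assign("lang/perl5", "hp300-a", now=0)
table.assign("textproc/icu", "hp300-b", now=0)
table.heartbeat("hp300-a", now=3000)      # hp300-a reports in, hp300-b stays silent
released = table.release_stale(now=4000)  # only hp300-b's package is released
```

Because the client initiates every contact in this scheme, it would also work for machines behind NAT; the timeout would need to be generous for slow hosts.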

Between port forwarding and IPv6, I don't think this is a problem. If someone is stuck behind NAT and can't get a port forwarded, perhaps we can figure something out. But because we're talking about building binaries which will be officially offered on NetBSD servers, it'd have to be ssh (or otherwise encrypted).

I had a separate smtp-like client/server protocol in mind which
could be used for reporting liveness/status and also to pull new
work from the master.

So long as file transfer is always done via ssh... I can't think of any way someone could compromise security by telling a worker to build some other package than what it's supposed to.

Similar to the other comment in this thread, this sounds to me like a bad idea. If we're using slow machines, there's no gain to be had by duplicating work. Instead, keep one machine building a given package until it's finished or not heard from in a while.

I don't want to duplicate any work. Perhaps I was misunderstood. What I was getting at is if there are 1,000 packages which are dependencies for other packages, we'd need 1,001 machines before we'd have to worry about that 1,001st machine building a package for which a dependency doesn't already exist, and therefore it tries to build that dependency itself.

But it sounds like this isn't anything we need to worry about because distbb takes care of all of that.

I think that there are a sufficient number of "independent" or "initial dependencies" packages that you need a large number of clients before you run out of initial work to farm out to the clients, and you have to leave clients idling waiting for results.
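To make the "initial work" point concrete, here is a small sketch of how a master might pick packages that can be handed out right now. It is not how distbb does it; the `ready_packages` function and the example dependency table are assumptions made purely for illustration.

```python
def ready_packages(deps, built, in_progress):
    """Return packages that can be farmed out immediately.

    deps        -- dict mapping package path -> list of package paths it needs
    built       -- set of packages whose binaries already exist
    in_progress -- set of packages some client is currently building

    A package is ready when it is neither built nor being built and every
    one of its dependencies is already built.
    """
    return sorted(
        pkg
        for pkg, needs in deps.items()
        if pkg not in built and pkg not in in_progress and set(needs) <= built
    )


# Illustrative dependency table (not real pkgsrc dependency data).
deps = {
    "lang/perl5": [],
    "devel/gmake": [],
    "www/p5-URI": ["lang/perl5"],
    "www/p5-libwww": ["www/p5-URI", "lang/perl5"],
}

initial = ready_packages(deps, built=set(), in_progress=set())
later = ready_packages(deps, built={"lang/perl5"}, in_progress={"devel/gmake"})
```

With nothing built, `initial` holds the two packages without dependencies; clients only have to idle once the ready list is shorter than the number of free clients.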


Exactly.

One thing to keep in mind, though, is this: all the machines
participating in such a distributed effort need to run the same
base OS release, and also run only the official release for the
duration of its contribution.

That's why I think a sandbox would be preferred. It might also be nice to trick the building tools into reporting whatever version of the kernel we want, based on the OS version in the sandbox, not based on the booted kernel. (If I remember correctly, this is possible, but I forget how.)

Thought also needs to be given to how the (presumed local) pkgsrc repository is to be kept in sync, and what it would mean that there might be slight version skews even if all were updated to e.g. 2010Q3, since the update may have been done at different times. In my thoughts this gives rise to the possibility of a request from the master saying "build package so-and-so version x.y.z" and the client ending up saying "sorry, my pkgsrc doesn't have version x.y.z of package so-and-so (yet)".

In my imagined world, the master would tell the slave to cd to /usr/pkgsrc/lang/perl5 and make package, not make perl-5.12.2nb1. Not sure how distbb does it.
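As a rough sketch of that "cd and make package" approach, the master side might look something like the following. The function names, the `worker1` host name and the injectable `runner` parameter are assumptions for illustration; only `ssh`, the `/usr/pkgsrc` path and `make package` come from the discussion above.

```python
import shlex
import subprocess


def build_command(host, pkgpath, pkgsrc_root="/usr/pkgsrc"):
    """Build the ssh argument list that runs "make package" in one package directory."""
    directory = shlex.quote(pkgsrc_root + "/" + pkgpath)
    return ["ssh", host, "cd {} && make package".format(directory)]


def run_builds(host, pkgpaths, runner=subprocess.call):
    """Run each build in turn and return the package paths that failed.

    `runner` takes an argument list and returns an exit status; it defaults
    to subprocess.call and can be swapped for a stub when testing.
    """
    failures = []
    for pkgpath in pkgpaths:
        if runner(build_command(host, pkgpath)) != 0:
            failures.append(pkgpath)
    return failures


# Example: inspect the command without contacting any host.
command = build_command("worker1", "lang/perl5")
```

Addressing packages by path rather than by versioned name sidesteps the "my pkgsrc doesn't have version x.y.z" reply, at the cost of the master not knowing in advance which version a worker will produce.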

Part of the rationale behind this entire thing is that packages will be built slowly, so the tree would get updated often relative to the number of packages built. It'd be almost assumed that some archs would never finish. Therefore, there'd be times when builds would be stopped, the pkgsrc tree updated, and building would be restarted.

Imagine, for instance, what might happen if there's a security issue in perl. Will we wait for the entire 10,000 packages to be finished, then restart building? Heck no. We'd update the pkgsrc tree, make sure all of the "priority" packages which depend on perl are rebuilt, then either restart or resume the bulk build.
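One way to work out which "priority" packages need rebuilding after such an update is to walk the dependency graph backwards from the changed package. The sketch below assumes a hypothetical `packages_to_rebuild` function and a made-up dependency table and priority set; it is not taken from any existing bulk build tool.

```python
from collections import deque


def packages_to_rebuild(deps, changed, priority):
    """Return the changed package plus every priority package that
    depends on it, directly or indirectly.

    deps     -- dict mapping package path -> list of package paths it needs
    changed  -- package path that was updated (e.g. for a security fix)
    priority -- set of package paths that must be rebuilt promptly
    """
    # Invert the graph: for each package, who depends on it?
    dependents = {}
    for pkg, needs in deps.items():
        for need in needs:
            dependents.setdefault(need, set()).add(pkg)

    # Breadth-first walk over everything reachable from the changed package.
    affected = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)

    return sorted(p for p in affected if p == changed or p in priority)


# Illustrative data only.
deps = {
    "lang/perl5": [],
    "www/p5-URI": ["lang/perl5"],
    "www/p5-libwww": ["www/p5-URI"],
    "devel/gmake": [],
}
rebuild = packages_to_rebuild(
    deps, "lang/perl5", priority={"www/p5-libwww", "devel/gmake"}
)
```

Here `devel/gmake` is in the priority set but does not depend on perl, so it is left alone, while `www/p5-URI` is affected but not flagged as priority and would wait for the resumed bulk build.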

Then there's of course also the issue of ... um... security; up till now packages on ftp.netbsd.org have been built by developers, and a change to allow anonymous contributions (not a given...) could be considered to be problematical. Therefore, it's a question of whether we need some administration of who contributes cycles, and whether there needs to be some token-based authentication scheme for a client to be able to contribute.

I'd assume that all machines would be in the control of NetBSD developers and we'd exchange ssh keys on one of the NetBSD project servers.

If someone has a good amount of usable system resources, we'd probably have to deputize them and have them agree to the same kinds of things we developers agree to. We can figure that out when we get there.

Also... note that trying to build some packages is prone to cause panics on at least m68k hosts; the message seen on my hp300 systems is (from memory) "out of address space", and I believe this has to do with the pmap implementation. So when doing distributed bulk builds for m68k, some packages need to be marked "no way" (and/or someone needs to take a hard look at re-doing the m68k pmap implementation...) I seem to recall that some of the clisp implementations would cause this, as will trying to build icu, which is needed for lang/parrot, so no parrot (or parrot-based perl6) for m68k unless this problem is fixed...

This is more justification for sandboxes. I've updated my m68k kernels based on Michael Hitch's recommendations to mitigate pmap issues for the moment, but the builds should reflect a generic OS version number and the userland in which they're built should be the same as what people download from ftp.NetBSD.org. Yes, we don't want to panic machines. More important, though, is that the underlying problem gets fixed sometime soon!


John

