pkgsrc-Users archive


Re: [changed subject to] pkgsrc in scientific computing




You present some interesting ideas...

I've given some thought to many of the issues you mention below like Intel compilers, MKL, commercial MPI implementations, etc. and decided (for now) to keep them separate from our pkgsrc use.

In the enterprise Linux environment, pkgsrc easily serves to solve one major issue:

As you're probably aware, but I'll state for everyone's benefit, the standard RHEL/CentOS Yum repository is meant to ensure system stability and long-term binary compatibility for commercial enterprise software, not to support running the latest open source. Hence, it provides older versions of tools and libraries that are back-patched for security holes, but otherwise not updated, in some cases for years. It's also a very small package collection compared to pkgsrc, Debian packages, FreeBSD Ports, MacPorts, etc.

Trying to build the latest scientific apps against Yum RPMs is therefore highly problematic.

In my view, what pkgsrc does best is allow us to very easily manage the latest open source apps in an environment that's virtually independent of the Yum repos. This is easily done by bootstrapping with the following options:

X11_TYPE= modular
PREFER_NATIVE= no
PREFER_PKGSRC= yes

Running ldd on most of our pkgsrc binaries reveals that the only Yum libraries they use are libc and libm.
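The ldd check described above can be automated. The following is a minimal sketch, not the author's actual tooling: it assumes a glibc-style ldd output format and a hypothetical prefix of /usr/pkg, and the SAMPLE text is made up for illustration rather than captured from a real system.

```python
import re
import subprocess

# Hypothetical prefix; substitute the one your tree was bootstrapped with.
PKGSRC_PREFIX = "/usr/pkg"


def libs_outside_prefix(ldd_output, prefix=PKGSRC_PREFIX):
    """Return sorted names of libraries that ldd resolved outside prefix."""
    root = prefix.rstrip("/") + "/"
    outside = set()
    for line in ldd_output.splitlines():
        # Typical glibc ldd line: "libm.so.6 => /lib64/libm.so.6 (0x...)"
        match = re.match(r"\s*(\S+)\s+=>\s+(/\S+)", line)
        if match is None:
            # Skips linux-vdso, the dynamic loader line, and "not found".
            continue
        name, path = match.groups()
        if not path.startswith(root):
            outside.add(name)
    return sorted(outside)


def run_ldd(binary):
    """Run ldd on a binary and return its text output."""
    result = subprocess.run(["ldd", binary], capture_output=True, text=True)
    return result.stdout


# Illustrative ldd output (invented for this example).
SAMPLE = """\
    linux-vdso.so.1 =>  (0x00007fff5a9fe000)
    libR.so => /usr/pkg/lib/R/lib/libR.so (0x00007f3c1a000000)
    libreadline.so.6 => /usr/pkg/lib/libreadline.so.6 (0x00007f3c19c00000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f3c19900000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f3c19500000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f3c1a400000)
"""

if __name__ == "__main__":
    print(libs_outside_prefix(SAMPLE))
```

On a real system, libs_outside_prefix(run_ldd(path)) could be applied to each file under the prefix's bin directory to spot packages that picked up a native library unexpectedly.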

In a nutshell:

1) We use pkgsrc to quickly deploy mainstream versions of open source software like R, genomics tools, etc. built with the stock GCC suite.

2) Commercial applications are supported by Yum, as intended.

3) For the few open source apps that need to be highly optimized (e.g. WRF weather model), we still do a few caveman installations using ICC, MKL, etc. This seems to be how most software is installed in HPC, by the way, so imagine the time savings pkgsrc could provide in theory.

I attempted to bootstrap a pkgsrc tree with ICC recently, and decided I couldn't justify the time it would take to make it work well. I think this would be the route to use if one wanted to incorporate closed-source components like MKL into pkgsrc packages, though.

On the other hand, an option in the R package to use something like openblas/goto or atlas would be worth pursuing in my view. This should be fairly easy and it would be mostly portable.

I've also been working with another packager on developing MPI packages that install in $PREFIX/openmpi, $PREFIX/mpich, etc. so that they can coexist. The same install prefix is used for libraries and apps that depend on them, so you could have, for instance, multiple fftw packages installed under those same prefixes.
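To make the layout concrete, here is a small sketch of how the per-MPI prefixes might be derived. The /usr/pkg prefix and the bin, lib, and include subdirectory names are assumptions for illustration; the actual packages may lay things out differently.

```python
from pathlib import PurePosixPath

# Hypothetical pkgsrc prefix; use the one from your own bootstrap.
PKGSRC_PREFIX = PurePosixPath("/usr/pkg")


def mpi_prefix(mpi, prefix=PKGSRC_PREFIX):
    """Install prefix shared by an MPI implementation and its dependents."""
    return prefix / mpi


def mpi_dirs(mpi, prefix=PKGSRC_PREFIX):
    """Directories one would expect under the per-MPI prefix (assumed names)."""
    root = mpi_prefix(mpi, prefix)
    return {name: str(root / name) for name in ("bin", "lib", "include")}


if __name__ == "__main__":
    # An fftw built against each MPI would land under the matching prefix,
    # so the two builds never overwrite each other.
    for mpi in ("openmpi", "mpich"):
        print(mpi, mpi_dirs(mpi)["lib"])
```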

In my experience, I.T. man-hours are the scarcest and most costly resource, even in HPC. Most of our users don't benefit in any meaningful way from the speedups that come from using ICC, MKL, etc. Running software from ordinary GCC-based pkgsrc packages on a cluster or grid reduces months or years of computation to hours or days, and another 20% speedup isn't worth even a modest investment of our time. The cost of the extra core-hours is a fraction of the cost of our time to optimize every build, plus we can usually deploy things much sooner using existing packages.
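The trade-off above can be written down as a simple break-even calculation. This is only a sketch: the function names are invented here, and the figures in the usage example are placeholders, not measurements from any real cluster. It reads "20% speedup" as the optimized build running 1.2 times as fast as the GCC build.

```python
def extra_core_hours(gcc_core_hours, speedup):
    """Core-hours an optimized build would save over the GCC build.

    speedup is the ratio of GCC runtime to optimized runtime, e.g. 1.2.
    """
    return gcc_core_hours * (1.0 - 1.0 / speedup)


def break_even_core_hours(staff_hours, staff_rate, core_hour_cost, speedup):
    """GCC-build core-hours a package must consume before optimizing pays off."""
    staff_cost = staff_hours * staff_rate
    saved_per_core_hour = core_hour_cost * (1.0 - 1.0 / speedup)
    return staff_cost / saved_per_core_hour


if __name__ == "__main__":
    # Placeholder inputs: 40 staff hours at 50 per hour, 0.02 per core-hour.
    print(extra_core_hours(1200, 1.2))
    print(break_even_core_hours(40, 50.0, 0.02, 1.2))
```

With inputs like these, a package has to burn a very large number of core-hours before a hand-optimized build recovers the staff time spent on it, which is the point being made above.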

So, my focus is on creating portable pkgsrc packages that can be quickly deployed on our CentOS clusters and at the same time can be leveraged by users of NetBSD, Dragonfly BSD, Darwin, etc. I think the larger the collection of scientific packages becomes, the more people in the research community will be encouraged to join the cause, which will ultimately benefit everyone, regardless of which OS they use.

Cheers,

    Jason

On 7/7/15 4:24 PM, Thomas Orgis wrote:
> On Tue, 07 Jul 2015 15:28:49 -0500,
> Jason Bacon <jwbacon%tds.net@localhost> wrote:
>
>> On our HPC clusters, I simply bootstrap a whole new tree about every 6
>> to 12 months to make newer software versions available.
> Ah, so we're indeed on the same page there. We are deploying our first
> big setup based on pkgsrc for common software, but always thinking
> about other stuff on top.
>
> One tricky thing is how to handle differing compilers, especially since
> C++ and Fortran modules are not compatible between them. One solution
> is simply not to use those and write wrappers over C in your own code,
> but do you happen to deal with getting stuff like HDF5 from pkgsrc with
> intel/pgi compilers?
>
> Then we've got differing MPI implementations. Various commercial software
> on top. We want to offer the whole deal and are having endless debates
> on how to do it best. Perhaps we should at some point have a longer
> discussion with you, too. Now, we're really busy getting a fresh system up
> and running, of course with an elaborate structure of environment
> modules.
>
>> Older trees are
>> left in place so researchers can finish up projects using the same
>> version of a package, but eventually deprecated.
> We will never delete user software for the lifetime of the system (unless
> there is a _really_ nasty security risk from just having it around).
> But well, we won't carry all old versions onto the next setup.
>
>> I have a lot of scientific packages in wip and more coming, but too
>> little time to devote to it.
> Ah, so you helped us get some of the geography stuff going? ;-)
>
> Btw.: I wonder if it makes sense having thousands of TeX Live packages
> in pkgsrc. It's such a huge collection of packages that actually comes
> with its own package manager. It lends itself well to installation in a
> separate prefix anyway. In our world, there are various separate
> packages in addition to pkgsrc anyway. Pkgsrc takes the place of the
> normal GNU/Linux userspace, on top of which specialist software is
> installed.
>
>> One of my colleagues here is learning to
>> package and may join pkgsrc-wip soon.
> Yes, if this really works out for us in the long term, I might start
> contributing packages, too. Though, there probably always will be
> standalone packages we build in-house. I see the need for pkgsrc with
> the wildly interdependent stuff.
>
>> There's a lot of work to be done
>> in categories like math and biology, though. Fortran support needs some
>> work as well.
> Do you have R built with proper BLAS (perhaps even Intel MKL?), and
> possibly MPI from pkgsrc? Folks are using this software more and more,
> as the field of application of HPC clusters widens.
>
> But, well, let's continue that on a separate thread perhaps, in some
> weeks when I can breathe again (*preparing yet another compute node
> image*).
>
>
> Alrighty then,
>
> Thomas



--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Jason W. Bacon
  jwbacon%tds.net@localhost

  If a problem can be solved,
  there's no need to worry.
If it cannot be solved, then
  worrying will do no good.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


