Re: [changed subject to] pkgsrc in scientific computing

To: Thomas Orgis <thomas.orgis%uni-hamburg.de@localhost>
Subject: Re: [changed subject to] pkgsrc in scientific computing
From: Jason Bacon <jwbacon%tds.net@localhost>
Date: Tue, 07 Jul 2015 17:54:49 -0500


You present some interesting ideas...

I've given some thought to many of the issues you mention below like Intel compilers, MKL, commercial MPI implementations, etc. and decided (for now) to keep them separate from our pkgsrc use.

In the enterprise Linux environment, pkgsrc easily serves to solve one major issue:

As you're probably aware, but I'll state for everyone's benefit, the standard RHEL/CentOS Yum repository is meant to ensure system stability and long-term binary compatibility for commercial enterprise software, not to support running the latest open source. Hence, it provides older versions of tools and libraries that are back-patched for security holes, but otherwise not updated for years in some cases. It's also a very small package collection compared to pkgsrc, Debian packages, FreeBSD Ports, MacPorts, etc.

Trying to build the latest scientific apps against Yum RPMs is therefore highly problematic.

In my view, what pkgsrc does best is allow us to very easily manage the latest open source apps in an environment that's virtually independent of the Yum repos. This is easily done by bootstrapping with the following options:


X11_TYPE= modular
PREFER_NATIVE= no
PREFER_PKGSRC= yes

Running ldd on most of our pkgsrc binaries reveals that the only Yum libraries they use are libc and libm.
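
A minimal sketch of that kind of check, assuming glibc-style ldd output and a pkgsrc prefix of /usr/pkg (both are assumptions, as are the names PKGSRC_PREFIX, ldd and system_libs; adjust the prefix to whatever the tree was bootstrapped with). It lists the resolved libraries that live outside the pkgsrc prefix:

```python
import re
import subprocess

# Assumed prefix for illustration; use the prefix your tree was bootstrapped with.
PKGSRC_PREFIX = "/usr/pkg"


def ldd(binary):
    """Run ldd on a binary and return its text output."""
    result = subprocess.run(["ldd", binary], capture_output=True, text=True)
    return result.stdout


def system_libs(ldd_output, prefix=PKGSRC_PREFIX):
    """Return sorted paths of resolved libraries that live outside prefix."""
    outside = []
    for line in ldd_output.splitlines():
        # Typical line: "libm.so.6 => /lib64/libm.so.6 (0x...)"
        match = re.search(r"=>\s+(/\S+)", line)
        if match is None:
            # Skips linux-vdso, the dynamic loader line, and "not found" entries.
            continue
        path = match.group(1)
        if not path.startswith(prefix + "/"):
            outside.append(path)
    return sorted(outside)


# Hand-written sample in the usual ldd layout (not captured from a real system).
SAMPLE = """\
\tlinux-vdso.so.1 =>  (0x00007ffd00000000)
\tlibz.so.1 => /usr/pkg/lib/libz.so.1 (0x00007f0000000000)
\tlibm.so.6 => /lib64/libm.so.6 (0x00007f0000100000)
\tlibc.so.6 => /lib64/libc.so.6 (0x00007f0000200000)
\t/lib64/ld-linux-x86-64.so.2 (0x00007f0000300000)
"""

print(system_libs(SAMPLE))
```

On a real system, something like system_libs(ldd("/usr/pkg/bin/R")) would do the same for an installed binary; that path is only an example.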


In a nutshell:

1) We use pkgsrc to quickly deploy mainstream versions of open source software like R, genomics tools, etc. built with the stock GCC suite.


2) Commercial applications are supported by Yum, as intended.

3) For the few open source apps that need to be highly optimized (e.g. WRF weather model), we still do a few caveman installations using ICC, MKL, etc. This seems to be how most software is installed in HPC, by the way, so imagine the time savings pkgsrc could provide in theory.

I attempted to bootstrap a pkgsrc tree with ICC recently, and decided I couldn't justify the time it would take to make it work well. I think this would be the route to use if one wanted to incorporate closed-source components like MKL into pkgsrc packages, though.

On the other hand, an option in the R package to use something like openblas/goto or atlas would be worth pursuing in my view. This should be fairly easy and it would be mostly portable.

I've also been working with another packager on developing MPI packages that install in $PREFIX/openmpi, $PREFIX/mpich, etc. so that they can coexist. The same install prefix is used for libraries and apps that depend on them, so you could have, for instance, multiple fftw packages installed under those same prefixes.

In my experience, I.T. man-hours are the scarcest and most costly resource, even in HPC. Most of our users don't benefit in any meaningful way from the speedups that come from using ICC, MKL, etc. Running software from ordinary GCC-based pkgsrc packages on a cluster or grid reduces months or years of computation to hours or days, and another 20% speedup isn't worth even a modest investment of our time. The cost of the extra core-hours is a fraction of the cost of our time to optimize every build, plus we can usually deploy things much sooner using existing packages.
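
As a rough sketch of that trade-off, the break-even point can be written down directly. Everything here is hypothetical: the function name, its parameters, and the numbers in the usage line are illustrative assumptions, not figures from our site. "Speedup" is taken as a factor, so 1.2 means the optimized build runs 1.2 times faster.

```python
def break_even_core_hours(staff_hours, staff_rate, core_hour_cost, speedup):
    """Core-hours of workload at which an optimized build pays for itself.

    staff_hours:    hours spent producing the optimized build
    staff_rate:     cost of one staff hour
    core_hour_cost: cost of one core-hour on the cluster
    speedup:        factor by which the optimized build is faster (> 1)
    """
    if speedup <= 1.0:
        raise ValueError("speedup must be greater than 1")
    staff_cost = staff_hours * staff_rate
    # Fraction of core-hours saved by running 'speedup' times faster.
    saved_fraction = 1.0 - 1.0 / speedup
    return staff_cost / (core_hour_cost * saved_fraction)


# Hypothetical inputs: 10 staff hours at 50 per hour, 0.05 per core-hour.
threshold = break_even_core_hours(10, 50.0, 0.05, 1.25)
print(f"break-even workload: {threshold:.0f} core-hours")
```

Below that workload, the staff time costs more than the core-hours it saves; the threshold grows as the speedup shrinks or as core-hours get cheaper.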

So, my focus is on creating portable pkgsrc packages that can be quickly deployed on our CentOS clusters and at the same time can be leveraged by users of NetBSD, DragonFly BSD, Darwin, etc. I think the larger the collection of scientific packages becomes, the more people in the research community will be encouraged to join the cause, which will ultimately benefit everyone, regardless of which OS they use.


Cheers,

    Jason

On 7/7/15 4:24 PM, Thomas Orgis wrote:

> On Tue, 07 Jul 2015 15:28:49 -0500,
> Jason Bacon <jwbacon%tds.net@localhost> wrote:
>
> > On our HPC clusters, I simply bootstrap a whole new tree about every 6
> > to 12 months to make newer software versions available.
>
> Ah, so we're indeed on the same page there. We are deploying our first
> big setup based on pkgsrc for common software, but always thinking
> about other stuff on top.
>
> One tricky thing is how to handle differing compilers, especially since
> C++ and Fortran modules are not compatible between them. One solution
> is simply not to use those and write wrappers over C in your own code,
> but do you happen to deal with getting stuff like HDF5 from pkgsrc with
> intel/pgi compilers?
>
> Then we've got differing MPI implementations, with various commercial
> software on top. We want to offer the whole deal and are having endless
> debates on how to do it best. Perhaps we should at some point have a
> longer discussion with you, too. Now, we're really busy getting a fresh
> system up and running, of course with an elaborate structure of
> environment modules.
>
> > Older trees are
> > left in place so researchers can finish up projects using the same
> > version of a package, but eventually deprecated.
>
> We will never delete user software for the lifetime of the system (unless
> there is a _really_ nasty security risk from just having it around).
> But well, we won't carry all old versions onto the next setup.
>
> > I have a lot of scientific packages in wip and more coming, but too
> > little time to devote to it.
>
> Ah, so you helped us get some of the geography stuff going? ;-)
>
> Btw.: I wonder if it makes sense having thousands of TeX Live packages
> in pkgsrc. It's such a huge collection of packages that actually comes
> with its own package manager. It lends itself well to installation in a
> separate prefix anyway. In our world, there are various separate
> packages in addition to pkgsrc anyway. Pkgsrc takes the place of the
> normal GNU/Linux userspace, on top of which specialist software is
> installed.
>
> > One of my colleagues here is learning to
> > package and may join pkgsrc-wip soon.
>
> Yes, if this really works out for us in the long term, I might start
> contributing packages, too. Though, there probably always will be
> standalone packages we build in-house. I see the need for pkgsrc with
> the wildly interdependent stuff.
>
> > There's a lot of work to be done
> > in categories like math and biology, though. Fortran support needs some
> > work as well.
>
> Do you have R built with proper BLAS (perhaps even Intel MKL?), and
> possibly MPI from pkgsrc? Folks are using this software more and more,
> as the field of application of HPC clusters widens.
>
> But, well, let's continue that on a separate thread perhaps, in some
> weeks when I can breathe again (*preparing yet another compute node
> image*).
>
>
> Alrighty then,
>
> Thomas



--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Jason W. Bacon
  jwbacon%tds.net@localhost

  If a problem can be solved,
  there's no need to worry.
  If it cannot be solved, then
  worrying will do no good.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
