tech-pkg archive


Re: Deciding on which variant(s) of OpenBLAS library to install



On Mon, 12 Mar 2018 12:48:55 -0500,
Jason Bacon <outpaddling%yahoo.com@localhost> wrote:

> Trying to make them interchangeable would be somewhat like 
> trying to make all POSIX operating systems interchangeable, but on a 
> smaller scale.

This is a bit harsh. Sure, POSIX systems _are_ interchangeable only to
some degree, but the complexities of differing BLAS implementations
really only start once you include parallelization, which changes the
game when the application you link against also does some kind of
parallelization of its own.

The job of making serial BLAS implementations interchangeable needs
some work, but it has been done already: look at Debian.

	https://packages.debian.org/stretch/libblas.so.3

When you look closer at their libopenblas:

/usr/lib/libopenblas.so.0
/usr/lib/libopenblasp-r0.2.19.so
/usr/lib/openblas-base/libblas.so.3
/usr/lib/openblas-base/liblapack.so.3
/usr/share/doc/libopenblas-base/README.Debian
/usr/share/doc/libopenblas-base/TODO.Debian
/usr/share/doc/libopenblas-base/changelog.Debian.gz
/usr/share/doc/libopenblas-base/changelog.gz
/usr/share/doc/libopenblas-base/copyright

You will notice that they have namespaced wrapper libs with the generic
names libblas.so.3 and liblapack.so.3 (they come together nowadays) and
switch among the variants via the alternatives system.

What you should also notice is the libopenblasp file. This is a
parallel build, and the parallel variants are not included in the
generic BLAS abstraction at all right now. They did have issues with
libopenblasp switching between OpenMP and pthreads … different things
broke with each of them. The situation is better nowadays, but the
potential for conflict of the parallelization with a host application
or other libs is of course still present. That is why I would not make
a parallel BLAS the default setting. Only the local admin may choose to
do so for the installation, or just choose it for a certain package as
Jason is suggesting. I am stressing that the local admin shall make that
choice, not the packager.

I am not proposing (anymore) that we really go the interchangeable
libblas route in pkgsrc. We don't have to, as pkgsrc can (re)build things
from source. I'd just like my global switch and am fine with offering
all variants of BLAS to the users at the same time for building their
own software.

> The most popular in my experience are Netlib BLAS (currently the de 
> facto standard and the only implementation in pkgsrc), ATLAS, OpenBLAS, 
> CBLAS

Netlib BLAS is not the de facto standard. It _is_ _the_ standard. It is
the reference implementation of it, at least. Things like MKL may add
functionality, but when they implement the functions of the BLAS set,
they re-implement what Netlib does. You take Netlib and start
optimizing the code, or at least the algorithm, for your architecture.

And, correct me if I'm really wrong, but CBLAS is _not_ a BLAS
implementation. It is a standard for a C interface to a BLAS
implementation, as LAPACKE is for the LAPACK part. OpenBLAS does come
with a libcblas, but we can disable it and install the generic package.
Not much is lost by that. Same with LAPACKE linking against liblapack
(which is identical to libopenblas).

We're talking about the choice of BLAS+LAPACK, as they belong together
nowadays. On top of that choice you can install the C wrapper libraries
libcblas and liblapacke.

> I agree that it would be nice to have a tunable default for 
> BLAS-agnostic dependent packages, but some dependent packages will 
> always rely on extensions or performance-enhancing features (e.g. 
> openmp) of a particular implementation.

Does the pkgsrc framework make this easy? When I introduce the global
variable for the BLAS, would you introduce PKG_OPTIONS.somepackage to
override the default (not including mk/blas.b3.mk but
math/openblas-openmp.b3.mk directly when the non-default option is
set)? Or just call bmake once with an overridden setting?

> There is *A LOT* of scientific software that depends on BLAS. Some works 
> better with OpenBLAS, others work better with ATLAS or MKL.

The relative performance between the implementations varies with the
actual use case (size of problems), sure … but the elephant in the room
is this:

> Most will 
> work OK with Netlib BLAS, but performance will likely be subpar and many 
> HPC users are averse to it.

Subpar is an understatement. If your computation spends most of its
time in BLAS+LAPACK, you are just wasting CPU cycles (taxpayer's money)
in an HPC system when using Netlib BLAS. You use Netlib BLAS during
development and testing or when the matrix math is only a small part of
what you do (the case with my own atmospheric model, where I only
invert small matrices once at the beginning … I'm not even using BLAS
for that). You simply do not use Netlib BLAS in production where CPU
time is of any interest.

When you make Netlib BLAS faster by optimizing the code for cache
sizes, vector instructions … unless that's just some compiler switches,
you end up with something that is not Netlib BLAS anymore. That's how
you get to OpenBLAS and friends. They are not competing with Netlib for
anything. One is the correct reference, the other is a fast application
of the same algorithm. The variants are not going to go away.

Actually I am a bit amazed at how badly an un-tuned BLAS performs, even
with compilers that can do auto-vectorization*. There is a lot to gain
by restructuring things to fit into caches, to really feed the vector
units in time. A small comparison with the attached program, just for
fun:

shell$ # Installed OpenBLAS libs in /stuff/src/OpenBLAS-0.2.20.libs
shell$ gcc -O -c wall_time.c
shell$ gfortran -O3 -o matrix_mult_single \
  -L/stuff/src/OpenBLAS-0.2.20.libs -Wl,-R/stuff/src/OpenBLAS-0.2.20.libs \
  -lopenblas_single \
  main.f90 matrix_mult.f90 wall_time.o
shell$ ./matrix_mult_single 2000
dgemm 2000x2000: time(simple) = 8.262 * time(BLAS)
shell$ # testing Netlib BLAS
shell$ LD_PRELOAD=/usr/lib/libblas.so.3 ./matrix_mult_single 2000
dgemm 2000x2000: time(simple) = .882 * time(BLAS)
shell$ # testing parallel OpenBLAS (4 cores)
shell$ LD_PRELOAD=/stuff/src/OpenBLAS-0.2.20.libs/libopenblas_openmp.so \
  ./matrix_mult_single 2000
dgemm 2000x2000: time(simple) = 27.278 * time(BLAS)

To sum up the gibberish: Netlib BLAS needs even more time than the
simple nested loops doing matrix multiplication, probably because
Netlib is built with -O2 instead of -O3. Serial OpenBLAS is 8 times as
fast as the simple loops, and the OpenMP build raises that to 27 times
on a 4-core CPU. So, a factor of 8 (or even 10) per core … buy a
cluster for 125000 € and use OpenBLAS, or use Netlib and spend
1000000 € to get the same work done. This is a much bigger deal than
what people get from switching compilers.
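
Note that matrix_mult.f90 is not pasted below with the other files. In
case you wonder what those simple nested loops look like, it is
essentially just this (a rough stand-in, not the exact attachment; loop
order chosen to match Fortran's column-major storage):

subroutine matrix_mult(n, a, b, c)

  implicit none
  integer, intent(in) :: n
  real(8), intent(in) :: a(n, n), b(n, n)
  real(8), intent(out) :: c(n, n)
  integer :: i, j, k

  ! plain triple loop, innermost index running over contiguous memory
  c(:, :) = 0
  do j = 1, n
     do k = 1, n
        do i = 1, n
           c(i, j) = c(i, j) + a(i, k) * b(k, j)
        end do
     end do
  end do

end subroutine matrix_mult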

Btw., some more perspective: Up to about 39×39 matrices, the naive
nested loops are faster than calling any BLAS here. We are really only
talking about problems involving bigger matrices and vectors. There are
funny effects with certain problem sizes as they jump in and out of
caches …
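
Just to illustrate what restructuring for caches means in its most
primitive form: a first step away from the plain loops is blocking,
roughly like this (a toy sketch; the name and block size are made up,
and the real OpenBLAS kernels do far more, like packing data and
driving the vector units explicitly):

subroutine matrix_mult_blocked(n, a, b, c)

  implicit none
  integer, intent(in) :: n
  real(8), intent(in) :: a(n, n), b(n, n)
  real(8), intent(out) :: c(n, n)
  integer, parameter :: nb = 64 ! block size, would need tuning to the cache
  integer :: i, j, k, jj, kk

  c(:, :) = 0
  ! for each (jj, kk) pair, the nb-by-nb block of b stays hot in cache
  ! while the matching column stripes of a and c are streamed through
  do jj = 1, n, nb
     do kk = 1, n, nb
        do j = jj, min(jj + nb - 1, n)
           do k = kk, min(kk + nb - 1, n)
              do i = 1, n
                 c(i, j) = c(i, j) + a(i, k) * b(k, j)
              end do
           end do
        end do
     end do
  end do

end subroutine matrix_mult_blocked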

I just wanted to give a little reference for the relative performance
we are talking about here. Computer science folks can sometimes have a
hard time believing that the same basic algorithm can show such huge
performance differences. They're busy counting and sorting the leaves
on their pretty trees while physicists are crunching numbers ;-) Getting
some optimized choice of BLAS into pkgsrc (or the option to use an
external one) is an absolute must if pkgsrc is to be of any use in HPC
beyond simply providing things like Gtk+ or HDF5.

> Allowing only one BLAS implementation to be installed at a time would 
> effectively eliminate pkgsrc as a viable deployment tool for 
> BLAS-dependent software.

Well, unless you bootstrap different prefixes. But you convinced me
that it makes sense and does not have to hurt to install the differing
variants next to each other.


Alrighty then,

Thomas

* Actually, I just observed that Netlib LAPACK (which also ships a copy
  of BLAS) by default strips -O3 down to -O2 in a release build. So
  auto-vectorization is not even present. But then, compilers easily fail
  at optimizing the Netlib code and produce broken code. This is a
  human's job :-(

PS: We might want to have a separate package for lapack-manpages. They
come in a separate tarball and I didn't see them in openblas.

-- 
Dr. Thomas Orgis
Universität Hamburg
RRZ / Basis-Infrastruktur / HPC
Schlüterstr. 70
20146 Hamburg
Tel.: 040/42838 8826
Fax: 040/428 38 6270
! Timing comparison: simple matrix_mult loops vs. BLAS dgemm
program main

  implicit none
  integer :: n
  real(8), dimension (:, :), allocatable :: a, b, c
  real(8), parameter :: zero = 0.0
  real(8), parameter :: one = 1.0
  real(8), external :: wall_time
  real(8) :: time1, time2
  character(10) :: arg

  if (command_argument_count() /= 1) then
     write(0,*) "Usage: matrix_mult N"
     call exit(1)
  endif

  call get_command_argument(1, arg)
  read(arg, *) n

  allocate(a(n, n))
  allocate(b(n, n))
  allocate(c(n, n))

  if (n == 2) then
     a(1, 1) = 1
     a(2, 1) = 2
     a(1, 2) = 3
     a(2, 2) = 4

     b(1, 1) = 5
     b(2, 1) = 6
     b(1, 2) = 7
     b(2, 2) = 8
     write(6, "(4f8.1)") a
     write(6, "(4f8.1)") b
  else
     a(:, :) = 1
     b(:, :) = 2
     c(:, :) = 0
  endif

  time1 = wall_time()
  call matrix_mult(n, a, b, c)
  time1 = wall_time() - time1

  if (n == 2) then
     write(6, "(4f8.1)") c
     c(:, :) = 0
  else
     a(:, :) = 1
     b(:, :) = 2
     c(:, :) = 0
  endif

  time2 = wall_time()
  call dgemm('n', 'n', n, n, n, one, a, n, b, n, zero, c, n)
  time2 = wall_time() - time2

  if (n == 2) then
     write(6, "(4f8.1)") c
  else
     write(6,"(a,2(i0,a),f0.3,a)") "dgemm ", n, "x", n, ": time(simple) = ", time1 / time2, " * time(BLAS)"
  endif

end

! Small correctness check (run with N = 2): prints matrix_mult and dgemm results
program main

  implicit none
  integer :: n
  real(8), dimension (:, :), allocatable :: a, b, c
  real(8), parameter :: zero = 0.0
  real(8), parameter :: one = 1.0
  character(10) :: arg

  if (command_argument_count() /= 1) then
     write(0,*) "Usage: matrix_mult N"
     call exit(1)
  endif

  call get_command_argument(1, arg)
  read(arg, *) n

  allocate(a(n, n))
  allocate(b(n, n))
  allocate(c(n, n))

  a(1, 1) = 1
  a(2, 1) = 2
  a(1, 2) = 3
  a(2, 2) = 4

  b(1, 1) = 5
  b(2, 1) = 6
  b(1, 2) = 7
  b(2, 2) = 8

  call matrix_mult(n, a, b, c)

  write(6, "(4f8.1)") a
  write(6, "(4f8.1)") b
  write(6, "(4f8.1)") c

  c(:, :) = 0

  !!!call dgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)

  call dgemm('n', 'n', n, n, n, one, a, n, b, n, zero, c, n)

  write(6, "(4f8.1)") c

end

# include <sys/time.h>

double wall_time(void)
{
    struct timeval val;
    struct timezone zone;

    gettimeofday(&val, &zone);

    return (double) val.tv_sec + (double) val.tv_usec * 1e-6;
}

double wall_time_(void)  /* for calls from Fortran */
{
    struct timeval val;
    struct timezone zone;

    gettimeofday(&val, &zone);

    return (double) val.tv_sec + (double) val.tv_usec * 1e-6;
}



