On Mon, 12 Mar 2018 12:48:55 -0500, Jason Bacon <outpaddling%yahoo.com@localhost> wrote:

> Trying to make them interchangeable would be somewhat like
> trying to make all POSIX operating systems interchangeable, but on a
> smaller scale.

This is a bit harsh. Sure, POSIX systems _are_ interchangeable, to some degree. The complexities of differing BLAS implementations only really start once you include parallelization, which changes the game when you link into an application that also does some kind of parallelization of its own.

The job of getting serial BLAS interchangeable needs some work, but it has been done already: look at Debian.

https://packages.debian.org/stretch/libblas.so.3

When you look closer at their libopenblas:

/usr/lib/libopenblas.so.0
/usr/lib/libopenblasp-r0.2.19.so
/usr/lib/openblas-base/libblas.so.3
/usr/lib/openblas-base/liblapack.so.3
/usr/share/doc/libopenblas-base/README.Debian
/usr/share/doc/libopenblas-base/TODO.Debian
/usr/share/doc/libopenblas-base/changelog.Debian.gz
/usr/share/doc/libopenblas-base/changelog.gz
/usr/share/doc/libopenblas-base/copyright

You notice that they have namespaced wrapper libs with the generic names libblas.so.3 and liblapack.so.3 (they come together nowadays) and switch among the variants via the alternatives system.

What you should also notice is the libopenblasp file. This is a parallel build, and the parallel variants are not included in the generic BLAS abstraction at all right now. They did have issues with libopenblasp switching between OpenMP and pthreads … different things broke with each of them. The situation is better nowadays, but the potential for conflict of the parallelization with a host application or other libs is of course still present. That is why I would not make a parallel BLAS the default setting. Only the local admin may choose to do so in the installation, or just choose it in a certain package like Jason is suggesting. I am stressing that the local admin shall make that choice, not the packager.

I am not proposing (anymore) that we really go the interchangeable libblas route in pkgsrc. We don't have to, as pkgsrc can (re)build things from source. I'd just like my global switch, and I am fine with offering all variants of BLAS to the users at the same time for building their own software.

> The most popular in my experience are Netlib BLAS (currently the de
> facto standard and the only implementation in pkgsrc), ATLAS, OpenBLAS,
> CBLAS

Netlib BLAS is not the de facto standard. It _is_ _the_ standard. It is the reference implementation of it, at least. Things like MKL may add functionality, but when they implement the functions of the BLAS set, they re-implement what Netlib does. You take Netlib and start optimizing the code, or at least the algorithm, for your architecture.

And, correct me if I'm really wrong, but CBLAS is _not_ a BLAS implementation. It is a standard for a C interface to a BLAS implementation, as LAPACKE is for the LAPACK part. OpenBLAS does come with a libcblas, but we can disable it and install the generic package. Not much is lost by that. The same goes for lapacke linking against liblapack (which is identical with libopenblas). We're talking about the choice of BLAS+LAPACK, as they belong together nowadays. On top of that choice you can install the C wrapper libraries libcblas and liblapacke.
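To make that layering concrete, a minimal C sketch (assuming cblas.h and some libcblas/libblas pair are installed; the file name and build line are only examples): the CBLAS layer just does the row-/column-major bookkeeping and then calls the Fortran-style dgemm of whatever BLAS is linked or preloaded, which is exactly why Debian can swap libblas.so.3 underneath without touching applications.

/* cblas_demo.c -- sketch: CBLAS is only a C calling convention on top of
 * whatever library provides the Fortran dgemm; swap libblas.so.3 (via the
 * alternatives system or LD_PRELOAD) and this program uses the other BLAS.
 * Example build: cc cblas_demo.c -lcblas -lblas                          */
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    /* 2x2 matrices in column-major order, the Fortran convention. */
    double a[4] = { 1, 2, 3, 4 };   /* A = [1 3; 2 4] */
    double b[4] = { 5, 6, 7, 8 };   /* B = [5 7; 6 8] */
    double c[4] = { 0 };
    int n = 2;

    /* C = 1.0 * A*B + 0.0 * C; forwarded to the underlying dgemm. */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);

    printf("%g %g\n%g %g\n", c[0], c[2], c[1], c[3]);   /* 23 31 / 34 46 */
    return 0;
}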
> I agree that it would be nice to have a tunable default for
> BLAS-agnostic dependent packages, but some dependent packages will
> always rely on extensions or performance-enhancing features (e.g.
> openmp) of a particular implementation.

Does the pkgsrc framework make this easy? When I introduce the global variable for the BLAS, would you introduce PKG_OPTIONS.somepackage to override the default (not including mk/blas.b3.mk but math/openblas-openmp.b3.mk directly when the non-default option is set)? Or just call bmake once with an overridden setting?

> There is *A LOT* of scientific software that depends on BLAS. Some works
> better with OpenBLAS, others work better with ATLAS or MKL.

The relative performance between the implementations varies with the actual use case (size of problems), sure … but the elephant in the room is this:

> Most will
> work OK with Netlib BLAS, but performance will likely be subpar and many
> HPC users are averse to it.

Subpar is an understatement. If your computation spends most of its time in BLAS+LAPACK, you are just wasting CPU cycles (taxpayers' money) on an HPC system when using Netlib BLAS. You use Netlib BLAS during development and testing, or when the matrix math is only a small part of what you do (the case with my own atmospheric model, where I only invert small matrices once at the beginning … I'm not even using BLAS for that). You simply do not use Netlib BLAS in production where CPU time is of any interest.

When you make Netlib BLAS faster by optimizing the code for cache sizes, vector instructions … unless that's just some compiler switches, you end up with something that is not Netlib BLAS anymore. That's how you get to OpenBLAS and friends. They are not competing with Netlib for anything. One is the correct reference, the other is a fast application of the same algorithm. The variants are not going to go away.

Actually, I am a bit amazed at how badly an un-tuned BLAS performs, even with compilers that can do auto-vectorization*. There is a lot to gain by restructuring things to fit into caches and to really feed the vector units in time. A small comparison with the attached program, just for fun:

shell$ # Installed OpenBLAS libs in /stuff/src/OpenBLAS-0.2.20.libs
shell$ gcc -O -c wall_time.c
shell$ gfortran -O3 -o matrix_mult_single \
         -L/stuff/src/OpenBLAS-0.2.20.libs -Wl,-R/stuff/src/OpenBLAS-0.2.20.libs \
         -lopenblas_single \
         main.f90 matrix_mult.f90 wall_time.o
shell$ ./matrix_mult_single 2000
dgemm 2000x2000: time(simple) = 8.262 * time(BLAS)
shell$ # testing Netlib BLAS
shell$ LD_PRELOAD=/usr/lib/libblas.so.3 ./matrix_mult_single 2000
dgemm 2000x2000: time(simple) = .882 * time(BLAS)
shell$ # testing parallel OpenBLAS (4 cores)
shell$ LD_PRELOAD=/stuff/src/OpenBLAS-0.2.20.libs/libopenblas_openmp.so \
         ./matrix_mult_single 2000
dgemm 2000x2000: time(simple) = 27.278 * time(BLAS)

To sum up the gibberish: Netlib BLAS needs even more time than the simple nested loops doing matrix multiplication, probably due to Netlib being built with -O2 instead of -O3. Serial OpenBLAS is 8 times as fast as the simple loops, and the parallel build raises that to 27 times on a 4-core CPU. So, a factor of 8 (or even 10) per core … buy a cluster for 125000 € and use OpenBLAS, or use Netlib and spend 1000000 € to get the same work done. This is a much bigger deal than what people get from switching compilers.

Btw., some more perspective: up to about 39×39 matrices, the naive nested loops are faster than calling any BLAS here. We are really only talking about problems involving bigger matrices and vectors. There are funny effects with certain problem sizes as they jump in and out of caches … I just wanted to give a little reference of the relative performance we are talking about here.
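For reference, the "simple nested loops" side of that comparison is essentially the following (a C sketch of the idea; the attached matrix_mult.f90 is not reproduced in this mail, so take this as an assumption about its structure rather than a copy). It runs the same O(n^3) algorithm as dgemm, just without any blocking for caches or care for the vector units, which is where the factor of 8 and more comes from:

/* naive_dgemm.c -- sketch of the naive multiply used as the timing baseline. */
#include <stddef.h>

/* C = A * B for column-major n x n matrices, like the Fortran code. */
void matrix_mult_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < n; i++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i + k * n] * b[k + j * n];
            c[i + j * n] = sum;
        }
    }
}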
Computer science folks can sometimes have a hard time believing that the same basic algorithm can have such huge performance differences. They're busy counting and sorting the leaves on their pretty trees while physicists are crunching numbers ;-)

Getting some optimized choice of BLAS into pkgsrc (or the option to use an external one) is an absolute must for it to be of any use in HPC beyond simply providing things like Gtk+ or HDF5.

> Allowing only one BLAS implementation to be installed at a time would
> effectively eliminate pkgsrc as a viable deployment tool for
> BLAS-dependent software.

Well, unless you bootstrap different prefixes. But you convinced me that it makes sense and does not have to hurt to install the differing variants next to each other.

Alrighty then,

Thomas

* Actually, I just observed that Netlib LAPACK (which also ships a copy of BLAS) by default strips -O3 down to -O2 in a release build, so auto-vectorization is not even present. But then, compilers easily fail at optimizing the Netlib code and produce broken binaries. This is a human's job :-(

PS: We might want to have a separate package for lapack-manpages. They come in a separate tarball and I didn't see them in openblas.

--
Dr. Thomas Orgis
Universität Hamburg
RRZ / Basis-Infrastruktur / HPC
Schlüterstr. 70
20146 Hamburg
Tel.: 040/42838 8826
Fax: 040/428 38 6270
! Timing driver (main.f90 in the compile line above): multiplies two n x n
! matrices with the simple matrix_mult routine (from matrix_mult.f90, not
! included in this mail) and with BLAS dgemm, and prints the runtime ratio.
! For n == 2 it prints the matrices instead, as a quick correctness check.
program main
  implicit none
  integer :: n
  real(8), dimension (:, :), allocatable :: a, b, c
  real(8), parameter :: zero = 0.0   ! beta for dgemm: C = alpha*A*B + beta*C
  real(8), parameter :: one = 1.0
  real(8), external :: wall_time
  real(8) :: time1, time2
  character(10) :: arg

  if (command_argument_count() /= 1) then
     write(0,*) "Usage: matrix_mult N"
     call exit(1)
  endif
  call get_command_argument(1, arg)
  read(arg, *) n

  allocate(a(n, n))
  allocate(b(n, n))
  allocate(c(n, n))

  if (n == 2) then
     a(1, 1) = 1
     a(2, 1) = 2
     a(1, 2) = 3
     a(2, 2) = 4
     b(1, 1) = 5
     b(2, 1) = 6
     b(1, 2) = 7
     b(2, 2) = 8
     write(6, "(4f8.1)") a
     write(6, "(4f8.1)") b
  else
     a(:, :) = 1
     b(:, :) = 2
     c(:, :) = 0
  endif

  ! Time the simple nested-loop multiplication.
  time1 = wall_time()
  call matrix_mult(n, a, b, c)
  time1 = wall_time() - time1

  if (n == 2) then
     write(6, "(4f8.1)") c
     c(:, :) = 0
  else
     a(:, :) = 1
     b(:, :) = 2
     c(:, :) = 0
  endif

  ! Time the same product via BLAS dgemm.
  time2 = wall_time()
  call dgemm('n', 'n', n, n, n, one, a, n, b, n, zero, c, n)
  time2 = wall_time() - time2

  if (n == 2) then
     write(6, "(4f8.1)") c
  else
     write(6, "(a,2(i0,a),f0.3,a)") "dgemm ", n, "x", n, &
          ": time(simple) = ", time1 / time2, " * time(BLAS)"
  endif
end
! Correctness check (run with N = 2): multiplies the hard-coded 2x2 matrices
! with the simple matrix_mult routine and then with dgemm, printing A, B and
! both results for comparison.
program main
  implicit none
  integer :: n
  real(8), dimension (:, :), allocatable :: a, b, c
  real(8), parameter :: zero = 0.0   ! beta for dgemm: C = alpha*A*B + beta*C
  real(8), parameter :: one = 1.0
  character(10) :: arg

  if (command_argument_count() /= 1) then
     write(0,*) "Usage: matrix_mult N"
     call exit(1)
  endif
  call get_command_argument(1, arg)
  read(arg, *) n

  allocate(a(n, n))
  allocate(b(n, n))
  allocate(c(n, n))

  ! Hard-coded 2x2 test matrices (column-major assignments).
  a(1, 1) = 1
  a(2, 1) = 2
  a(1, 2) = 3
  a(2, 2) = 4
  b(1, 1) = 5
  b(2, 1) = 6
  b(1, 2) = 7
  b(2, 2) = 8

  call matrix_mult(n, a, b, c)
  write(6, "(4f8.1)") a
  write(6, "(4f8.1)") b
  write(6, "(4f8.1)") c

  c(:, :) = 0
  !!!call dgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
  call dgemm('n', 'n', n, n, n, one, a, n, b, n, zero, c, n)
  write(6, "(4f8.1)") c
end
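For reference, the hard-coded data above is A = [1 3; 2 4] and B = [5 7; 6 8] (the assignments are column-major), so both the simple routine and dgemm should print the same product C = A*B = [23 31; 34 46].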
#include <sys/time.h>

double wall_time(void)
{
    struct timeval val;
    struct timezone zone;

    gettimeofday(&val, &zone);
    return (double) val.tv_sec + (double) val.tv_usec * 1e-6;
}

double wall_time_(void) /* for calls from Fortran */
{
    struct timeval val;
    struct timezone zone;

    gettimeofday(&val, &zone);
    return (double) val.tv_sec + (double) val.tv_usec * 1e-6;
}