tech-pkg archive


Re: Deciding on which variant(s) of OpenBLAS library to install



On Mon, 12 Mar 2018 22:26:55 -0500,
Jason Bacon <outpaddling%yahoo.com@localhost> wrote:

> There have always been differences in how dependent software
> functions with different BLAS/LAPACK implementations, and my world
> view will be altered if that ever changes.

Can you give examples? I do not mean parallelization issues. So far, I
can only imagine a certain BLAS being buggy in a place that affects
only a few applications. I'd expect a fix for that bug (a minor version
bump) to resolve it, then. Are there examples of applications that
simply won't work properly with a certain BLAS, regardless of
patchlevel?

> What I envision is a mk/blas interface for agnostic dependents
> and the ability for packages to bypass it entirely and go
> straight to, say, openblas/buildlink3.mk.

While I agree that the option should exist, I insist that the package
should use the default BLAS if the admin set one, unless the admin also
sets a package option that overrides it. So, I am OK with a non-default
per-package option to override the global BLAS choice. That option could
be phrased in a way that indicates a preference for a certain BLAS for
that package. Would you be fine with that?

> One of the issues this raises is that defaulting to something other than
> Netlib BLAS will be necessary to achieve good performance.  But some
> applications have issues with OpenBLAS while others have issues with ATLAS.

See above … do you have data on these? That might also go on that wiki
exists …

> The packager will become aware of this and should have the ability to select
> certain high-performance implementations while blacklisting others.

Hm. There could be BLAS_RECOMMENDED or similar for the package to set,
at least giving a warning if the admin default does not match.
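That could be as simple as a check in a shared blas.mk; again,
BLAS_RECOMMENDED and PKGSRC_BLAS_DEFAULT are invented names for the
sketch:

```make
# Hypothetical fragment for a shared mk/blas.mk (invented variable names).
# Warn when the package's recommendation disagrees with the global default.
.if defined(BLAS_RECOMMENDED) && \
    ${BLAS_RECOMMENDED} != ${PKGSRC_BLAS_DEFAULT:Unetlib}
.warning ${PKGNAME} is known to prefer ${BLAS_RECOMMENDED}, but the \
	global BLAS default is ${PKGSRC_BLAS_DEFAULT:Unetlib}.
.endif
```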

> Do you have any data on the performance gains of O3 over O2?
> For most programs it's around zero, but if it actually makes a difference
> for BLAS, we should deal with this.

Well, a hint was contained in my tests: you had my naive multiplication
loop compiled with -O3 needing 80% of the runtime of Netlib BLAS with
-O2. Again with -march=native -O3 (for AVX fun on the Ivy Bridge):

shell$ LD_PRELOAD=/usr/lib/libblas.so ./matrix_mult_single 2000
dgemm 2000x2000: time(simple) = .718 * time(BLAS)

So it needs about 70% of the runtime now, where Netlib was built using
-march=native -O2. Level 3 is very interesting with gcc as it enables
auto-vectorization; with O2 you don't get your loops turned into AVX at
all. That alone would suggest a factor of 4 or higher for wider AVX,
possibly FMA. But since things are often limited by memory bandwidth,
the gain looks more modest unless the compiler knows that things fit
into the caches. I witnessed that with my small matrix example: without
knowing the matrix size, gcc will not use the packed AVX instructions
and only gets modest gains with O3. But I can get a factor of 2 to 3 by
fixing a small problem size in the code, or even by doing this:

if (n <= 64) {
    the_loop_for_multiplication
} else {
    the_loop_for_multiplication
}

I have the exact same code twice in there. My GCC 7.2 uses the
information of n <= 64 in the first branch to assume that data is in
caches and starts packing vectors for mul/add. The second branch does
not get that treatment (I counted AVX instructions with objdump).
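The actual code was not posted in this thread, so here is a hypothetical
C reconstruction of the duplicated-branch trick; the function and
variable names are mine, and whether the first branch really gets packed
AVX depends on compiler version and flags (the behavior described above
was observed with GCC 7.2 and -O3 -march=native):

```c
#include <stddef.h>

/* Hypothetical reconstruction of the trick described above: the very
 * same triple loop appears in both branches.  With -O3 -march=native,
 * the compiler can use the n <= 64 range information in the first
 * branch to assume the working set is cache-resident and emit packed
 * AVX mul/add; the second, identical branch does not get that
 * treatment. */
void matmul_hinted(size_t n, const double *a, const double *b, double *c)
{
    if (n <= 64) {
        for (size_t i = 0; i < n; ++i)
            for (size_t j = 0; j < n; ++j) {
                double sum = 0.0;
                for (size_t k = 0; k < n; ++k)
                    sum += a[i*n + k] * b[k*n + j];
                c[i*n + j] = sum;
            }
    } else {
        for (size_t i = 0; i < n; ++i)
            for (size_t j = 0; j < n; ++j) {
                double sum = 0.0;
                for (size_t k = 0; k < n; ++k)
                    sum += a[i*n + k] * b[k*n + j];
                c[i*n + j] = sum;
            }
    }
}
```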

Long story short: By itself, O3 does not bring _that_ much of an
improvement. And as it may be dangerous with Netlib code, we probably
should let the O2 stay in there. The real gains come from using SSE/AVX
and ensuring that the problem is cut into pieces that utilize the
caches, so as not to be limited by main memory bandwidth. Intel CPUs
especially can chew much more than they can bite off at a time. All the
vector units in the world are useless when you don't feed them enough
data. That is why the gap between theoretical peak performance and
actual application performance is so wide nowadays. Working around the
limited memory bandwidth is the main work that went into the optimized
BLAS libraries.
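As a rough illustration of that cache-blocking idea (a loop-tiling
sketch, not any real BLAS library's code; the names and the tile size
are assumptions):

```c
#include <stddef.h>

/* A minimal sketch of cache blocking (loop tiling) for C = A*B,
 * row-major, n x n.  The tile size is an assumption: it should be
 * chosen so that a few TILE x TILE blocks fit in cache.  Optimized
 * BLAS libraries add packing, register blocking and prefetching on
 * top, but the principle is the same: reuse data while it is in
 * cache instead of streaming everything from main memory. */
#define TILE 64

void matmul_blocked(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n * n; ++i)
        c[i] = 0.0;
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE) {
                size_t imax = ii + TILE < n ? ii + TILE : n;
                size_t kmax = kk + TILE < n ? kk + TILE : n;
                size_t jmax = jj + TILE < n ? jj + TILE : n;
                /* Multiply one block of A by one block of B and
                 * accumulate into the corresponding block of C. */
                for (size_t i = ii; i < imax; ++i)
                    for (size_t k = kk; k < kmax; ++k) {
                        const double aik = a[i*n + k];
                        for (size_t j = jj; j < jmax; ++j)
                            c[i*n + j] += aik * b[k*n + j];
                    }
            }
}
```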

> > PS: We might want to have a separate package for lapack-manpages. They
> > come in a separate tarball and I didn't see them in openblas.
> >  
> That may be a good idea.  Is there only one source for open-source
> LAPACK manpages, as far as you know?

I would be very surprised to find out that someone rewrote the official
manpages from Netlib. They are the standard that the LAPACK
implementations adhere to.


Alrighty then,

Thomas

-- 
Dr. Thomas Orgis
Universität Hamburg
RRZ / Basis-Infrastruktur / HPC
Schlüterstr. 70
20146 Hamburg
Tel.: 040/42838 8826
Fax: 040/428 38 6270



