On Mon, 12 Mar 2018 22:26:55 -0500, Jason Bacon <outpaddling%yahoo.com@localhost> wrote:

> There have always been differences in how dependent software
> functions with different BLAS/LAPACK implementations and my world view
> will be altered if that ever changes.

Can you give examples? I do not mean parallelization issues. So far, I
can only imagine a certain BLAS being buggy in a place that affects only
a few applications. I'd expect that a fix for that bug (a minor version
bump) would have taken care of it, then. Are there examples of
applications that simply won't work properly with a certain BLAS,
regardless of patch level?

> What I envision is a mk/blas interface for agnostic dependents
> and the ability for packages to bypass it entirely and go
> straight to, say, openblas/buildlink3.mk.

While I agree that the option should exist, I insist that the package
should use a default BLAS if the admin set one, unless the admin also
sets a package option that overrides it. So, I am OK with a non-default
per-package option to override a global BLAS choice. That option could
be phrased in a way that indicates a preference for a certain BLAS for
that package. You'd be fine with that?

> One of the issues this raises is that defaulting to something other than
> Netlib BLAS will be necessary to achieve good performance. But some
> applications have issues with OpenBLAS while others have issues with ATLAS.

See above … data on these? That might also go on that wiki page, if it
exists …

> The packager will become aware of this and should have the ability to select
> certain high-performance implementations while blacklisting others.

Hm. There could be BLAS_RECOMMENDED or similar for the package to set,
at least giving a warning if the admin default does not match.

> Do you have any data on the performance gains of O3 over O2?
> For most programs it's around zero, but if it actually makes a difference
> for BLAS, we should deal with this.

Well, a hint was contained in my tests: You had my naive multiplication
loop compiled with -O3 needing 80% of the runtime of Netlib BLAS with
-O2. Again with -march=native -O3 (for AVX fun on the Ivy Bridge):

    shell$ LD_PRELOAD=/usr/lib/libblas.so ./matrix_mult_single 2000
    dgemm 2000x2000: time(simple) = .718 * time(BLAS)

So it needs about 70% of the runtime now, where Netlib was built using
-march=native -O2.

Level 3 (-O3) is very interesting with gcc as it enables
auto-vectorization. With O2 you don't get your loops turned into AVX at
all. Thing is, that alone would mean that we are talking about a factor
of 4 or higher for wider AVX, possibly FMA. But since things are often
limited by the memory bandwidth, the gain looks more modest unless the
compiler knows that things fit into caches.

I witnessed that with my small-matrix example. Without knowing the
matrix size, gcc will not use the packed AVX instructions and only gets
modest gains with O3. But I can get a factor of 2 to 3 by fixing a
small problem size in the code or even by doing this:

    if(n <= 64) then
       the_loop_for_multiplication
    else
       the_loop_for_multiplication
    end if

I have the exact same code twice in there. My GCC 7.2 uses the
information of n <= 64 in the first branch to assume that data is in
caches and starts packing vectors for mul/add. The second branch does
not get that treatment (I counted AVX instructions with objdump; a
fuller C sketch of the trick follows below).

Long story short: By itself, O3 does not bring _that_ much of an
improvement. And as it may be dangerous with Netlib code, we probably
should let the O2 stay in there.
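To make that concrete, here is roughly what the doubled-loop trick looks
like as a self-contained C sketch. The function name, row-major layout
and so on are just for illustration; this is not my exact test code:

    #include <stddef.h>

    /* Naive triple loop, duplicated verbatim in both branches. The
       n <= 64 bound in the first branch is what lets GCC 7.2 assume the
       data fits into caches and emit packed AVX mul/add; the identical
       second branch does not get that treatment. */
    static void matmul_small_hint(double *c, const double *a,
                                  const double *b, size_t n)
    {
        if (n <= 64) {
            for (size_t i = 0; i < n; ++i)
                for (size_t j = 0; j < n; ++j) {
                    double sum = 0.0;
                    for (size_t k = 0; k < n; ++k)
                        sum += a[i*n + k] * b[k*n + j];
                    c[i*n + j] = sum;
                }
        } else {
            /* Exact same loop again. */
            for (size_t i = 0; i < n; ++i)
                for (size_t j = 0; j < n; ++j) {
                    double sum = 0.0;
                    for (size_t k = 0; k < n; ++k)
                        sum += a[i*n + k] * b[k*n + j];
                    c[i*n + j] = sum;
                }
        }
    }

Compile with -O3 -march=native and compare the AVX instruction counts of
the two branches in the objdump -d output, as described above.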
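And on the blocking theme: the essential move the optimized BLAS
libraries make against the memory bandwidth limit is to compute the
product in cache-sized tiles. A minimal, untuned illustration (the block
size of 64 is an arbitrary pick here, not a tuned value):

    #include <stddef.h>

    enum { BS = 64 }; /* arbitrary tile edge for this sketch */

    /* Untuned sketch of cache blocking: work on BS x BS tiles so each
       tile's working set stays in cache instead of streaming the whole
       matrices from main memory on every pass. */
    static void matmul_blocked(double *c, const double *a,
                               const double *b, size_t n)
    {
        for (size_t i = 0; i < n*n; ++i)
            c[i] = 0.0;
        for (size_t ii = 0; ii < n; ii += BS)
            for (size_t kk = 0; kk < n; kk += BS)
                for (size_t jj = 0; jj < n; jj += BS)
                    for (size_t i = ii; i < ii+BS && i < n; ++i)
                        for (size_t k = kk; k < kk+BS && k < n; ++k) {
                            double aik = a[i*n + k];
                            for (size_t j = jj; j < jj+BS && j < n; ++j)
                                c[i*n + j] += aik * b[k*n + j];
                        }
    }

Real BLAS kernels add register blocking, packing and prefetching on top
of this, but the tiling is what keeps the vector units fed from cache.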
The real gains come from using SSE/AVX and ensuring that the problem is
cut into pieces that utilize the caches, so as not to be limited by main
memory bandwidth. Intel CPUs especially can chew much more than they can
bite off at a time: all the vector units in the world are useless when
you don't feed them enough data. That is why the gap between theoretical
peak performance and actual application performance is so wide nowadays.
Working around the limited memory bandwidth is the main work that went
into the optimized BLAS libraries.

> > PS: We might want to have a separate package for lapack-manpages. They
> > come in a separate tarball and I didn't see them in openblas.
>
> That may be a good idea. Is there only one source for open source
> lapack manpages as far

I would be very surprised to find out that someone rewrote the official
manpages from netlib. They are the standard that the LAPACK
implementations adhere to.

Alrighty then,

Thomas

--
Dr. Thomas Orgis
Universität Hamburg
RRZ / Basis-Infrastruktur / HPC
Schlüterstr. 70
20146 Hamburg
Tel.: 040/42838 8826
Fax: 040/428 38 6270