tech-pkg archive


Re: Default value for OPENBLAS_THREAD_LIMIT



On Tue, 01 Jun 2021 20:14:16 -0400,
Greg Troxel <gdt%lexort.com@localhost> wrote: 

> It strikes me that the #cores used should be a run-time config. 

The actual number of threads is determined at run-time by looking at
current CPU and environment variables, but there is some fixed storage
allocation. Upstream says:

# Note for package maintainers: you can build OpenBLAS with a large NUM_THREADS
# value (eg. 32-256) if you expect your users to use that many threads. Due to the way
# some internal structures are allocated, using a large NUM_THREADS value has a RAM
# footprint penalty, even if users reduce the actual number of threads at runtime.
# NUM_THREADS = 24

I assume there are performance reasons for not having certain
synchronization data structures as dynamic arrays. OpenBLAS is about
performance, and fixing some array sizes at build-time can really
help, especially if threads are involved. Flexibility hurts there.

> and this is the default without a config, I'd pick something like 32,

As I said, this number would be fine, as would be others. Debian is
rolling with 64 right now. FreeBSD uses 64. SUSE uses 64 normally and
256 for HPC builds.

I'd go with 64 then as a consensus … large enough for most desktop
machines for some years to come, I presume, while 32 might be slim for
them Threadrippers. It is non-obvious how much impact a too-large number
has when you need only a few threads. The relevant data structures
look like this:

./driver/level2/sbmv_thread.c:  blas_queue_t queue[MAX_CPU_NUMBER + 1];
./driver/level2/sbmv_thread.c:  BLASLONG range_m[MAX_CPU_NUMBER + 1];
./driver/level3/level3_gemm3m_thread.c:   BLASLONG working[MAX_CPU_NUMBER][CACHE_LINE_SIZE * DIVIDE_RATE];
./kernel/arm64/zdot_thunderx2t99.c:		char result[MAX_CPU_NUMBER * sizeof(double) * 2];

They are all over the place.

There's also this:

./driver/level3/level3_gemm3m_thread.c:
//The array of job_t may overflow the stack.
//Instead, use malloc to alloc job_t.
#if MAX_CPU_NUMBER > BLAS3_MEM_ALLOC_THRESHOLD
#define USE_ALLOC_HEAP
#endif

The current threshold value is 32, with the reason given as:

        * Reduced the default BLAS3_MEM_ALLOC_THRESHOLD (used as an upper
          limit for placing temporary arrays on the stack) to be compatible
          with a stack size of 1mb (as imposed by the JAVA runtime library) 

So, 32 might not be a bad default after all. But you _can_ build a PC
nowadays with a 64-core Threadripper. I'm a bit torn.

We could certainly choose a smaller default for 32-bit platforms
anyway (both core count and memory are more constrained there).

More voices?


Alrighty then,

Thomas

-- 
Dr. Thomas Orgis
HPC @ Universität Hamburg

