On Thu, 19 Nov 2009 13:52:53 +0100
Johnny Billquist <bqt%softjar.se@localhost> wrote:
What I don't really understand is why you are creating that kind of
scenario. It seems like you have designed things in a backward way
from the start.
There is no way with pthreads to do what you want. You can wait on
condition variables, but as you noted, this also means an associated
mutex, which must be the same for all threads.
But what are you really doing that requires all 256 (or however
many) threads to be woken up and started exactly in parallel?
Seems to me that this is where you maybe should rethink. Do you
really have data for all 256 jobs ready at the same time, to be
started at the same time?
Have your worker threads each have their own condvar and mutex
instead, and fire each one off as soon as enough data for that
thread to work on is available.
Or don't even create the thread until the data is there for it to
work on, and don't do any waits at all.
Obviously I don't know enough about your program to say how it
can/should be done, but I can think of several alternative ways of
doing most stuff, compared to trying to start 256 threads exactly
simultaneously.
OK, let me elaborate a bit more.
You have a powerful multicore/SMP system that supports a total of 256
parallel threads (which will be common in a few years' time).
Say you have a large music file 'song1.wav' that you want to convert to
'song1.mp3'. The same approach has other uses too, e.g. data
compression, encryption, etc.
You call the mmap() function to map a big chunk (or the whole file)
into your process' address space. Let's also assume that the pages
for this file were cached in memory and are still valid. So, when
mmap() returns, you have a large memory buffer that contains all the
data you need, hence no need for slow disk I/O.
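Roughly, I picture the mapping step like this (just a sketch with
minimal error handling; real code would check every return value
properly):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("song1.wav", O_RDONLY);
    struct stat st;

    if (fd < 0 || fstat(fd, &st) != 0)
        return 1;

    /* Map the whole file read-only; if its pages are already
     * cached, the worker threads can read it with no disk I/O. */
    unsigned char *in = mmap(NULL, (size_t)st.st_size, PROT_READ,
        MAP_PRIVATE, fd, 0);
    if (in == MAP_FAILED)
        return 1;

    /* ... hand 'in' and st.st_size to the worker threads ... */

    munmap(in, (size_t)st.st_size);
    close(fd);
    return 0;
}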
Provided you have enough memory, what you would like to do now is
allocate a large output buffer, utilise 256 parallel threads to read
the input buffer, convert the data into mp3 format and write it to
the output buffer. When the conversion is finished, you hand off the
writing of the output buffer to disk (song1.mp3) to a slow I/O thread
in the background, so that the other threads can carry on with
compute-bound tasks.
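For the background write-out, a detached writer thread would do
(again just a sketch; the writeback structure is my own invention,
and it has to stay valid until the thread is done, so in real code
it would live on the heap or in static storage):

#include <pthread.h>
#include <stdio.h>

struct writeback {
    const char *path;          /* e.g. "song1.mp3" */
    const void *buf;           /* finished output buffer */
    size_t      len;
};

/* Runs in its own thread, so compute threads never block on disk. */
static void *writer(void *arg)
{
    struct writeback *wb = arg;
    FILE *fp = fopen(wb->path, "wb");

    if (fp != NULL) {
        fwrite(wb->buf, 1, wb->len, fp);
        fclose(fp);
    }
    return NULL;
}

static void start_writeback(struct writeback *wb)
{
    pthread_t tid;

    if (pthread_create(&tid, NULL, writer, wb) == 0)
        pthread_detach(tid);
}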
The problem now is how to find the most efficient way to partition
the input buffer into 256 segments and how to allocate the jobs to
256 parallel threads.
There are different methods you can use, e.g. keep a job queue, add
one job at a time to the queue, and have a thread pick up and
process each job, etc.
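To make the partitioning concrete, one way to carve the input into
per-thread jobs (a sketch; the struct job layout is just my
assumption, and the last segment absorbs the remainder):

#include <stddef.h>

#define NTHREADS 256

struct job {
    size_t off;                 /* segment start in the input buffer */
    size_t len;                 /* segment length in bytes */
};

static void partition(size_t total, struct job jobs[NTHREADS])
{
    size_t seg = total / NTHREADS;      /* even share per thread */

    for (int i = 0; i < NTHREADS; i++) {
        jobs[i].off = (size_t)i * seg;
        jobs[i].len = (i == NTHREADS - 1)
            ? total - jobs[i].off       /* last one takes the rest */
            : seg;
    }
}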
I was looking for a way to minimise the number of times you need to
call pthread_cond_signal() or pthread_mutex_lock(). It doesn't make
much difference on a 4-core CPU, but as the number of cores
increases, the overhead of those calls will quickly add up.
I thought a good way to start would be to assign a thread_id number
to each thread, so each thread would have a thread_id from 0 to 255.
Then all you do is provide pointers to the input and output buffers
and call pthread_cond_broadcast() to wake up all the threads. Each
thread would look at its thread_id and, based on that, read and
write only the buffer segment whose number corresponds to its
thread_id. This way the threads don't collide and don't modify
shared data.
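In outline, each worker would look something like this (a sketch;
encode_segment() is a stand-in for the real converter, and I'm
pretending output segments are the same size as input ones, which a
real mp3 encoder wouldn't give you):

#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NTHREADS 256

/* Hypothetical per-segment converter, standing in for the codec. */
void encode_segment(const unsigned char *src, unsigned char *dst,
    size_t n);

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  go   = PTHREAD_COND_INITIALIZER;
static int generation;             /* bumped once per parallel region */

static const unsigned char *in;    /* published before the broadcast */
static unsigned char *out;
static size_t seg;                 /* bytes per segment */

static void *worker(void *arg)
{
    int id = (int)(intptr_t)arg;   /* thread_id, 0..NTHREADS-1 */
    int seen = 0;

    for (;;) {
        pthread_mutex_lock(&lock);
        while (generation == seen) /* also guards spurious wakeups */
            pthread_cond_wait(&go, &lock);
        seen = generation;
        pthread_mutex_unlock(&lock);   /* each waker holds this briefly */

        /* This slice is private to the thread, so the conversion
         * itself needs no locking at all. */
        encode_segment(in + (size_t)id * seg,
            out + (size_t)id * seg, seg);
    }
    return NULL;
}

/* Dispatcher, once the buffers are ready: */
static void start_region(void)
{
    pthread_mutex_lock(&lock);
    generation++;
    pthread_cond_broadcast(&go);   /* one call wakes all the workers */
    pthread_mutex_unlock(&lock);
}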
There are a lot of details to work out, but the biggest issue is the
way in which pthread_cond_wait() is implemented. I think that calling
pthread_create() each time for a parallel region will incur too much
overhead, but if I instead pre-spawn all the threads and have them
call pthread_cond_wait(), then each of those threads, when woken up,
will lock the mutex, preventing the other threads from running until
that mutex is unlocked.
The best workaround I came up with was to divide all the threads into
groups, so 256 parallel threads would be arranged into something like
32 groups of 8 threads each. All 8 threads in a group would then
sleep on the same condition variable, minimising the 'thundering
herd' problem. You would then step through the 32 groups and
broadcast to the 8 threads in each group.
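Sketched out, the grouped version would be something like this (the
workers use the same generation-count wait loop as above, just with
one condvar and mutex per group):

#include <pthread.h>

#define NGROUPS           32
#define THREADS_PER_GROUP 8

struct group {
    pthread_mutex_t lock;       /* pthread_mutex_init() at startup */
    pthread_cond_t  go;         /* pthread_cond_init() likewise */
    int             generation;
};

static struct group groups[NGROUPS];

/* Worker i sleeps on groups[i / THREADS_PER_GROUP].go; the
 * dispatcher steps through the groups one broadcast at a time. */
static void start_region_grouped(void)
{
    for (int g = 0; g < NGROUPS; g++) {
        pthread_mutex_lock(&groups[g].lock);
        groups[g].generation++;
        pthread_cond_broadcast(&groups[g].go);  /* wakes 8 sleepers */
        pthread_mutex_unlock(&groups[g].lock);
    }
}

/* At most 8 freshly woken threads ever contend for one mutex, at
 * the price of 32 broadcast calls instead of one. */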
Does this make sense, or am I doing something completely wrong?