tech-userlevel archive
Memory alignment
I wrote a simple program to see how memory alignment affects
performance on SMP hardware. I ran it on a dual Pentium 3 machine; as
far as I know, the Pentium 3 has a 32-byte cache line.
The program creates 2 threads and runs them in parallel. Each thread
increments a variable in a loop 100 million times. I intentionally did
not create a local copy of the variables for each thread, because I
wanted to see the effect of false sharing.
I also had to turn off GCC optimisations:
gcc -O0 test.c -lpthread
Below is the average time I got when the two integers were each aligned
on a 32-byte boundary:
p3smp$ time ./a.out
&num0_ptr=0xbfbfe6c4, num0_ptr % 32=0
&num1_ptr=0xbfbfe6c0, num1_ptr % 32=0
Delta time usec = 981357
0.98 real 1.91 user 0.00 sys
Below is the average time I got when the two integers were not aligned
on a 32-byte boundary:
p3smp$ time ./a.out
&num0_ptr=0xbfbfe6b4, num0_ptr % 32=28
&num1_ptr=0xbfbfe6b0, num1_ptr % 32=24
Delta time usec = 4904224
4.90 real 9.66 user 0.00 sys
As you can see, the same code takes almost 5 times longer to execute
when the two integers sit next to each other in the same cache line
instead of each being aligned on its own 32-byte boundary. I guess the
difference will be even more dramatic on a machine with more
processors.
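To make the reasoning explicit: two addresses share a cache line when
dividing each by the line size gives the same result. The sketch below
assumes the 32-byte line size mentioned above; CACHE_LINE,
cache_line_index() and same_cache_line() are names made up for
illustration and are not part of the test program.

```c
#include <stdint.h>

/* Assumed cache line size in bytes (the Pentium 3 figure quoted above). */
#define CACHE_LINE 32u

/* Index of the cache line that contains addr. */
static uintptr_t cache_line_index(uintptr_t addr)
{
    return addr / CACHE_LINE;
}

/* Nonzero if both addresses fall inside the same cache line. */
static int same_cache_line(uintptr_t a, uintptr_t b)
{
    return cache_line_index(a) == cache_line_index(b);
}
```

Two adjacent 4-byte integers at offsets 24 and 28 within a line share
that line, so every write by one CPU invalidates the copy held by the
other. Two integers that each start on their own 32-byte boundary
cannot share a line.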
Apart from allocating memory dynamically via posix_memalign(), does
anyone know of other ways to make sure variables are aligned on a 32-
or 64-byte boundary? I've tried the GCC aligned attribute, but it does
not guarantee alignment to the requested boundary; quite often the
data ends up aligned to only 8 or 16 bytes.
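One approach that does not depend on compiler attributes is to reserve
more storage than needed and round the address up by hand. This is
only a sketch under assumptions: a 32-byte line size, a power-of-two
alignment, and a platform where converting a pointer to uintptr_t and
back behaves as on flat-address machines. CACHE_LINE, align_up() and
aligned_counter() are hypothetical names, not an existing API.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32u                      /* assumed line size, power of two */
#define WORDS_PER_LINE (CACHE_LINE / sizeof(uint32_t))

/* Round p up to the next multiple of align (align must be a power of two). */
static void *align_up(void *p, size_t align)
{
    uintptr_t addr = (uintptr_t)p;
    return (void *)((addr + (align - 1)) & ~(uintptr_t)(align - 1));
}

/* Three lines of storage: two usable slots plus slack for the rounding. */
static uint32_t raw_storage[3 * WORDS_PER_LINE];

/* Return counter idx (0 or 1); each one starts on its own cache line. */
static uint32_t *aligned_counter(int idx)
{
    uint32_t *base = align_up(raw_storage, CACHE_LINE);
    return base + (size_t)idx * WORDS_PER_LINE;
}
```

In the test program, aligned_counter(0) and aligned_counter(1) could be
passed to the two threads in place of the posix_memalign() pointers.
The cost is one wasted cache line of static storage.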
#include <pthread.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ALIGNED

/* Thread start function: increments the counter passed in arg */
void *tfunc(void *arg)
{
    uint32_t *num = (uint32_t *)arg;
    const int max = 100000000;
    int i;

    for (i = 0; i < max; i++)
    {
        if (*num == (uint32_t)max)
            *num = 0;
        (*num)++;
    }
    return NULL;
}

int main(void)
{
    pthread_t tid0, tid1;
    struct timespec tstart, tend;
    uint64_t dtime;

#ifdef ALIGNED
    /* Aligned on 32-byte boundary */
    void *mem0, *mem1;
    if (posix_memalign(&mem0, 32, sizeof(uint32_t)) != 0 ||
        posix_memalign(&mem1, 32, sizeof(uint32_t)) != 0)
        return 1;
    uint32_t *num0_ptr = mem0;
    uint32_t *num1_ptr = mem1;
    *num0_ptr = *num1_ptr = 0;
#else
    /* Misaligned */
    uint8_t pad[7];
    uint32_t num0, num1;
    num0 = num1 = 0;
    uint32_t *num0_ptr = &num0;
    uint32_t *num1_ptr = &num1;
    (void)pad;  /* only there to shift the stack layout */
#endif

    setbuf(stdout, NULL);

    /*
       Note: %p prints the address of the pointer variable itself;
       the modulus is taken on the address the pointer holds.
    */
    printf("&num0_ptr=%p, ", (void *)&num0_ptr);
    printf("num0_ptr %% 32=%u\n", (unsigned)((uintptr_t)num0_ptr % 32));
    printf("&num1_ptr=%p, ", (void *)&num1_ptr);
    printf("num1_ptr %% 32=%u\n", (unsigned)((uintptr_t)num1_ptr % 32));

    clock_gettime(CLOCK_REALTIME, &tstart);

    pthread_create(&tid0, NULL, &tfunc, num0_ptr);
    pthread_create(&tid1, NULL, &tfunc, num1_ptr);
    pthread_join(tid0, NULL);
    pthread_join(tid1, NULL);

    clock_gettime(CLOCK_REALTIME, &tend);

    /*
       convert end and start times to microseconds and calculate the
       difference
    */
    dtime = (((uint64_t)tend.tv_sec * 1000000) +
             ((uint64_t)tend.tv_nsec / 1000)) -
            (((uint64_t)tstart.tv_sec * 1000000) +
             ((uint64_t)tstart.tv_nsec / 1000));

    printf("\nDelta time usec = %" PRIu64 "\n", dtime);

    return 0;
}