tech-userlevel archive


Memory alignment

I wrote a simple program to see how memory alignment affects
performance on SMP hardware. I ran it on a dual Pentium 3 machine; as far
as I know, the Pentium 3 has a 32-byte cache line.

The program creates 2 threads and runs them in parallel. Each thread
increments a variable in a loop 100 million times. I intentionally did
not create a local copy of the variables for each thread, because I
wanted to see the effect of false sharing.

I also had to turn off GCC optimisations:
 gcc -O0 test.c -lpthread
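
An alternative to disabling optimisation altogether, sketched here as an
assumption rather than something measured in this post, is to access the
counter through a volatile-qualified pointer, so that the compiler has to
perform every load and store even at higher optimisation levels. The name
bump_volatile is hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical helper: increment *num exactly `iterations` times.
 * The volatile qualifier forces a real load and store on every pass,
 * so an optimising compiler cannot collapse the loop into one addition.
 */
uint32_t bump_volatile(volatile uint32_t *num, uint32_t iterations)
{
        uint32_t i;

        for (i = 0; i < iterations; i++)
                (*num)++;

        return *num;
}
```

Whether the timings would match the -O0 numbers is untested; the point is
only that the memory traffic stays in the generated code.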

Below is the average time I got when the two integers were aligned on
a 32-byte boundary:
p3smp$ time ./a.out
&num0_ptr=0xbfbfe6c4, num0_ptr % 32=0
&num1_ptr=0xbfbfe6c0, num1_ptr % 32=0

Delta time usec = 981357
        0.98 real         1.91 user         0.00 sys

Below is the average time I got when the two integers were not aligned
on a 32-byte boundary:
p3smp$ time ./a.out
&num0_ptr=0xbfbfe6b4, num0_ptr % 32=28
&num1_ptr=0xbfbfe6b0, num1_ptr % 32=24

Delta time usec = 4904224
        4.90 real         9.66 user         0.00 sys

As you can see, it takes almost 5 times longer to execute the same code
when the two integers sit next to each other and are not aligned on a
32-byte boundary, so that both fall into the same cache line. I guess
the difference will be even more dramatic on a machine with more
processors.
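
To make the cache-line argument concrete, here is a minimal sketch that
checks whether two addresses fall into the same line. The function names
and the example addresses in the note below are hypothetical; only the
32-byte line size and the remainders 24 and 28 come from the runs above.

```c
#include <stdint.h>

/* Index of the cache line that contains addr. */
uintptr_t cache_line_index(uintptr_t addr, uintptr_t line_size)
{
        return addr / line_size;
}

/* Non-zero when a and b live in the same cache line. */
int same_cache_line(uintptr_t a, uintptr_t b, uintptr_t line_size)
{
        return cache_line_index(a, line_size) ==
            cache_line_index(b, line_size);
}
```

Two made-up addresses 0x1018 and 0x101c (remainders 24 and 28 modulo 32)
give the same index, so writes from two CPUs would contend for one line.
Two distinct addresses that are both multiples of 32, as in the aligned
run, always give different indices.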

Apart from allocating memory dynamically via posix_memalign(), does
anyone know of other ways to make sure variables are aligned on a 32- or
64-byte boundary? I've tried the GCC aligned attribute, but it does not
guarantee alignment to that particular boundary; quite often the data
ends up aligned to only 8 or 16 bytes.
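
Two common workarounds, sketched here under assumptions and not taken
from any reply in this thread, are to round a pointer up by hand inside
an over-sized buffer, and to pad each counter out to a full cache line.
CACHE_LINE, align_up and struct padded_counter are hypothetical names,
and the 32-byte line size is the Pentium 3 figure assumed above.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32   /* assumed line size; adjust for other CPUs */

/* Round p up to the next multiple of align (align must be a power of two). */
void *align_up(void *p, size_t align)
{
        uintptr_t addr = (uintptr_t)p;

        return (void *)((addr + align - 1) & ~((uintptr_t)align - 1));
}

/* One counter per cache line: the padding keeps neighbours apart. */
struct padded_counter {
        uint32_t value;
        uint8_t  pad[CACHE_LINE - sizeof(uint32_t)];
};
```

Applied to a buffer that is CACHE_LINE - 1 bytes larger than needed,
align_up() returns an address on a line boundary regardless of where the
buffer itself starts, and consecutive struct padded_counter objects
placed there occupy separate lines.
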

#include <pthread.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Comment this out to build the misaligned variant */
#define ALIGNED

/* Thread start function */
void *tfunc(void *arg)
{
        uint32_t *num = (uint32_t *)arg;
        const int max = 100000000;
        int i;

        for(i = 0; i < max; i++) {
                /* Increment the counter, as described in the text above */
                (*num)++;
                if(*num == max)
                        *num = 0;
        }

        return NULL;
}

int main(void)
{
        pthread_t tid0, tid1;
        struct timespec tstart, tend;
        uint64_t dtime;

#ifdef ALIGNED
        /* Aligned on 32-byte boundary */
        uint32_t *num0_ptr, *num1_ptr;
        if(posix_memalign((void **)&num0_ptr, 32, sizeof(uint32_t)) != 0 ||
            posix_memalign((void **)&num1_ptr, 32, sizeof(uint32_t)) != 0)
                return 1;
        *num0_ptr = *num1_ptr = 0;
#else
        /* Misaligned */
        uint8_t pad[7];
        uint32_t num0, num1;
        num0 = num1 = 0;
        uint32_t *num0_ptr = &num0;
        uint32_t *num1_ptr = &num1;
#endif

        setbuf(stdout, NULL);
        /* Note: &num0_ptr is the address of the pointer variable itself */
        printf("&num0_ptr=%p, ", (void *)&num0_ptr);
        printf("num0_ptr %% 32=%u\n", (unsigned)((uintptr_t)num0_ptr % 32));
        printf("&num1_ptr=%p, ", (void *)&num1_ptr);
        printf("num1_ptr %% 32=%u\n", (unsigned)((uintptr_t)num1_ptr % 32));

        clock_gettime(CLOCK_REALTIME, &tstart);
        pthread_create(&tid0, NULL, &tfunc, num0_ptr);
        pthread_create(&tid1, NULL, &tfunc, num1_ptr);
        pthread_join(tid0, NULL);
        pthread_join(tid1, NULL);
        clock_gettime(CLOCK_REALTIME, &tend);

        /* Convert end and start times to microseconds and
         * calculate the difference */
        dtime = (((uint64_t)tend.tv_sec * 1000000) +
                        ((uint64_t)tend.tv_nsec / 1000)) -
                        (((uint64_t)tstart.tv_sec * 1000000) +
                        ((uint64_t)tstart.tv_nsec / 1000));

        printf("\nDelta time usec = %" PRIu64 "\n", dtime);

        return 0;
}
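
The elapsed-time arithmetic in the listing can also be factored into a
small helper. This is a minimal sketch of the same conversion; the
function names are hypothetical and not part of the original program.

```c
#include <stdint.h>
#include <time.h>

/* Convert a timespec to whole microseconds (sub-microsecond part dropped). */
uint64_t timespec_to_usec(const struct timespec *ts)
{
        return (uint64_t)ts->tv_sec * 1000000 +
            (uint64_t)ts->tv_nsec / 1000;
}

/* Elapsed microseconds from *start to *end; assumes end is not before start. */
uint64_t timespec_delta_usec(const struct timespec *start,
    const struct timespec *end)
{
        return timespec_to_usec(end) - timespec_to_usec(start);
}
```

For example, a start of 1 s + 900000000 ns and an end of 3 s +
100000000 ns give 3100000 - 1900000 = 1200000 microseconds. Where it is
available, CLOCK_MONOTONIC is not affected by wall-clock adjustments and
may be a safer choice than CLOCK_REALTIME for interval timing.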

