tech-kern archive


Re: Is there a way to obtain a machine's cache line size?



On Fri, 21 Jan 2011 12:12:42 +0800
Dennis Ferguson <dennis.c.ferguson%gmail.com@localhost> wrote:

> 
> On 21 Jan 2011, at 04:29 , Sad Clouds wrote:
> > On Thu, 20 Jan 2011 11:59:03 +0800
> > Dennis Ferguson <dennis.c.ferguson%gmail.com@localhost> wrote:
> > 
> >> Hello,
> >> 
> >> Is there a way to obtain the correct cache line size for the
> >> machine code is running on, both in the kernel and at user level?
> >> I see there is a compile time constant CACHE_LINE_SIZE in
> >> <sys/param.h> which currently seems to always be set to 64, but
> >> I'm pretty certain that is not necessarily a correct value.  For
> >> example, I'm pretty sure that for the PowerPC 32 and 128 are
> >> possibilities, and the same binary could run on machines with
> >> either line size so there is no correct compile-time answer.
> > 
> > You probably won't find many processors with cache lines greater
> > than 64 bytes. If you're optimising for a particular processor, read
> > technical manuals to find out the size of cache lines, then simply
> > define CACHE_LINE_SIZE or whatever compile time constant you're
> > using to a different value.
> 
> I'm not sure about other processors but I think all 64-bit PowerPC
> processors have 128 byte data cache lines;  the G5 certainly does.
> Other models have 32 or 64 byte data cache lines, so the same (32-bit)
> binary could be run on machines with any of those cache line sizes,
> or could be running on a uniprocessor where the best thing to do (at
> least when the concern is false sharing in threaded programs) is to
> ignore cache lines altogether.  I don't see a good reason to optimize
> a program for just one of those cases when, if you could just obtain
> the right number at run time (which the operating system should
> know) you could do the best thing for all of them.
> 
> The fact is, though, that the issue of false sharing is for the most
> part architecture and processor independent.  Data caches on modern
> (and even old) multiprocessors all work pretty much the same way, have
> exactly the same issues with sharing, and have their performance
> improved by exactly the same considerations.  With a single cache
> miss costing many hundreds of instructions, getting data which could
> make good use of the cache off of cache lines which are, by program
> design, bound to be frequently invalidated can easily make a
> significant difference. If you can just obtain the right number at
> run time for the machine you are running on it is usually fairly
> simple to write code which always does the right thing.
> 
> Dennis Ferguson

Well, I'm sure most people on this list are well aware of what false
sharing is. The problem is that CPU topology and caching semantics are
very architecture specific. I don't think that adding code to
dynamically determine the best memory layout is going to produce
massive speed improvements.
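
For reference, where the system does expose the value, the run-time
query itself is short. A minimal sketch in C, assuming glibc's
_SC_LEVEL1_DCACHE_LINESIZE sysconf extension; it is not a portable
interface, so the #ifdef falls back to a fixed value everywhere else.
The names cache_line_size() and FALLBACK_LINE_SIZE are hypothetical,
made up for this example:

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Assumed fallback; 64 mirrors the CACHE_LINE_SIZE value mentioned in
 * the thread and is only a guess for any given machine. */
#define FALLBACK_LINE_SIZE 64

static_assert((FALLBACK_LINE_SIZE & (FALLBACK_LINE_SIZE - 1)) == 0,
    "fallback must be a power of two");

/* Hypothetical helper: return the L1 data cache line size in bytes,
 * or the fallback when the system does not report a usable value. */
static size_t
cache_line_size(void)
{
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
	/* glibc extension; may yield 0 or -1 when the size is unknown. */
	long n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);

	if (n > 0 && (n & (n - 1)) == 0)
		return (size_t)n;
#endif
	return FALLBACK_LINE_SIZE;
}
```

Getting the number is the easy part; the points below are about what
it costs to act on it.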

1. CPU and cache designs vary, e.g. you may have 8 CPUs each with its
own private L2 cache, or you may have 4 CPUs each with 2 cores sharing
a common L2 cache.

2. Bloating your data structures with padding to avoid false sharing
may result in more cache misses. This could be a serious performance
issue.

3. Dynamically allocating data to force its layout onto a particular
memory boundary will require dereferencing a pointer each time you need
to access that data. You will get better performance if you add padding
to your data structure, so that the layout is determined at compile
time.
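
To make points 2 and 3 concrete, here is a rough sketch in C11 of the
two layouts. LINE_SIZE is an assumed stand-in for CACHE_LINE_SIZE, and
the struct and function names are hypothetical, invented only for this
illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define LINE_SIZE 64	/* assumed line size, fixed at compile time */

/* Unpadded: several of these fit in, and so share, one cache line. */
struct counter {
	unsigned long value;
};

/* Padded at compile time: one counter per line, accessed directly as
 * an array element, but each element now occupies LINE_SIZE bytes. */
struct padded_counter {
	_Alignas(LINE_SIZE) unsigned long value;
};

static_assert(sizeof(struct padded_counter) == LINE_SIZE,
    "one counter per line");

/* Run-time layout: n slots spaced `line` bytes apart. `line` must be
 * a power of two that the allocator accepts as an alignment. */
static void *
alloc_counters(size_t line, size_t n)
{
	return aligned_alloc(line, line * n);
}

/* Every access goes through the base pointer plus a computed offset. */
static unsigned long *
counter_slot(void *base, size_t line, size_t i)
{
	return (unsigned long *)((char *)base + i * line);
}
```

The padded struct costs LINE_SIZE bytes per counter instead of
sizeof(unsigned long), which is the bloat from point 2. The run-time
variant can follow whatever line size the machine reports, but it pays
for that with the pointer arithmetic on every access (point 3).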

