tech-kern archive


Re: Dynamically allocated locks



In message <1C737379-4C12-42E3-A0E8-90CF6C7340D7%3am-software.com@localhost>,
Matt Thomas writes:

>>
>>> Other than that it's fine.  I can see not crossing a cacheline
>>> boundary, but only 1 / cacheline is just wrong.
>>
>> It really does matter on MP systems. If you were to take a machine
>> and have all CPUs doing nothing but modifying operations on data
>> within a single cache line, then in the best case the machine would
>> in effect be single threaded: it would do no better than a single
>> CPU system. In the worst case it could be a lot slower due to (for
>> example) costly writebacks to main memory.

>I understand that, but in reality we are only talking about a handful
>of locks per cacheline.  If the system has only a few locks, this
>would be bad, but as we go to finer-grained locking, the chances of
>this being a problem drop dramatically.  For high-contention locks
>(like a global lock), one per cache line makes sense; for
>low-contention locks (like a socket lock), multiple per cacheline
>makes sense.


Matt, I think that's just about backwards. I apologize in advance
for a long-winded discussion of why I think that.

For NetBSD, let's talk about CC/NUMA systems.  With coarse-grained
locks (one big lock, in the degenerate case) every lock-op on that
lock is contentious. A large fraction of lock-ops in an MP system is
going to pay the penalty of pulling the cache line containing the
lock out of some other CPU's cache.  On many architectures, pulling
data out of another CPU's cache is more expensive than a normal
cache miss.

Now suppose we retune the OS to use fine-grained locking. The intent
of the fine-grained locking is to avoid all that contention,
serialization, and (painfully expensive) synchronization traffic.
So far, so good.

But now suppose we "salt" each of those fine-grained locks with some
other lock object in the same cache line.  Now a CPU that's happily
working on, say, *my* socket, acquiring and releasing the lock on my
socket, ends up causing cache-line-granularity synchronization
contention with any CPU that's working with the "other" lock in that
cache line.  Bad, bad.  (A whiteboard to draw on would make this much
clearer; the sketch below will have to do.)
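
To make that concrete, here is a minimal C sketch.  It assumes a
64-byte cache line, and "lock_t" and the struct names are placeholders
for illustration, not our actual kernel types:

    #define CACHE_LINE_SIZE 64  /* assumed line size */

    /*
     * Bad: two unrelated locks packed into the same cache line.
     * A CPU acquiring and releasing a_lock dirties the line and
     * forces it out of every other CPU's cache, even if those
     * CPUs only ever touch b_lock.
     */
    struct packed_locks {
        lock_t a_lock;      /* my socket's lock */
        lock_t b_lock;      /* your socket's lock, same line */
    };

    /*
     * Better: pad each lock out to a full cache line, so lock-ops
     * on one lock never invalidate the line holding another.
     * (Assumes sizeof(lock_t) < CACHE_LINE_SIZE.)
     */
    struct padded_lock {
        lock_t lock;
        char   pad[CACHE_LINE_SIZE - sizeof(lock_t)];
    };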


As for the chances of this being a problem "dropping dramatically":
if the system has a low long-term rate of lock allocation, and you
get a burst of allocation of long-lived locks, those locks are going
to end up packed together into just a few cache lines, sharing those
lines with each other indefinitely. I don't buy your argument.


>Maybe two pools, one for each type?

If cache lines are really that scarce, better not to have pools for
locks at all: embed each lock in the object which it locks, and make
sure that the (lock + object) is at least a cache line in size.
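
Roughly like this (again just a sketch, with a placeholder lock_t and
GCC-style alignment):

    /*
     * The lock lives with the data it protects.  Aligning the whole
     * object to a cache-line boundary means no unrelated lock can
     * share its line, and acquiring the lock tends to pull the
     * object's data into the local cache as a side effect.
     */
    struct socket_like {
        lock_t s_lock;      /* protects the fields below */
        int    s_state;
        /* ... rest of the object ... */
    } __attribute__((__aligned__(CACHE_LINE_SIZE)));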

If that's not feasible, and (as we have) there's a legitimate need to
support uniprocessor systems with small caches (ARM, embedded
P54C/P55 or comparable AMD cores?), then make lock-padding to
cache-line granularity a configurable compile-time option.
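
Something along these lines, where the option and macro names are
invented for illustration:

    #ifdef LOCK_PAD_TO_CACHELINE    /* hypothetical config option */
    #define LOCK_PAD char _pad[CACHE_LINE_SIZE - sizeof(lock_t)];
    #else
    #define LOCK_PAD /* empty: small-cache uniprocessor config */
    #endif

    struct dyn_lock {
        lock_t lock;
        LOCK_PAD
    };

A uniprocessor kernel then pays nothing for the padding, while an MP
kernel gets one lock per cache line.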

