tech-kern archive
Re: Dynamically allocated locks
In message <1C737379-4C12-42E3-A0E8-90CF6C7340D7@3am-software.com>,
Matt Thomas writes:
>>
>>> Other than that it's fine. I can see not crossing a cacheline
>>> boundary, but only one lock per cacheline is just wrong.
>>
>> It really does matter on MP systems. If you were to take a machine
>> and have all CPUs doing nothing but modifying operations on data
>> within a single cache line, then in the best case the machine would
>> in effect be single-threaded: it would do no better than a single
>> CPU system. In the worst case it could be a lot slower due to (for
>> example) costly writebacks to main memory.
>I understand that, but in reality we are only talking about a handful
>of locks per cacheline. If the system has only a few locks, this
>would be bad, but as we go to finer-grained locking, the chances of
>this being a problem drop dramatically. For high-contention locks
>(like a global lock), one per cacheline makes sense; for
>low-contention locks (like a socket lock), multiple per cacheline
>makes sense.
Matt, I think that's just about backwards. I apologize, in advance,
for a long-winded discussion of why I think that.
For NetBSD, let's talk about CC/NUMA systems. With coarse-grained
locks --- one big lock, in the degenerate case --- every lock-op on
that lock is contentious. A large fraction of lock-ops in an MP system
is going to pay the penalty of pulling the cache-line containing the
lock out of some other CPU's cache. On many architectures, pulling
data out of another CPU's cache is more expensive than a normal cache
miss.
Now suppose we retune the OS to use fine-grained locking. The intent
of the fine-grained locking is to avoid all that contention,
serialization, and (painfully expensive) synchronization traffic.
So far, so good.
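To make the two extremes concrete, here is a minimal userland sketch;
the struct and function names are hypothetical, and pthread mutexes
stand in for kernel locks:

#include <pthread.h>

struct socket {
        pthread_mutex_t so_lock;   /* per-object lock (fine-grained) */
        int             so_state;  /* data the lock protects */
};

/*
 * Coarse-grained: one big lock serializes every socket operation, so
 * every CPU contends for the single cache line holding big_lock.
 */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

void
socket_op_coarse(struct socket *so)
{
        pthread_mutex_lock(&big_lock);
        so->so_state++;            /* ... work on *so ... */
        pthread_mutex_unlock(&big_lock);
}

/*
 * Fine-grained: each socket carries its own lock, so CPUs working on
 * different sockets never need to exchange the lock's cache line.
 */
void
socket_op_fine(struct socket *so)
{
        pthread_mutex_lock(&so->so_lock);
        so->so_state++;
        pthread_mutex_unlock(&so->so_lock);
}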
But now suppose we "salt" each of those fine-grained locks with some
other lock object. Now a CPU that's happily working on, say, *my*
socket, acquiring and releasing the lock on my socket, ends up causing
cache-line-granularity synchronization contention against any CPU
that's working with the "other" lock in that cache-line.
Bad, bad. (A whiteboard to draw on would make this much clearer.)
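In lieu of a whiteboard, here is a minimal userland demonstration of
the effect; everything in it (the adjacent-spinlock layout, the
assumption that both locks land in one 64-byte line) is illustrative,
not kernel code:

#include <pthread.h>
#include <stdio.h>

#define NITER   10000000

/*
 * Two unrelated spin locks adjacent in memory: with typical 4-byte
 * spinlocks and 64-byte lines, both land in the same cache line.
 */
static struct {
        pthread_spinlock_t a;      /* "my" socket's lock */
        pthread_spinlock_t b;      /* the "other" lock salted beside it */
} shared;

static void *
hammer(void *arg)
{
        pthread_spinlock_t *l = arg;

        /*
         * Each thread touches only its own lock, yet every acquire
         * and release dirties the shared line and bounces it between
         * the two CPUs' caches.
         */
        for (int i = 0; i < NITER; i++) {
                pthread_spin_lock(l);
                pthread_spin_unlock(l);
        }
        return NULL;
}

int
main(void)
{
        pthread_t t1, t2;

        pthread_spin_init(&shared.a, PTHREAD_PROCESS_PRIVATE);
        pthread_spin_init(&shared.b, PTHREAD_PROCESS_PRIVATE);
        pthread_create(&t1, NULL, hammer, &shared.a);
        pthread_create(&t2, NULL, hammer, &shared.b);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("done\n");
        return 0;
}

Run it once as-is, then once with each lock padded out to its own
cache line; on a real MP machine the padded layout should finish
noticeably faster.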
As for the chances of this "dropping dramatically": if the system
has a low long-term rate of lock allocation, and you get a burst
of allocation of long-lived locks, those long-lived locks end up
packed together into just a few cachelines. I don't buy your
argument.
>maybe two pools, one for each type?
If cache-lines are really that scarce, better to not have pools for
locks at all: embed each lock in the object which it locks, and make
sure that the (lock + object) is at least a cache-line in size.
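A minimal sketch of that layout, assuming a 64-byte coherency unit
(the names here are illustrative, not NetBSD's):

#include <pthread.h>

#define CACHE_LINE      64         /* assumed coherency unit */

struct conn {
        pthread_mutex_t c_lock;    /* lock embedded in the object... */
        int             c_state;   /* ...right next to what it guards */
};

/*
 * Round (lock + object) up to a whole number of cache lines so that
 * two independently-locked objects can never share a line.  The
 * allocator still has to hand these out line-aligned for the
 * rounding to help.
 */
union conn_padded {
        struct conn     c;
        char            pad[(sizeof(struct conn) + CACHE_LINE - 1)
                            / CACHE_LINE * CACHE_LINE];
};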
If that's not feasible, and (as we have) there's a legitimate need to
support uniprocessor systems with small caches (arm, embedded P54C/P55
or comparable AMD cores?), then make lock-padding to cache-line
granularity a configurable compile-time option.
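One way to sketch such an option; OPT_LOCK_PAD and the 64-byte figure
are hypothetical, not an existing kernel option:

#include <pthread.h>

#if defined(OPT_LOCK_PAD)          /* MP build: one lock per line */
#define LOCK_ALIGNED    __attribute__((__aligned__(64)))
#else                              /* UP/small-cache build: no waste */
#define LOCK_ALIGNED    /* nothing */
#endif

/*
 * With OPT_LOCK_PAD defined, each lock occupies (at least) its own
 * cache line; without it, locks pack tightly for small-cache systems.
 */
struct padded_lock {
        pthread_mutex_t pl_mutex;
} LOCK_ALIGNED;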