Today’s post will explore why the current ways to communicate between threads don’t scale, and show you a possible way to build scalable parallel programming on top of shared memory.
The problem with shared memory
Soon, we will have dozens, hundreds and then millions of cores in our computers. It’s inevitable, because individual cores just can’t get much faster. At some point, that’s going to mean that we have to rethink our architecture entirely, as millions of cores can’t all access a shared memory space efficiently. But millions of cores are still a long way off, and in the meantime we’ll see machines with dozens of cores, struggling with shared memory.
Alex’s tip: The best way for an application to make use of that increasing parallel power is to use a concurrency model like actors, that deals with synchronisation issues for you. Then, the maintainer of the actors framework can find the most efficient way to coordinate access to shared memory to allow your actors to pass messages to each other efficiently.
At the moment, NAct uses the .NET thread pool and a few locks to marshal messages. It works well on dual and quad core machines, but it won’t scale to more cores. Every time we use a lock, our core performs an atomic memory operation (eg. CAS) on a cell of memory representing the lock, so it’s sure that no other core can possibly have that lock. This is very fast when the lock isn’t contended, but we need to notify all the other cores, in case they held the cell of memory in a cache. As the number of cores increases, the total cost of a lock increases linearly.
A lot of work has been done on “lock-free” data structures, which avoid locks by using atomic memory operations directly. These give fairly dramatic performance improvements, particularly on systems with a few (2 to 4) cores. The .NET 4 concurrent collections in System.Collections.Concurrent are mostly lock-free. However, lock-free data structures still don’t scale indefinitely, because any use of an atomic memory operation still involves every core in the system.
A sync-free data structure
Some concurrent data structures are possible to write in a completely synchronization-free way, without using any atomic memory operations. One useful example is a single producer, single consumer (SPSC) queue. It’s easy to write a sync-free fixed size SPSC queue using a circular buffer*. Slightly trickier is a queue that grows as needed. You can use a linked list to represent the queue, but if you leave the nodes to be garbage collected once you’re done with them, the GC will need to involve all the cores in collecting the finished nodes. Instead, I’ve implemented a proof of concept inspired by this intel article which reuses the nodes by putting them in a second queue to send back to the producer.
* In all these cases, you need to use memory barriers correctly, but these are local to a core, so don’t have the same scalability problems as atomic memory operations.
I tried benchmarking my SPSC queue against the .NET ConcurrentQueue, and against a standard Queue protected by locks. In some ways, this isn’t a fair comparison, because both of these support multiple producers and multiple consumers, but I’ll come to that later.
I started on my dual-core laptop, running a simple test that had one thread producing 64 bit integers, and another consuming them, to measure the pure overhead of the queue. Both concurrent collections perform better than the lock-based one as expected, but there’s not a lot to choose between the ConcurrentQueue and my SPSC queue. I was a little disappointed, but then, the .NET Framework team spent a lot longer optimising it than I did.
So I dug out a more powerful machine that Red Gate’s DBA tools team had been using for testing. It is a 6 core Intel i7 machine with hyperthreading, adding up to 12 logical cores. Now the results get more interesting. As I increased the number of producer-consumer pairs to 6 (to saturate all 12 logical cores), the locking approach was slow, and got even slower, as you’d expect.
What I didn’t expect to be so clear was the drop-off in performance of the lock-free ConcurrentQueue. I could see the machine only using about 20% of available CPU cycles when it should have been saturated. My interpretation is that as all the cores used atomic memory operations to safely access the queue, they ended up spending most of the time notifying each other about cache lines that need invalidating.
The sync-free approach scaled perfectly, despite still working via shared memory, which after all, should still be a bottleneck. I can’t quite believe that the results are so clear, so if you can think of any other effects that might cause them, please comment!
Obviously, this benchmark isn’t realistic because we’re only measuring the overhead of the queue. Any real workload, even on a machine with 12 cores, would dwarf the overhead, and there’d be no point worrying about this effect. But would that be true on a machine with 100 cores?
Still to be solved.
The trouble is, you can’t build many concurrent algorithms using only an SPSC queue to communicate. In particular, I can’t see a way to build something as general purpose as actors on top of just SPSC queues. Fundamentally, an actor needs to be able to receive messages from multiple other actors, which seems to need an MPSC queue. I’ve been thinking about ways to build a sync-free MPSC queue out of multiple SPSC queues and some kind of sign-up mechanism. Hopefully I’ll have something to tell you about soon, but leave a comment if you have any ideas.
Edit: The example code is available on BitBucket.