Read-Log-Update – A Lightweight Synchronization Mechanism for Concurrent Programming
In this project we introduce read-log-update (RLU), a novel extension of the popular read-copy-update (RCU) synchronization mechanism that supports scalability of concurrent code by allowing unsynchronized sequences of reads to execute concurrently with updates. RLU overcomes the major limitations of RCU by allowing, for the first time, concurrency of reads with multiple writers, and providing automation that eliminates most of the programming difficulty associated with RCU programming. At the core of the RLU design is a logging and coordination mechanism inspired by software transactional memory algorithms.
In a collection of micro-benchmarks in both the kernel and user space, we show that RLU both simplifies the code and matches or improves on the performance of RCU. As an example of its power, we show how it readily scales the performance of a real-world application, Kyoto Cabinet, a truly difficult concurrent programming feat to attempt in general, and in particular with classic RCU.
ThreadScan – Automatic and Scalable Concurrent Memory Reclamation
The concurrent memory reclamation problem is that of devising a way for a deallocating thread to verify that no other concurrent threads hold references to a memory block being deallocated. To date, there is no satisfactory solution to this problem: existing tracking methods like hazard pointers, reference counters, or epoch-based techniques like RCU, are either prohibitively expensive or require significant programming expertise, to the extent that implementing them efficiently can be worthy of a publication. None of the existing techniques are automatic or even semi-automated.
In this project, we take a radical new approach to concurrent memory reclamation: instead of manually tracking access to memory locations as done in techniques like hazard pointers, or restricting shared accesses to specific epoch boundaries as in RCU, our algorithm, called ThreadScan, uses operating system signaling and page protection to automatically detect which memory locations are being accessed.
Initial empirical evidence of using ThreadScan shows that it scales surprisingly well, and requires negligible programming effort beyond the standard use of Malloc and Free.
RH-NOrec – Safe and Scalable Hybrid Transactional Memory
Because of hardware TM limitations, software fallbacks are the only way to make TM algorithms guarantee progress. Nevertheless, all known software fallbacks to date, from simple locks to sophisticated versions of the NOrec Hybrid TM algorithm, have either limited scalability or weakened semantics.
This work presents a novel reduced-hardware (RH) version of the NOrec HyTM algorithm. Instead of an all-software slow path, in our RH NOrec the slow-path is a “mix” of hardware and software: one short hardware transaction executes a maximal amount of initial reads in the hardware, and the second executes all of the writes. This novel combination of the RH approach and the NOrec algorithm delivers the first Hybrid TM that scales while fully preserving the hardware’s original semantics of opacity and privatization.
Our GCC implementation of RH NOrec is promising in that it shows improved performance relative to all prior methods, at the concurrency levels we could test today.
The SprayList: A Scalable Relaxed Priority Queue
High-performance concurrent priority queues are essential for applications such as task scheduling and discrete event simulation. Unfortunately even the best performing implementations do not scale past a number of threads in the single digits. This is because of the sequential bottleneck in accessing the elements at the head of the queue in order to perform a DeleteMin operation. In this paper, we present the SprayList, a scalable priority queue with relaxed ordering semantics. Starting from a nonblocking SkipList, the main innovation behind our design is that the DeleteMin operations avoid a sequential bottleneck by “spraying” themselves onto the head of the SkipList list in a coordinated fashion. The spraying is implemented using a carefully designed random walk, so that DeleteMin always returns an element among the first O(p polylog p) in the list, where p is the number of threads. We prove that the expected running time of a DeleteMin operation is poly-logarithmic in p, independent of the size of the list, and also provide analytic upper bounds on the number of possible priority inversions for an element. Our experiments show that the relaxed semantics allow the data structure to scale for very high thread counts, comparable to a classic unordered SkipList. Furthermore, we observe that, for reasonably parallel workloads, the scalability benefits of relaxation considerably outweigh the additional work due to out-of-order execution.
StackTrack – An Automated Transactional Approach to Concurrent Memory Reclamation
Dynamic memory reclamation is arguably the biggest open problem in concurrent data structure design: all known solutions induce high overhead, or must be customized to the specific data structure by the programmer, or both. This work presents StackTrack, the first concurrent memory reclamation scheme that can be applied automatically by a compiler, while maintaining efficiency. StackTrack eliminates most of the expensive bookkeeping required for memory reclamation by leveraging the power of hardware transactional memory (HTM) in a new way: it tracks thread variables dynamically, and in an atomic fashion. This effectively makes all memory references visible without having threads pay the overhead of writing out this information. Our empirical results show that this new approach matches or outperforms prior, non-automated, techniques.