Other Projects

From SuperTech Wiki

Revision as of 05:56, 3 June 2011 by Neboat (Talk | contribs)

Location-Based Memory Fences

Traditional memory fences are program-counter (PC) based. That is, a memory fence enforces a serialization point in the program instruction stream: it ensures that all memory references before the fence in program order have taken effect before execution continues to the instructions after the fence. Such PC-based memory fences always cause the processor to stall, even when the synchronization is unnecessary during a particular execution. We propose the concept of location-based memory fences, which aim to reduce the synchronization cost that memory-fence latency imposes on parallel algorithms.

Unlike a PC-based memory fence, a location-based memory fence serializes the instruction stream of the executing thread only when a different thread attempts to read the memory location guarded by the fence. In this work, we describe a hardware mechanism for location-based memory fences, prove its correctness, and evaluate its potential performance benefit. Our experimental results are based on a software simulation of the proposed location-based memory fence, which incurs higher overhead than the proposed hardware mechanism would. Although applications using the software prototype do not scale as well as those using traditional memory fences, because of the software overhead, our experiments show that applications can benefit from using location-based memory fences. These results suggest that hardware support for location-based memory fences is worth considering.
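
As an illustration of the PC-based fence described above, the following C++ sketch shows a Dekker-style handshake in which each thread sets its own flag, executes a fence, and then reads the other thread's flag. The handshake itself and the names try_enter and leave are assumptions chosen for illustration; they do not come from this work. The std::atomic_thread_fence call runs on every attempt, even when no other thread ever reads the flag, which is the cost a location-based fence aims to avoid.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: try to enter a critical section guarded by two flags.
// 'mine' is the calling thread's flag, 'other' is the peer thread's flag.
bool try_enter(std::atomic<int>& mine, std::atomic<int>& other) {
    // Announce intent to enter.
    mine.store(1, std::memory_order_relaxed);

    // PC-based fence: executed on every call, whether or not the peer
    // thread is actually contending for the critical section.
    std::atomic_thread_fence(std::memory_order_seq_cst);

    // Enter only if the peer has not announced intent.
    if (other.load(std::memory_order_relaxed) == 0) {
        return true;
    }

    // Back off so the peer can make progress.
    mine.store(0, std::memory_order_relaxed);
    return false;
}

// Hypothetical helper: leave the critical section entered via try_enter.
void leave(std::atomic<int>& mine) {
    assert(mine.load(std::memory_order_relaxed) == 1);  // caller must hold it
    mine.store(0, std::memory_order_release);
}
```

In use, two threads would call try_enter with the flag arguments swapped. Because both threads execute a sequentially consistent fence between their store and their load, at least one of them observes the other's flag as set, so at most one enters.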

Deterministic Parallel Random-Number Generation

Existing concurrency platforms for dynamic multithreading, such as Cilk and TBB, do not provide repeatable parallel random-number generators. We propose that a mechanism called pedigrees be built into the runtime system to enable efficient deterministic parallel random-number generation, and in this work we design an efficient variant of the pedigree mechanism. Experiments with the MIT Cilk runtime system show that the overhead for this mechanism is minimal. On a suite of 10 benchmarks, the overhead of Cilk with pedigrees relative to the original Cilk has a geometric mean of 2%. We also explore library implementations of several deterministic parallel random-number generators (DPRNGs) that use these runtime mechanisms, based on a generalization of linear congruential generators, XORing entries in a random table, SHA1, and Feistel networks. Although these DPRNGs are 3 to 18 times slower per function call than a nondeterministic parallel version of the popular Mersenne Twister, in practical applications that use random numbers the additional overhead from using an efficient, high-quality DPRNG is relatively small.
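
To make the idea concrete, here is a minimal C++ sketch of deriving a random number from a pedigree. It assumes that a pedigree can be represented as the sequence of spawn ranks on the path from the root of the computation to the current strand; that representation, the Pedigree type, and the pedigree_random function are hypothetical and are not the runtime interface from this work. The mixing step uses the SplitMix64 finalizer as a stand-in; it is not one of the generators listed above.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical pedigree representation: one rank per level of the spawn tree.
using Pedigree = std::vector<std::uint64_t>;

// SplitMix64 finalizer, used here only as a stand-in mixing function.
// Each step is invertible, so distinct inputs give distinct outputs.
inline std::uint64_t mix64(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Hash the seed and the pedigree into a 64-bit value. The result depends
// only on (seed, pedigree), not on which worker runs the strand or when.
std::uint64_t pedigree_random(std::uint64_t seed, const Pedigree& pedigree) {
    std::uint64_t h = mix64(seed);
    for (std::uint64_t rank : pedigree) {
        h = mix64(h ^ rank);  // fold in one level of the spawn tree
    }
    return h;
}
```

Because the value is a pure function of the seed and the pedigree, two runs of the same program with the same seed would see the same random numbers regardless of how the scheduler distributes work, provided the runtime supplies the same pedigrees.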
