Boston Area Architecture Workshop (BARC-2003)

Cambridge Marriott, January 30, 2003


8:45am: Introduction
9am - 9:30am: Keynote (Session Chair: Joel Emer)

The Transactional Manifesto: Toward a Low-Level API for Scalable Synchronization

Maurice Herlihy, Brown University Computer Science

Locking has well-known scalability problems, and is increasingly recognized as a software engineering quagmire. In response, researchers have investigated alternatives such as lock-free (and wait-free) synchronization.

Although modern architectures provide primitives powerful enough in principle to support any kind of lock-free synchronization, there is still an inherent performance mismatch between conventional architectures and the kinds of scalable synchronization demanded by emerging applications.

Software designers want lock-free transactions that span multiple memory locations. Modern architectures, however, provide effective support only for optimistic transactions on a single memory location. What is to be done?

Let's split the difference. I challenge hardware architects to provide efficient, optimistic, multi-location transactions ("transactional memory"). All the basic mechanisms already exist; you just need to combine them effectively. If you build it, the scalable applications will flock to your architecture.

Are you equal to the task?
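
To make the mismatch concrete, here is a minimal sketch using C++11 atomics as a stand-in for the hardware primitives. The single-location update below maps directly onto compare-and-swap; the two-location transfer does not, and is exactly what hardware transactional memory would make expressible (the begin/end calls are hypothetical):

```cpp
#include <atomic>

// Single-location optimistic update: directly supported by hardware
// compare-and-swap (or load-linked/store-conditional).
void increment(std::atomic<int>& counter) {
    int old_val = counter.load();
    // Retry until no other thread intervened between the load and the CAS;
    // compare_exchange_weak refreshes old_val on failure.
    while (!counter.compare_exchange_weak(old_val, old_val + 1)) { }
}

// Multi-location transaction: no single CAS can update both accounts
// atomically. With the transactional memory Herlihy calls for, it would
// look like this (begin_transaction/end_transaction are hypothetical):
void transfer(std::atomic<int>& from, std::atomic<int>& to, int amount) {
    // begin_transaction();    // start optimistic execution
    from.fetch_sub(amount);    // these two updates are NOT atomic
    to.fetch_add(amount);      // as a pair without HTM support
    // end_transaction();      // commit both, or abort and retry
}
```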

Slides
Contact email: herlihy@cs.brown.edu
Project webpage


9:30am - 10:30am: Security and Reliability (Session Chair: Dave Kaeli)

Secure Hardware Processors

G. Edward Suh, Dwaine Clarke, Blaise Gassend, Marten van Dijk, Srinivas Devadas
MIT Laboratory for Computer Science

We discuss two hardware primitives for building a processor that is secure from physical attacks as well as software attacks: memory integrity verification and memory encryption. Integrity verification ensures that the off-chip memory has not been altered by an adversary, and encryption protects the privacy of off-chip data. These primitives enable many security-sensitive applications, such as copy-proof software and certified execution, which need to operate in a hostile environment or with untrusted users. They can also provide security without trusting the operating system. Finally, we describe how these primitives can be implemented with minimal performance overhead, making them practical for primary processors.
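
The abstract does not spell out the mechanism, but a standard way to realize off-chip integrity verification is a hash tree whose root lives in trusted on-chip storage. The sketch below is an illustrative toy under that assumption, with a non-cryptographic stand-in hash:

```cpp
#include <cstdint>
#include <vector>

// Non-cryptographic stand-in hash (FNV-1a over two words); a real design
// would use a collision-resistant hash function.
uint64_t hash_pair(uint64_t left, uint64_t right) {
    uint64_t words[2] = {left, right};
    uint64_t h = 14695981039346656037ull;
    for (uint64_t v : words)
        for (int i = 0; i < 8; ++i) {
            h ^= (v >> (8 * i)) & 0xff;
            h *= 1099511628211ull;
        }
    return h;
}

// Binary hash tree over off-chip memory: leaves hold hashes of memory
// blocks, internal nodes hash their children, and only the root is kept
// on-chip. tree[1] is the root; node i has children 2i and 2i+1.
// Verifying a block walks one leaf-to-root path.
bool verify_block(const std::vector<uint64_t>& tree, size_t leaf,
                  uint64_t trusted_on_chip_root) {
    uint64_t h = tree[leaf];
    for (size_t node = leaf; node > 1; node /= 2) {
        size_t sibling = node ^ 1;
        h = (node % 2 == 0) ? hash_pair(h, tree[sibling])
                            : hash_pair(tree[sibling], h);
    }
    return h == trusted_on_chip_root;  // mismatch => tampering detected
}
```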

Slides
Contact email: suh@mit.edu
Project webpage

Mondriaan Memory Protection

Emmett Witchel and Krste Asanovic
MIT Laboratory for Computer Science

Mondriaan Memory Protection (MMP) is a hardware/software co-design for fine-grained memory protection that allows multiple protection domains to flexibly share memory and export protected services. Through careful engineering, we provide word-level protection with less than 9% space overhead for the protection tables and less than 8% overhead in additional memory references to those tables, for a wide variety of applications. We are currently applying MMP to Linux kernel modules to enforce memory access safety. We also believe that MMP will enable services to export more highly optimized interfaces that are still safe (e.g., having the user write directly into the kernel's buffer cache).
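
A rough sketch of what a word-level permission check involves. The structure below is a simplification we supply for illustration; the actual design uses compressed multi-level permission tables plus a hardware Protection Lookaside Buffer so that most checks avoid the table walk entirely:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Word-granularity permissions, 2 bits each; the encoding is assumed.
enum Perm : uint8_t { NONE = 0, READ_ONLY = 1, READ_WRITE = 2, EXEC_READ = 3 };

// Hypothetical simplified protection table for one domain: a map from
// 1024-word blocks to a vector of per-word permissions.
struct ProtectionDomain {
    std::unordered_map<uint64_t, std::vector<uint8_t>> table;

    Perm lookup(uint64_t word_addr) const {
        auto it = table.find(word_addr / 1024);
        if (it == table.end()) return NONE;
        return static_cast<Perm>(it->second[word_addr % 1024]);
    }

    // Every store is checked at word granularity; a failed check faults.
    bool check_write(uint64_t word_addr) const {
        return lookup(word_addr) == READ_WRITE;
    }
};
```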

Slides
Contact email: witchel@lcs.mit.edu, krste@mit.edu
Project webpage

Checker Processor Design

Christopher T. Weaver¹, Todd Austin²
¹University of Michigan/Intel, ²University of Michigan

Dynamic verification has been proposed as a method to reduce the burden of verification in increasingly complex designs by placing a simple checker processor in series with a high-performance complex core. The checker processor uses the speculative pre-execution of the complex core to keep its own design simple while still matching the core's throughput. Since the checker processor can detect and recover from errors, the burden of architectural correctness rests solely on the checker. Previous analysis showed that a 64-bit checker processor based on the Alpha ISA would occupy around 15mm² and operate at a frequency of approximately 300MHz. These estimates were taken from the synthesis of a one-stage, one-wide checker processor; the area was then scaled linearly by four to estimate the area of a four-wide checker. This talk will present more accurate results on the impact of scaling the checker wider, deeper, and taller.
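
A toy model of the dynamic-verification contract, under an invented two-operation ISA: the core's claimed result accompanies each committed instruction, and the checker re-executes it against its own architecturally correct register state before allowing commit.

```cpp
#include <array>
#include <cstdint>
#include <cstdio>

// A committed instruction in a toy ISA: the complex core's claimed result
// travels with the instruction so the checker can re-execute and compare.
struct Committed {
    char op;             // '+' or '-' in this toy ISA
    int src1, src2, dst; // register indices
    int64_t core_result; // value the complex core intends to commit
};

// One-wide checker: architecturally correct state lives here. A value is
// committed only if the checker's re-execution agrees; otherwise the
// pipeline is flushed and restarted from the checker's state.
struct Checker {
    std::array<int64_t, 32> regs{};

    bool check_and_commit(const Committed& c) {
        int64_t expected = (c.op == '+') ? regs[c.src1] + regs[c.src2]
                                         : regs[c.src1] - regs[c.src2];
        if (expected != c.core_result) {
            std::printf("core error detected: recover from checker state\n");
            return false;            // trigger flush + restart
        }
        regs[c.dst] = expected;      // commit the verified value
        return true;
    }
};
```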

Contact email: christopher.t.weaver@intel.com


10:30am - 11am: Break
11am - noon: Memory Systems, Simulation, and I/O (Session Chair: Iris Bahar)

The Alpha 21364 Memory System

Peter Bannon, Dave Webb, Brian Lilly, Maurice Steinman
HP

The Alpha 21364 integrates a high-performance RISC core with an L2 cache, a Direct RDRAM memory controller, and a router. The entire design operates at 1.2 GHz, providing exceptional performance for applications requiring memory systems with high capacity, low latency, and high bandwidth. The talk will give a brief overview of the memory system along with performance measurements.

Slides
Contact email: pbannon@hp.com

Bridging the Compiler-Simulator Gap

Eliot Moss and Chip Weems
Department of Computer Science
University of Massachusetts Amherst

Currently it is hard to make good performance predictions for proposed architectural features, especially for applications written in dynamically compiled languages like Java. To achieve the best performance, modern architectures require matched compilers, tuned to the architecture. One also needs a timing-accurate simulator for the proposed architecture. Building both a tuned compiler and an accurate, efficient cycle-level simulator is labor-intensive and error-prone.
We aim to build tools that generate competent optimizing compiler "back-ends" and reasonably efficient timing simulators from shared descriptions of: CPU instruction set syntax and semantics; hardware pipelines annotated with semantics; and the desired instrumentation. Our compiler framework is Jikes RVM, a retargetable Java VM; our simulation framework will be like Schnarr's but implemented in Java, exploiting run-time code generation and dynamic adaptive optimization. Automatically generated architecture-dependent parts will plug in to the frameworks. The goal is push-button construction of matching compilers and simulators, greatly speeding architectural exploration.
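
As a toy of the shared-description idea (names and structure here are illustrative, not the project's actual format): one table pairing assembly syntax with executable semantics can drive an interpreter-style simulator directly, and a back-end generator could walk the same table to emit instruction selectors.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Toy "shared description": each opcode carries its assembly syntax and an
// executable semantic function. A simulator interprets the semantics
// directly; a compiler back-end generator would walk the same table.
struct InstrDesc {
    std::string syntax;                                  // e.g. "add rD,rA,rB"
    std::function<int64_t(int64_t, int64_t)> semantics;  // register semantics
};

std::map<std::string, InstrDesc> isa = {
    {"add", {"add rD,rA,rB", [](int64_t a, int64_t b) { return a + b; }}},
    {"sub", {"sub rD,rA,rB", [](int64_t a, int64_t b) { return a - b; }}},
    {"mul", {"mul rD,rA,rB", [](int64_t a, int64_t b) { return a * b; }}},
};

// Simulator side: execute one decoded instruction against a register file.
void execute(const std::string& op, int64_t regs[], int d, int a, int b) {
    regs[d] = isa.at(op).semantics(regs[a], regs[b]);
}
```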

Slides
Contact email: moss@cs.umass.edu
Project webpage

Profile-Guided Data Partitioning to Achieve High Performance I/O

Yijian Wang, David Kaeli
Northeastern University

In the field of high-performance parallel computing, we are seeing a shift away from applications that are starved for CPU cycles toward applications that are both CPU-bound and I/O-bound. The performance of these systems cannot be scaled simply by increasing the number of compute nodes; we must also parallelize the I/O streams to achieve good speedup. This issue grows in importance as the gap between CPU speed and disk speed continues to widen. While parallel I/O systems provide users with environments where large datasets can be shared among multiple processors, the performance of I/O-intensive applications depends heavily on how the data is accessed and where the data is physically located on disk. To achieve high-performance parallel I/O, operations should be parallelized both at the application level and at the disk level. We need to understand data access patterns so that we can optimize the data layout to better match them. In this paper, we present a characterization of parallel I/O access patterns obtained by generating parallel I/O traces. We then use these traces to determine how best to partition the data across multiple disks to achieve high I/O parallelism. In future work, we would like to obtain this information from a statically generated dataflow graph, which would let us perform coarse-grained disk tiling across many disks.
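
One simple way to turn profiled access counts into a partition is a greedy load-balancing pass, sketched below. This is a generic heuristic offered for illustration, not necessarily the placement algorithm used in the paper:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct BlockProfile { int block_id; uint64_t access_count; };

// Greedy partitioner: sort blocks by profiled access count, then place
// each on the currently least-loaded disk. This balances expected I/O
// across disks so requests can proceed in parallel. Assumes block_id
// values are 0..n-1.
std::vector<int> partition(std::vector<BlockProfile> blocks, int num_disks) {
    std::sort(blocks.begin(), blocks.end(),
              [](const BlockProfile& a, const BlockProfile& b) {
                  return a.access_count > b.access_count;
              });
    std::vector<uint64_t> load(num_disks, 0);
    std::vector<int> disk_of(blocks.size());
    for (const auto& b : blocks) {
        int d = static_cast<int>(
            std::min_element(load.begin(), load.end()) - load.begin());
        disk_of[b.block_id] = d;       // hottest blocks placed first
        load[d] += b.access_count;     // track expected load per disk
    }
    return disk_of;
}
```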

Slides
Contact email: kaeli@ece.neu.edu, yiwang@ece.neu.edu
Project webpage


Noon - 1:30pm: Buffet Lunch
1:30pm - 2:30pm: Processor Organization (Session Chair: Csaba A. Moritz)

Stream Algorithms: Decoupled Systolic Computation on Tiled Microarchitectures

Henry Hoffmann, Volker Strumpen, Anant Agarwal
MIT Laboratory for Computer Science

Communication-exposed, programmable microarchitectures, including TRIPS, Smart Memories, and Raw, offer an opportunity to schedule instruction execution and data movement explicitly. This paper proposes stream algorithms, which, together with a decoupled systolic architecture, provide an excellent match for the physical and technological constraints of single-chip tiled architectures. Stream algorithms enable programmed systolic computations for different problem sizes without incurring the cost of memory accesses: we decouple memory accesses from computation and move them off the critical path. By structuring computations in systolic phases and deferring memory accesses to dedicated memory processors, stream algorithms can solve many regular problems of varying sizes on a constant-sized tiled array. Counterintuitively, the compute efficiency of stream algorithms increases as we increase the number of processing elements. In particular, we show that the compute efficiency of stream algorithms approaches 100% asymptotically, that is, for large numbers of processors and appropriately scaled problem sizes.
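
A single-threaded toy of the decoupling idea (ours; the paper's stream algorithms target grids of tiles running concurrently): a "memory processor" loop runs ahead, staging operands into a small on-chip FIFO, so the compute loop consumes only buffered data and memory latency stays off its critical path.

```cpp
#include <deque>
#include <utility>
#include <vector>

// Toy decoupled computation over two equal-length vectors. The deque
// models an on-chip FIFO between a memory processor and a compute tile.
double dot_product(const std::vector<double>& a, const std::vector<double>& b) {
    std::deque<std::pair<double, double>> fifo;
    const size_t prefetch_depth = 8;   // assumed on-chip buffer size
    size_t fetched = 0;
    double sum = 0.0;
    for (size_t consumed = 0; consumed < a.size(); ++consumed) {
        // Memory processor: run ahead of the compute tile, filling the FIFO.
        while (fetched < a.size() && fifo.size() < prefetch_depth) {
            fifo.push_back({a[fetched], b[fetched]});
            ++fetched;
        }
        // Compute tile: systolic-style consume, no memory access here.
        sum += fifo.front().first * fifo.front().second;
        fifo.pop_front();
    }
    return sum;
}
```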

Slides
Contact: Henry Hoffmann

The Vector-Thread Architecture

Ronny Krashinsky, Chris Batten, and Krste Asanovic
MIT Laboratory for Computer Science

Traditional vector architectures provide a cost-effective and energy-efficient mechanism to exploit data parallelism but are only effective on vectorizable loop nests. The vector-thread architecture enhances a vector unit to support many forms of fine-grain parallelism, from pure SIMD data parallelism to pure MIMD thread-level parallelism, while retaining most of the cost and power advantages of a traditional vector machine.

Slides
Contact email: ronny@mit.edu, krste@mit.edu
Project webpage

Characterization and Evaluation of Hardware Loop Unrolling

Marcos R. de Alba, David R. Kaeli
Northeastern University

General-purpose programs contain loops that cannot be optimized by a compiler. When the body of a loop contains conditional control flow instructions, or when the loop is controlled by a non-constant induction variable, the compiler cannot unroll the loop. We have found that the compiler cannot unroll more than 40-50% of the static loops in the set of programs studied. To optimize the execution of these loops, we have to detect loop behavior at runtime. We propose that loops that cannot be optimized by the compiler be detected and unrolled using a hardware-based unrolling mechanism. Our design exploits the temporal locality found in loops to provide a higher degree of instruction-level parallelism. Using hardware-based unrolling, multiple basic blocks can be retrieved from a dedicated loop cache, reducing the number of instruction cache and memory requests while providing a large window of instructions for speculative execution. Before designing our hardware mechanism, we characterized all loops that cannot be optimized by the compiler, and used these characteristics to design a hardware mechanism that unrolls loop iterations dynamically. To drive our prediction mechanism, we correlate the pattern of branch outcomes that lead up to a loop with the path of branches executed within the loop body. We capture a history of the sequence of paths followed in a loop to predict the entire loop visit; we can then unroll entire loop bodies without the aid of the compiler. To characterize loop execution and evaluate the effectiveness of the proposed mechanism, we study three benchmark suites: MediaBench, MiBench, and the loop-intensive subset of SPECint2000. Our results show that hardware-based loop unrolling can be performed dynamically and provides new levels of instruction-level parallelism. We have found that we can consistently increase IPC using this mechanism, achieving maximum speedups greater than 20%.
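
A sketch of the correlation structure this suggests (our illustration; the table organization and indexing are assumptions): the branch history at loop entry selects the path sequence observed on the last visit with that same history, which tells the hardware how to unroll from the loop cache.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical loop-path predictor: keyed by (loop start PC, global branch
// history at loop entry), it stores the per-iteration path signatures seen
// during the last visit with that history, letting the hardware unroll
// whole iterations speculatively.
struct LoopPathPredictor {
    std::map<std::pair<uint64_t, uint32_t>, std::vector<uint32_t>> table;

    // Record the path taken by each iteration of one completed loop visit.
    void train(uint64_t loop_pc, uint32_t entry_history,
               const std::vector<uint32_t>& iteration_paths) {
        table[{loop_pc, entry_history}] = iteration_paths;
    }

    // Predict the entire visit: which path each iteration will follow, and
    // implicitly how many iterations to unroll from the loop cache.
    const std::vector<uint32_t>* predict(uint64_t loop_pc,
                                         uint32_t entry_history) const {
        auto it = table.find({loop_pc, entry_history});
        return it == table.end() ? nullptr : &it->second;
    }
};
```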

Slides
Contact email: kaeli@ece.neu.edu, mdealba@ece.neu.edu
Project webpage


2:30pm - 3:00pm: Break
3:00pm - 4:00pm: Power (Session Chair: Saman Amarasinghe)

Power-Aware System on a Chip

Andrew Laffely, Jian Liang, Russell Tessier, Csaba Andras Moritz, Wayne Burleson
University of Massachusetts Amherst

Many circuit-level techniques have been aimed at power and performance improvement for VLSI systems. One of the overriding problems facing the SoC design community is how to deploy these various circuits to effectively control and coordinate power and performance across the whole SoC. The approach proposed here exploits the streaming nature of the intended applications to develop a system-wide, communication-driven methodology for adaptive power and performance management in SoCs. This approach monitors global interconnect resources to find points of congestion or slack. The clock and voltage of congested resources are increased, while those with slack can be decreased. In addition, data rates at critical points in the system can be monitored and used globally to enforce user-defined standards. This talk presents our methodology, test bench, and promising preliminary results.
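
A minimal sketch of such a control loop, with the thresholds and the discrete clock/voltage ladder assumed for illustration:

```cpp
#include <vector>

// Per-tile state: sampled interconnect queue occupancy (0.0 to 1.0) and
// the tile's current position on a discrete frequency/voltage ladder.
struct TileState { double queue_occupancy; int freq_level; };

// Congested tiles are sped up to relieve backpressure; tiles with slack
// are slowed down to save power. Thresholds here are assumptions.
void adjust(std::vector<TileState>& tiles, int min_level, int max_level) {
    const double congested = 0.75, slack = 0.25;
    for (auto& t : tiles) {
        if (t.queue_occupancy > congested && t.freq_level < max_level)
            ++t.freq_level;   // raise clock/voltage at a congestion point
        else if (t.queue_occupancy < slack && t.freq_level > min_level)
            --t.freq_level;   // lower clock/voltage to harvest slack
    }
}
```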

Slides
Contact email: alaffely@ecs.umass.edu
Project aSoC home page

Combining Static and Runtime IPC Predictions for Chip-wide Energy Efficiency

Saurabh Chheda, Osman Unsal, Csaba Andras Moritz
University of Massachusetts, Amherst

This paper presents a suite of compiler, architecture, and hybrid compiler-architecture techniques for chip-wide energy reduction in the processor. We combine statically estimated IPC information, at various granularities, with dynamically predicted IPC information to adaptively adjust voltage and speed, and to throttle processor resources in response to changes in ILP. We show that IPC-based voltage scaling is an efficient way to reduce energy while meeting target performance when there is slack. We have also found that the target performance is often not met without cumulative IPC tracking. Both the static and the hybrid compiler-architecture IPC-based resource throttling schemes can save up to 14% of processor energy with less than 5% IPC degradation; the lowest IPC degradation of all the evaluated schemes is obtained with the static-only approach. The proposed IPC-based adaptive voltage scaling schemes can reduce energy by more than 50%, depending on the target performance chosen.
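
A sketch of the per-interval decision step, assuming a blended static/dynamic IPC prediction and a discrete voltage ladder; the blending weights and thresholds are our illustration, not the paper's:

```cpp
#include <algorithm>

// Combine the compiler's static IPC estimate for the upcoming region with
// the dynamically measured IPC, then step voltage/frequency so the target
// performance is still met. Returns the new voltage level.
int choose_voltage_level(double static_ipc_estimate, double measured_ipc,
                         double target_ipc, int cur_level,
                         int min_level, int max_level) {
    // Equal weights are an assumption for illustration.
    double predicted_ipc = 0.5 * static_ipc_estimate + 0.5 * measured_ipc;
    if (predicted_ipc > 1.1 * target_ipc)     // slack: slow down, save energy
        return std::max(cur_level - 1, min_level);
    if (predicted_ipc < target_ipc)           // behind target: speed back up
        return std::min(cur_level + 1, max_level);
    return cur_level;
}
```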

Slides
Contact email: schheda@ecs.umass.edu
Project webpage

Combining Software and Hardware Monitoring for Improved Power and Performance Tuning

Eric Chi¹, A. Michael Salem¹, R. Iris Bahar¹, Richard Weiss²
¹Brown University, ²Hampshire College

By anticipating when resources will be idle, it is possible to reconfigure the hardware to reduce power consumption without significantly reducing performance. This requires predicting the application's resource requirements. This paper explores combining hardware monitoring with software profiling to implement a control policy that achieves lower power utilization than using either technique alone. We demonstrate the potential for this approach in two ways. First, we show that anticipating stalls due to critical load misses in the L2 cache can enable fetch halting: halting instruction fetch just as a stalling instruction becomes ready makes it possible to disable large parts of the instruction pipeline. Second, we compare hardware monitoring and software profiling of IPC for code blocks and show that they capture different information.
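
A sketch of the fetch-halting mechanism as we read it (structure and names are assumptions): profiling marks the loads whose L2 misses reliably stall the pipeline, and a miss on one of those loads gates instruction fetch until the data returns.

```cpp
#include <cstdint>
#include <unordered_set>

struct FetchController {
    // Loads identified by software profiling as critical stall sources.
    std::unordered_set<uint64_t> critical_load_pcs;
    bool fetch_halted = false;

    // On an L2 miss by a profiled critical load, halt the front end so
    // large parts of the instruction pipeline can be disabled for power.
    void on_l2_miss(uint64_t load_pc) {
        if (critical_load_pcs.count(load_pc))
            fetch_halted = true;
    }

    // Re-enable fetch when the miss data returns and the stall clears.
    void on_miss_return() {
        fetch_halted = false;
    }
};
```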

Slides
Contact email: iris_bahar@brown.edu


4:00pm - 4:30pm: Break
4:30pm - 5:30pm: Compiler Technology (Session Chair: Pete Bannon)

Using the Compiler to Improve Cache Replacement Decisions

Zhenlin Wang¹, Kathryn S. McKinley², Arnold L. Rosenberg¹, Charles C. Weems¹
¹University of Massachusetts Amherst, ²University of Texas at Austin

To improve replacement decisions in set-associative caches, we develop a new set of compiler algorithms that predict which data will and will not be reused and provide these hints to the architecture. We prove that the hints either match or improve hit rates over LRU. We describe a practical one-bit cache-line tag implementation of our algorithm, called evict-me. On a cache replacement, the architecture will replace a line for which the evict-me bit is set, or if none is set, it will use the LRU bits. We implement our compiler analysis and its output in the Scale compiler. On a variety of scientific programs, using the evict-me algorithm in both the level 1 and level 2 caches improves simulated cycle times by up to 34% over the LRU policy by increasing hit rates. In addition, simple hardware prefetching and evict-me work together to further improve performance.
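
The victim-selection logic is simple enough to sketch directly from the description above (set organization simplified; a larger lru_stack_pos means less recently used):

```cpp
#include <cstdint>
#include <vector>

// One cache set's worth of lines: each carries the compiler-set one-bit
// evict-me hint alongside the usual tag and LRU state.
struct CacheLine { uint64_t tag; bool evict_me; int lru_stack_pos; };

// On a miss: replace a line whose evict-me bit is set (the compiler has
// predicted its data will not be reused); if none is marked, fall back
// to ordinary LRU replacement.
int choose_victim(const std::vector<CacheLine>& set) {
    for (int i = 0; i < (int)set.size(); ++i)
        if (set[i].evict_me)
            return i;                       // compiler says: dead data
    int victim = 0;
    for (int i = 1; i < (int)set.size(); ++i)
        if (set[i].lru_stack_pos > set[victim].lru_stack_pos)
            victim = i;                     // otherwise evict the LRU line
    return victim;
}
```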

Slides
Contact email: zlwang@cs.umass.edu

Compiler-Enabled Cache Management for Pointer-Intensive Programs

Yao Guo, Saurabh Chheda, Csaba Andras Moritz
University of Massachusetts, Amherst

Recently proposed compiler-enabled memory system designs have been successful in reducing energy consumption as well as improving cache performance. A major challenge in these systems is their applicability to complex pointer-intensive programs. The problems stem from complexity issues and other limitations of traditional precise pointer analysis, which must be conservative to guarantee compile-time provable correctness; this happens, for example, when constructs such as pointer-based calls, recursion, or library calls appear in the program. To overcome these limitations, we propose speculative points-to and distance analysis. The technique is speculative in the sense that the possible values for each pointer access are statically determined based on their runtime likelihood of occurrence. To guarantee correctness, the information provided is checked at runtime with simple microarchitectural support. We show promising preliminary results from incorporating this analysis into a compiler-enabled, energy-efficient memory system framework.
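
A minimal sketch of the speculate-and-check contract (our illustration; the actual microarchitectural check is more general): the compiler emits a fast path under the assumption that the likely points-to target holds, and a runtime comparison falls back to the safe path on mispeculation.

```cpp
#include <cstdio>

// The compiler has statically determined that p most likely points to
// likely_target, and has precomputed/optimized the access under that
// assumption. The runtime check preserves correctness when it is wrong.
int load_field(int* p, int* likely_target, int precomputed_value) {
    if (p == likely_target)
        return precomputed_value;   // fast path: speculation held
    std::printf("speculation failed: taking safe path\n");
    return *p;                      // safe fallback: original access
}
```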

Slides
Contact email: yaoguo@ecs.umass.edu

Meta Optimization: Improving Compiler Heuristics with Machine Learning

Mark Stephenson¹, Una-May O'Reilly², Martin Martin², Saman Amarasinghe¹
¹MIT Laboratory for Computer Science, ²MIT Artificial Intelligence Laboratory

Compiler writers have crafted many heuristics over the years to (approximately) solve NP-hard problems efficiently. Finding a heuristic that performs well on a broad range of applications is a tedious and difficult process. This paper introduces Meta Optimization, a methodology for automatically fine-tuning compiler heuristics. Meta Optimization uses machine-learning techniques to automatically search an optimization's solution space. We implemented Meta Optimization on top of Trimaran to test its efficacy. By "evolving" Trimaran's hyperblock selection optimization for a particular benchmark, our system achieves impressive speedups. Application-specific heuristics obtain an average speedup of 23% (up to 43%) for the applications in our suite. Furthermore, by evolving a compiler's heuristic over several benchmarks, we can create effective, general-purpose compilers. The best general-purpose heuristic our system found improved Trimaran's hyperblock selection algorithm by an average of 25% on our training set, and 9% on a completely unrelated test set. We further test the applicability of our system on Trimaran's priority-based coloring register allocator. For this well-studied optimization we were able to specialize the compiler for individual applications, achieving an average speedup of 6%.
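
A much-simplified sketch of the search loop: the real system evolves whole priority-function expressions with genetic programming, whereas this toy evolves only the weights of a fixed-form heuristic, with the fitness function standing in for "rebuild the benchmark with these weights, run it, measure speedup".

```cpp
#include <algorithm>
#include <random>
#include <utility>
#include <vector>

// Stand-in fitness: a toy surface peaked at weight 0.3. In the real system
// this is a full compile-and-run of the benchmark under the candidate
// heuristic, reporting the measured speedup.
double benchmark_speedup(const std::vector<double>& w) {
    double s = 1.0;
    for (double x : w) s -= (x - 0.3) * (x - 0.3);
    return s;
}

std::vector<double> evolve(int generations, int pop_size, int num_weights) {
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> init(-1.0, 1.0);
    std::normal_distribution<double> mutate(0.0, 0.1);

    std::vector<std::vector<double>> pop(pop_size,
                                         std::vector<double>(num_weights));
    for (auto& w : pop)
        for (auto& x : w) x = init(rng);

    for (int g = 0; g < generations; ++g) {
        // Score every member (one compile+run each), rank best-first.
        std::vector<std::pair<double, int>> ranked;
        for (int i = 0; i < pop_size; ++i)
            ranked.push_back({benchmark_speedup(pop[i]), i});
        std::sort(ranked.rbegin(), ranked.rend());
        // Keep the fitter half; refill with mutated copies of survivors.
        std::vector<std::vector<double>> next(pop_size);
        for (int i = 0; i < pop_size; ++i) {
            next[i] = pop[ranked[i % (pop_size / 2)].second];
            if (i >= pop_size / 2)
                for (auto& x : next[i]) x += mutate(rng);
        }
        pop = std::move(next);
    }
    return pop.front();   // best heuristic parameters found
}
```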

Slides
Contact email: mstephen@mit.edu
Project webpage


5:30pm - 5:40pm: Concluding Remarks