The Scale Vector-Thread Architecture

Ronny Krashinsky, Christopher Batten, Mark Hampton, Steve Gerding, Jared Casper, Krste Asanovic

Parallelism and locality are the key application characteristics exploited by computer architects to make productive use of increasing transistor counts while coping with wire delay and power dissipation. Conventional sequential ISAs provide minimal support for encoding parallelism or locality, so high-performance implementations are forced to devote considerable area and power to on-chip structures that extract parallelism or that support arbitrary global communication. These large area and power overheads are justified by the demand for even small improvements in performance on legacy codes for popular ISAs. Many important applications have abundant parallelism, however, with dependencies and communication patterns that can be statically determined. ISAs that expose more parallelism reduce the need for area- and power-intensive structures that extract it dynamically. Similarly, ISAs that allow locality to be expressed reduce the need for long-range communication and complex interconnect. The challenge is to develop an efficient encoding of an application's parallel dependency graph and to reduce the area and power consumption of the microarchitecture that will execute this dependency graph.

The vector-thread (VT) architectural paradigm unifies the vector and multithreaded execution models. VT allows large amounts of structured parallelism to be compactly encoded in a form that enables a simple microarchitecture to attain high performance at low power by avoiding complex control and datapath structures and by reducing activity on long wires. The VT programmer's model extends a conventional scalar control processor with an array of slave virtual processors (VPs). VPs execute strings of RISC-like instructions packaged into atomic instruction blocks (AIBs). To execute data-parallel code, the control processor broadcasts AIBs to all the slave VPs. To execute thread-parallel code, each VP directs its own control flow by fetching its own AIBs. Implementations of the VT architecture can also exploit instruction-level parallelism within AIBs.
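The two fetch mechanisms above can be illustrated with a toy software model. This is only a sketch of the execution model, not the Scale ISA: the names `VirtualProcessor`, `vector_fetch`, and `thread_fetch`, and the representation of an AIB as a list of operations, are all illustrative assumptions.

```python
# Hypothetical sketch of vector-thread execution (not the actual Scale
# interface). An AIB is modeled as a list of RISC-like ops; each op is a
# function that reads/writes one VP's registers.

class VirtualProcessor:
    """A slave VP: executes AIBs against its own register state and may
    request its own next AIB (a thread-fetch)."""
    def __init__(self, index):
        self.index = index
        self.regs = {}
        self.next_aib = None          # set by a thread-fetch op, if any

    def execute(self, aib):
        for op in aib:                # an AIB executes atomically here
            op(self)

def vector_fetch(vps, aib):
    """Control processor broadcasts one AIB to every VP (data-parallel)."""
    for vp in vps:
        vp.execute(aib)

def thread_fetch_loop(vps):
    """Thread-parallel mode: each VP keeps executing whatever AIB it
    fetched for itself, so every VP follows its own control flow."""
    for vp in vps:
        while vp.next_aib is not None:
            aib, vp.next_aib = vp.next_aib, None
            vp.execute(aib)

# Data-parallel use: the control processor broadcasts one AIB that
# doubles each VP's element.
vps = [VirtualProcessor(i) for i in range(4)]
for vp in vps:
    vp.regs["x"] = vp.index           # pretend a vector-load filled x
double_aib = [lambda vp: vp.regs.update(x=vp.regs["x"] * 2)]
vector_fetch(vps, double_aib)
print([vp.regs["x"] for vp in vps])   # [0, 2, 4, 6]
```

In a real implementation the broadcast AIB would be fetched once and executed by each VP in turn on a shared lane, which is the source of the instruction-locality benefit described below.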

In this way, the VT architecture supports a modeless intermingling of all forms of application parallelism. This flexibility provides new ways to parallelize codes that are difficult to vectorize or that incur excessive synchronization costs when threaded. Instruction locality is improved by allowing common code to be factored out and executed only once on the control processor, and by executing the same AIB multiple times on each VP in turn. Data locality is improved because most operand communication is kept within an individual VP.
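As a concrete case of code that is hard to vectorize, consider a loop whose per-element trip count depends on the data. Under VT, each VP can run the inner while-loop under its own control flow, with no lockstep execution or global synchronization. The sketch below is purely illustrative; `vp_thread` and the halving-step computation are hypothetical stand-ins, not Scale code.

```python
# Illustrative only: a data-dependent inner loop, awkward for classic
# vector execution, maps naturally onto VPs that fetch their own AIBs.

def vp_thread(x):
    """One VP's work: count how many halvings reduce x to 1. The trip
    count varies per element, so each VP simply runs to completion
    independently (thread-parallel execution)."""
    steps = 0
    while x > 1:
        x //= 2
        steps += 1
    return steps

inputs = [1, 8, 5, 100]               # one element per VP
results = [vp_thread(x) for x in inputs]
print(results)                         # [0, 3, 2, 6]
```

A pure vector machine would have to pad every element out to the longest trip count (or mask), whereas here each VP's operands stay local to that VP for the whole loop.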

We have developed a prototype processor, Scale, which is an instantiation of the vector-thread architectural paradigm designed for low-power and high-performance embedded systems. As transistors have become cheaper and faster, embedded applications have evolved from simple control functions to cellphones that run multitasking networked operating systems with realtime video, three-dimensional graphics, and dynamic compilation of garbage-collected languages. Many other embedded applications require sophisticated high-performance information processing, including streaming media devices, network routers, and wireless base stations. The Scale architecture is intended to replace ad hoc collections of microprocessors, DSPs, FPGAs, and ASICs with a single hardware substrate that supports a unified high-level programming environment and provides performance and energy efficiency competitive with custom silicon. We taped out the Scale chip on October 15, 2006.

Die plot of the Scale chip.

Publications

[1] "The Vector-Thread Architecture", Ronny Krashinsky, Christopher Batten, Mark Hampton, Steven Gerding, Brian Pharris, Jared Casper, and Krste Asanovic, 31st International Symposium on Computer Architecture (ISCA-31), Munich, Germany, June 2004. (PDF paper, PDF slides, PPT slides)
[2] "The Vector-Thread Architecture", Ronny Krashinsky, Christopher Batten, Mark Hampton, Steven Gerding, Brian Pharris, Jared Casper, and Krste Asanovic, IEEE Micro Special Issue: Micro's Top Picks from Microarchitecture Conferences, November/December 2004. (PDF paper)
[3] "Cache Refill/Access Decoupling for Vector Machines", Christopher Batten, Ronny Krashinsky, Steve Gerding, and Krste Asanovic, 37th International Symposium on Microarchitecture (MICRO-37), Portland, OR, December 2004. (PDF paper, PDF slides, PDF slides+notes)

Funding

We gratefully acknowledge the past and present sponsors of this work, including DARPA, NSF, CMI, Infineon, and Intel.