Prototyping the Scale Vector-Thread Processor

Ronny Krashinsky, Christopher Batten, and Krste Asanovic
MIT

This talk will present the Scale vector-thread processor, a complexity-effective solution for embedded computing. Scale's novel vector-thread (VT) architecture supports both efficient vector processing and highly multithreaded processing. A VT architecture seamlessly intermingles vector and multithreaded computation to flexibly and compactly encode application parallelism and locality. In a VT machine, a control processor interacts with a parallel array of lanes. The control processor can use the lanes as slaves for data-parallel computation, but the same lanes can also fetch instructions and operate independently as a highly multithreaded compute engine. VT allows arbitrary interleaving of these vector and threaded control mechanisms at a fine granularity. For example, a vector-load can be followed by a vector-fetch that launches 100 threads, each of which executes 100 instructions.
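The interleaving described above can be pictured with a toy software model. This is only a sketch under stated assumptions, not Scale's instruction set: the function names (`vector_load`, `vector_fetch`, `collatz_step`) and the dictionary representation of a virtual processor are hypothetical, and the model runs the threads one after another, whereas real lanes would run them in parallel. It shows a control processor issuing a vector-load, then a vector-fetch that starts every thread on the same block, after which each thread keeps fetching its own next block for a data-dependent number of iterations.

```python
# Toy model of the vector-thread (VT) idea. All names here are hypothetical
# and chosen for illustration; they are not Scale's actual commands.

def vector_load(memory, base, vlen):
    """Control-processor command: hand each virtual processor (VP) one element."""
    return [{"x": memory[base + i], "steps": 0} for i in range(vlen)]

def vector_fetch(vps, block):
    """Control-processor command: start every VP on the same instruction block.

    A block returns the next block the VP wants to fetch on its own
    (threaded control), or None when that VP's thread is finished.
    This model runs VPs sequentially; hardware lanes would overlap them.
    """
    for vp in vps:
        next_block = block
        while next_block is not None:
            next_block = next_block(vp)

def collatz_step(vp):
    """Data-dependent loop body: each VP iterates a different number of times."""
    if vp["x"] == 1:
        return None  # this thread is done
    vp["x"] = vp["x"] // 2 if vp["x"] % 2 == 0 else 3 * vp["x"] + 1
    vp["steps"] += 1
    return collatz_step  # the thread fetches its own next block

memory = [1, 2, 3, 6, 7, 8]
vps = vector_load(memory, base=0, vlen=len(memory))  # vector control
vector_fetch(vps, collatz_step)                      # launches one thread per VP
steps = [vp["steps"] for vp in vps]
print(steps)
```

The loop body was picked only because its trip count varies per element, which is the kind of control flow that pure vector execution handles poorly and per-thread fetching handles naturally.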

Scale exploits the parallelism and locality exposed by VT to provide high performance with low power and small area. In our prior work, we presented preliminary descriptions of the Scale architecture [Krashinsky et al., ISCA 2004; Batten et al., MICRO 2004] and demonstrated through simulation that Scale provides competitive performance across a wide range of applications. On vectorizable image processing kernels (RGB-to-CMYK, RGB-to-YIQ, and high-pass grey-scale filter), Scale sustains 9.3-10.9 compute operations per cycle while simultaneously using unit-stride and segment vector-memory accesses to load and store 2.5-4.0 elements per cycle. For ADPCM speech decoding, a non-vectorizable kernel with cross-iteration loop dependencies, Scale exploits the available parallelism between loop iterations using decoupled lane execution and vector-loads to achieve 6.4 operations per cycle. For pointer-chasing code like Internet routing-table lookups, Scale uses fine-grained multithreaded execution to achieve 6.3 operations per cycle, while still using efficient vector-loads to feed the threads.

The Scale design includes a scalar control processor; a four-lane vector-thread unit with 16 decoupled execution clusters together with instruction fetch, load/store, and command management units; a vector-memory access unit with support for unit-stride, strided, and segment loads and stores; and a four-port, non-blocking, 32-way set-associative, 32 KB cache. The vector-thread unit supports up to 128 simultaneously active virtual processor threads. An automated and iterative design and verification flow enabled a performance-, power-, and area-efficient implementation with only two person-years of development effort. The design has been fabricated and is fully functional. This talk will present details of the Scale chip prototype and the design flow we used to build the chip.
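As a quick check on how these figures relate, the short sketch below divides the stated totals across the four lanes. Two things in it are assumptions rather than statements from this abstract: that clusters and virtual processors are split evenly across lanes, and that virtual processors are assigned to lanes round-robin (the helper `lane_of` is hypothetical).

```python
# Figures taken from the design description above.
NUM_LANES = 4
NUM_CLUSTERS = 16
MAX_VPS = 128

# Assumption: resources are divided evenly among the lanes.
clusters_per_lane = NUM_CLUSTERS // NUM_LANES  # 4 clusters in each lane
vps_per_lane = MAX_VPS // NUM_LANES            # 32 VP threads in each lane

def lane_of(vp_index):
    """Hypothetical round-robin striping of virtual processors onto lanes."""
    return vp_index % NUM_LANES

# Count how many of the 128 VPs land on each lane under that striping.
vps_on_lane = [0] * NUM_LANES
for vp in range(MAX_VPS):
    vps_on_lane[lane_of(vp)] += 1

print(clusters_per_lane, vps_per_lane, vps_on_lane)
```

Under these assumptions each lane time-multiplexes 32 virtual processor threads over its 4 clusters.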