
Today I’m going to talk about a research effort to create a scalable microprocessor.

This is work done in conjunction with MIT’s Laboratory for Computer Science.

In the last six years, our research group has created an architecture, a compiler, a chip, and a motherboard.

Then, we cut the silicon up into an array of 16 identical, programmable tiles.

Now, notice that blue line. That is the distance that a signal can travel over a wire in a single cycle.

The width of a tile should be slightly less than this, so that we can go through a small amount of logic

and all the way across a tile in a single cycle.

By construction, we know that the longest wire in the system is no longer than the length or width of a tile.
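To make the one-tile-per-cycle idea concrete, here is a minimal sketch of how many tile-to-tile hops separate two tiles. It assumes the 16 tiles are laid out as a 4 x 4 grid and that a value moves one tile per hop along rows and columns; the talk only says "an array of 16 tiles", so the layout, the tile numbering, and the function names are illustrative assumptions.

```python
# Assumption: 16 tiles arranged 4 x 4, numbered row by row from 0 to 15.
GRID_WIDTH = 4


def tile_coords(tile_id, width=GRID_WIDTH):
    """Return the (row, column) of a tile under the assumed numbering."""
    return divmod(tile_id, width)


def hop_count(src, dst, width=GRID_WIDTH):
    """Number of nearest-neighbor hops between two tiles (Manhattan distance)."""
    src_row, src_col = tile_coords(src, width)
    dst_row, dst_col = tile_coords(dst, width)
    return abs(src_row - dst_row) + abs(src_col - dst_col)


# Corner to opposite corner is the worst case in the assumed 4 x 4 grid.
worst_case = hop_count(0, 15)
print(worst_case)
```

Because no wire is longer than a tile, each hop covers a bounded distance, so the hop count is a rough stand-in for how far a value has to travel.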

Let’s look at what’s inside a tile. A tile is not a wimpy thing. It’s got an 8-stage, 32-bit, MIPS-style, single-issue, in-order compute processor, a 32 KB instruction memory, a 32 KB data cache, and a 4-stage pipelined single-precision floating-point unit.

Of course, we want to have the tiles work together and do useful work, so we’re also going to have a network

interface and the routers and wires.

Now that I’ve shown how we expose the wire resources, let me show you how we expose the pins.

It’s pretty simple. Routes off the edge of the chip are multiplexed down onto the pins. This gives us 14 channels of 7.2 Gb/s each, for a total of 201 Gb/s of bandwidth in and out of the chip.

We can hook up things like PCI buses, DRAMs, and antennas to these I/O ports. In fact, the problem has been finding I/O devices that can come even close to saturating a channel.
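The 201 Gb/s total can be checked with a little arithmetic. This sketch assumes each of the 14 channels carries 7.2 Gb/s in each direction, which is one reading of "in and out of the chip"; the constant and function names are made up for illustration.

```python
CHANNELS = 14
GBPS_PER_CHANNEL_PER_DIRECTION = 7.2
DIRECTIONS = 2  # assumption: 7.2 Gb/s in plus 7.2 Gb/s out on every channel


def aggregate_io_bandwidth_gbps(channels=CHANNELS,
                                per_channel=GBPS_PER_CHANNEL_PER_DIRECTION,
                                directions=DIRECTIONS):
    """Total pin bandwidth summed over all channels and both directions."""
    return channels * per_channel * directions


total_gbps = aggregate_io_bandwidth_gbps()
print(f"{total_gbps:.1f} Gb/s")
```

Under that assumption the product is about 201.6 Gb/s, which matches the quoted 201 Gb/s once truncated.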

Now, anyone can put a bunch of pins on a chip and get bandwidth. The key is that a tile can toggle a data pin just by sending a message there. That means that we have a mechanism for directly controlling the I/O resources of the chip.

This contrasts with a conventional microprocessor, which uses all of its pins solely to slosh cache lines back and forth.

On the left is a zoomed-in photograph of the Raw tile; on the right is the standard-cell placement. You can look at the proceedings to identify the different parts, but in particular, notice that you can see the 256 network wires on the upper left and right sides of the chip. You can also kind of see them on the top and bottom a little bit. The white regions on the standard-cell layout are where the network wires pass through. You can also see the RAMs (the very dark regions on the chip), the fuses for the RAMs (the white regions), and many other details.

Now, the static router is very useful, but it would be nice to have a tool to program it automatically. That’s what RawCC does. It takes a sequential C program and parallelizes it across the tiles. This process has a lot of similarities to circuit partitioning, placement, and routing. RawCC starts with the code and turns it into a dataflow graph. It partitions the dataflow graph into threads, and then assigns those threads to tiles.

Finally, it programs the static routers to route values between the tiles.
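The partition, place, and route steps just described can be sketched on a toy example. This is not RawCC's actual algorithm: the dataflow graph, the round-robin partitioner, the identity placement, and every function name here are hypothetical stand-ins chosen only to show the shape of the pipeline.

```python
from collections import defaultdict

# Toy dataflow graph for: t = a + b; u = c * d; v = t - u
# Each node maps to the list of nodes whose values it consumes.
DATAFLOW = {
    "a": [], "b": [], "c": [], "d": [],
    "t": ["a", "b"],
    "u": ["c", "d"],
    "v": ["t", "u"],
}


def partition(graph, num_threads):
    """Split nodes into threads round-robin (a placeholder heuristic)."""
    threads = defaultdict(list)
    for index, node in enumerate(graph):
        threads[index % num_threads].append(node)
    return dict(threads)


def place(threads):
    """Assign thread k to tile k (identity placement, for illustration)."""
    return {node: tile for tile, nodes in threads.items() for node in nodes}


def routes(graph, node_to_tile):
    """List (source_tile, destination_tile, value) for each cross-tile edge."""
    result = []
    for consumer, producers in graph.items():
        for producer in producers:
            src, dst = node_to_tile[producer], node_to_tile[consumer]
            if src != dst:
                result.append((src, dst, producer))
    return result


threads = partition(DATAFLOW, num_threads=2)
node_to_tile = place(threads)
static_routes = routes(DATAFLOW, node_to_tile)
print(static_routes)
```

Each tuple in `static_routes` is a value that must cross between tiles, which is the kind of transfer the static routers would be programmed to carry. A real compiler would pick the partition and placement to keep that list short.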


One of the nice things about Raw is that it scales. These three things are all independent of transistor count: the longest wire, the design complexity, and the verification complexity. You can see in the picture that with a process shrink, everything shrinks in proportion: the size of the tile and the length of the longest wire.

Additionally, the number of tiles, the network bandwidth, and the I/O bandwidth will scale directly with the physical quantity of resources that are actually available.

Finally, note that Raw is backwards-compatible. A program that runs on 2 tiles on the left will continue to run on 2 tiles on the right.


Here are some stats for the Raw processor.
