Continued increases in processor performance will come from explicit parallelism with careful management of locality and bandwidth. To exploit numerous fast transistors connected by slow wires, chips in 2010 will contain tens of processors, each with local memory, connected by a fast on-chip network. Demanding applications have sufficient parallelism to efficiently use such multiprocessor chips. Such chips will be limited by communication rather than arithmetic. Data and threads must be located to minimize latency due to slow on-chip wires, and to prevent limited off-chip bandwidth from becoming a bottleneck.
While acknowledging that architecture is only one part of the microprocessor performance equation, I believe that, at least in that arena, there is considerable promise for improvements over the next decade. In the near term, enhanced superscalar designs will continue to provide considerable gains. Those gains will be compounded by techniques like multithreading, and simultaneous multithreading in particular. In the slightly longer term, continued recapitulation of mainframe design history by microprocessors will provide enormous amounts of raw performance, primarily by invoking increased parallelism. But a significant challenge will be exploiting that parallelism across a broader range of applications.
Process scaling, SOI, and circuit innovations will deliver 10X in frequency, and increased parallelism another 10X, for a total performance boost of 100X. But the obstacles are many. We will learn to deal with clock frequencies of 7 GHz, but power will be a major issue, forcing innovations as laptops and handhelds drive technology. Gaining 10X from parallelism will require advances in distributed processors and algorithms. Extremely high-speed signalling will be the keystone that bridges the gap between the theoretical and practical limits of parallelism.
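The 100X figure above assumes the frequency and parallelism gains multiply. As a rough illustration of why the parallel 10X demands algorithmic advances, the sketch below applies Amdahl's law, which is a simplifying model introduced here for illustration and not part of the statement itself. The function names and the processor counts (16 and 64) are hypothetical choices.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Speedup under Amdahl's law for a given parallelizable fraction of work."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / processors)


def required_parallel_fraction(target_speedup: float, processors: int) -> float:
    """Fraction of work that must be parallel to reach target_speedup.

    Obtained by solving Amdahl's law for the parallel fraction.
    """
    if processors <= 1 or target_speedup > processors:
        raise ValueError("target speedup is unreachable with this many processors")
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / processors)


FREQUENCY_GAIN = 10.0    # from process scaling, SOI, and circuits (per the text)
PARALLEL_TARGET = 10.0   # from increased parallelism (per the text)

for n in (16, 64):  # hypothetical processor counts
    p = required_parallel_fraction(PARALLEL_TARGET, n)
    total = FREQUENCY_GAIN * amdahl_speedup(p, n)
    print(f"{n} processors: {p:.1%} of the work must be parallel, total gain {total:.0f}X")
```

Under this model, even with 64 processors more than 91% of the work must run in parallel to reach 10X, which is one way to read the call for advances in algorithms as well as hardware.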
In the next decade, compilers will contribute on the order of 10-20% each year to performance enhancement. They will also make healthy contributions to reducing design time and energy consumption. The improvements will be enabled by recent breakthroughs in code analysis, code transformations, and debugging of optimized code. This will be facilitated by compiler-friendly architectures such as EPIC. A higher rate of improvement will be seen in embedded applications than in more traditional areas.
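To put the annual figure in perspective, the following sketch compounds a 10-20% yearly gain over ten years. It assumes the yearly improvements multiply, which the statement does not say explicitly; the function name is hypothetical.

```python
def compounded_gain(annual_rate: float, years: int) -> float:
    """Cumulative speedup if each year's gain multiplies the previous total."""
    return (1.0 + annual_rate) ** years


for rate in (0.10, 0.20):
    gain = compounded_gain(rate, 10)
    print(f"{rate:.0%} per year over 10 years -> {gain:.2f}x")
# prints:
# 10% per year over 10 years -> 2.59x
# 20% per year over 10 years -> 6.19x
```

Under that assumption, the compiler contribution over the decade would fall roughly between 2.6X and 6.2X.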
For the next decade, there will not be a huge advance in speed at the transistor and interconnect level. The CR of transistors and interconnects could grow more rapidly than the current drive of transistors. Perhaps developing transistor and interconnect technology in concert with new system architectures and assembly technologies will become a solution to the problem.
Due to process technology constraints, the industry will gradually transition from pushing single-processor general purpose performance to both multiprocessor and special-purpose processing. Process technology and circuit technology will continue to deliver increases in frequency, but not at the rate of the previous decade. Die size is no longer going to be limited by equipment or manufacturing cost, but rather by power. To date the approach has been to lower voltage with each process generation. But as voltage is lowered, leakage current and energy increase, contributing to higher power. And the problems extend beyond power dissipation to power delivery/distribution and increasing power density. Microarchitecture techniques that have been used to increase performance have exacerbated the power problem. We must develop new techniques that are more power-efficient, and look beyond single-thread, general purpose performance to address the new challenges.
The next decade will see substantial increases in processor performance on many fronts. The traditional levers of frequency and instruction-level parallelism can continue to be exploited. Power consumption and interconnect delay make these gains somewhat more difficult, but there is plenty of headroom, and the march of technology will continue to offer tools to increase performance. The problems of power and interconnect delay will continue to favor gains in frequency over gains in instruction-level parallelism. I/O and memory bandwidth can also be significantly increased, and the huge increase in available transistors will, in the next decade, allow a revolution in memory bandwidth and latency, providing significant increases in performance. Compilers, compiler directive instructions, and specialized datapath operations will all bring performance gains in certain application spaces. Finally, thread-level parallelism will become commonplace over the next decade as naturally parallel applications, tools, multiprocessors, and simultaneous multithreading processors become ubiquitous.