Hyperthreads conserve hardware
Deborah T. Marr, CPU Architect, Intel Corp., Hillsboro, Ore.
1/9/2002 7:50 AM EST
The Internet is powered by ever-speedier server systems whose processors are far faster than those of just a few years ago. But the techniques used to achieve those performance gains (superscalar execution, superpipelining, out-of-order execution and branch prediction) cost a great deal in transistors and power consumption.
In fact, chip size and power are increasing at rates greater than processor performance. So processor architects are looking for ways to reverse this trend, to improve performance at a greater rate than chip size and power.
A look at software trends reveals that server applications consist of multiple threads or processes that can be executed in parallel. Online transaction processing and Web services have an abundance of software threads that can be scheduled and executed simultaneously for faster performance. Intel architects have been looking to exploit this so-called "thread-level parallelism" to get a better ratio of performance to chip size and power.
One technique is core multiprocessing, where two processors are put on a single chip. A core-multiprocessing chip can provide a significant performance boost for multithreaded applications. However, at double the size of a single-core chip, it is expensive to manufacture, and the approach does nothing to improve the ratio of performance to chip size and power.
Another approach is to allow a single processor to execute multiple threads by switching between them. In time-slice multithreading, the processor switches between software threads after a fixed time period. Switch-on-event multithreading switches threads on long-latency events such as cache misses. However, both techniques deliver less than optimal performance, because their switch conditions do nothing about other significant sources of inefficient resource usage, such as branch mispredictions and instruction dependencies.
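The two switching policies can be sketched in a few lines of Python. This is a toy model only, not a description of any real processor: the function names, the two-thread default and the idea of listing "miss cycles" up front are all assumptions made for illustration.

```python
def time_slice_schedule(num_cycles, quantum, num_threads=2):
    """Time-slice multithreading: switch threads after a fixed number of cycles.

    Returns the index of the thread that owns the processor in each cycle.
    """
    return [(cycle // quantum) % num_threads for cycle in range(num_cycles)]


def switch_on_event_schedule(num_cycles, miss_cycles, num_threads=2):
    """Switch-on-event multithreading: switch only when the running thread
    hits a long-latency event (here, a hypothetical set of cache-miss cycles).
    """
    current = 0
    schedule = []
    for cycle in range(num_cycles):
        schedule.append(current)
        if cycle in miss_cycles:
            # The running thread stalls, so hand the processor to the next one.
            current = (current + 1) % num_threads
    return schedule


time_sliced = time_slice_schedule(num_cycles=6, quantum=2)
event_driven = switch_on_event_schedule(num_cycles=6, miss_cycles={1, 4})
```

In both schedules exactly one thread owns the processor in any given cycle, so execution units left idle by a branch misprediction or an instruction dependency in that thread stay idle.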
One alternative is hyperthreading technology, as implemented in Intel's Xeon MP, which allows a single processor to dynamically execute multiple threads at the same time. It makes more effective use of processor resources to maximize performance relative to chip size and power.
Hyperthreading technology makes a single physical processor appear as multiple logical processors; simply put, the physical execution resources are shared and the architecture state is duplicated for each logical processor. From a software or architecture perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on conventional physical processors. From a microarchitecture perspective, this means that instructions from both logical processors will persist and execute simultaneously on shared execution resources.
A first implementation of hyperthreading technology will be available on a server processor, the Intel Xeon Processor MP, with two logical processors per physical processor. Each logical processor maintains a complete set of architecture state. The architecture state consists of registers including the general-purpose registers, control registers and some machine state registers. From a software perspective, once the architecture state is duplicated, the processor appears to be two processors.
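As a rough illustration of "duplicated architecture state, shared execution resources," the split might be modeled as below. This is a sketch under stated assumptions: the class names are hypothetical, only a handful of register names are listed as examples, and the count of three execution units is an illustrative figure rather than a statement about the Xeon processor MP.

```python
from dataclasses import dataclass, field


def _blank_registers(names):
    """Build a register file with every named register set to zero."""
    return {name: 0 for name in names}


@dataclass
class ArchitectureState:
    """State kept separately for each logical processor (illustrative subset)."""
    general_purpose: dict = field(
        default_factory=lambda: _blank_registers(["eax", "ebx", "ecx", "edx"]))
    control: dict = field(
        default_factory=lambda: _blank_registers(["cr0", "cr3"]))


@dataclass
class PhysicalProcessor:
    """One physical processor: shared execution units, two copies of state."""
    execution_units: int = 3  # assumed width, shared by both logical processors
    logical_processors: list = field(
        default_factory=lambda: [ArchitectureState(), ArchitectureState()])


cpu = PhysicalProcessor()
# Writing a register on logical processor 0 leaves logical processor 1 untouched.
cpu.logical_processors[0].general_purpose["eax"] = 42
```

Software sees two independent register sets, which is why the operating system can treat them as two processors, while the single `execution_units` field stands in for the hardware both share.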
To understand how instructions from two logical processors can execute simultaneously and better utilize the available hardware execution resources, it helps to view resource utilization graphically as a set of blocks with rows and columns. Each set of blocks represents one processor's execution resource utilization over time. For a superscalar processor that can execute three instructions each clock cycle, there are three columns. Each row represents a different clock cycle.
A conventional superscalar processor can execute only one software thread at a time but at a peak rate of three instructions every clock cycle. The first set of blocks represents a conventional superscalar processor executing an orange thread. In the first clock cycle, represented by the first row, the orange thread uses only two of the execution units and the middle execution unit is idle.
In the next cycle the thread does not execute any instructions. The instruction execution rate of applications varies significantly from application to application, but over time there may be a fair number of cycles with idle execution units.
The next set of blocks represents the instruction execution flow of a conventional multiprocessor with two processors. The peak execution bandwidth is six instructions every clock cycle. With the two processors executing orange and green software threads, respectively, peak execution rate is rarely achieved in this example.
The third set of blocks represents the instruction execution flow of a processor with hyperthreading technology. Here the orange and green threads are executing simultaneously and therefore can make more efficient use of the execution resources. Similarly, a multiprocessor with hyperthreading technology may also execute closer to its peak rate of six instructions per cycle.
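The three sets of blocks can be approximated numerically. The sketch below is a toy model with made-up numbers: the per-cycle instruction counts for the orange and green threads are hypothetical, and the model simply caps each cycle at the processor's width, ignoring the fact that instructions left over would really wait for a later cycle.

```python
WIDTH = 3  # instructions per clock cycle, as in the three-column example


def utilization(per_cycle_demands, width=WIDTH):
    """Fraction of execution slots used on one processor.

    per_cycle_demands holds one tuple per clock cycle; each tuple lists how
    many instructions every thread sharing the processor has ready that cycle.
    """
    used = sum(min(sum(demand), width) for demand in per_cycle_demands)
    return used / (width * len(per_cycle_demands))


# Hypothetical ready-instruction counts for four clock cycles.
orange = [2, 0, 1, 3]
green = [1, 2, 0, 1]

# Conventional superscalar processor running only the orange thread.
single = utilization([(o,) for o in orange])

# Conventional multiprocessor: one thread per processor, averaged over both.
multi = (utilization([(o,) for o in orange]) +
         utilization([(g,) for g in green])) / 2

# One processor with two logical processors sharing the same three slots.
hyperthreaded = utilization(list(zip(orange, green)))
```

With these invented numbers the shared processor fills more of its slots than either conventional arrangement, because cycles in which one thread has little or nothing ready are filled by the other.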
The microarchitecture implementation of the Intel Xeon processor MP conceptually has five stages: instruction fetch, register rename, scheduling, execution and retirement. The goal is to feed instructions to the core of the processor (the scheduling and execution stages) as fast as possible by keeping the scheduler's queues full of instructions from both logical processors. Before and after the execution core there are a few points at which the hardware selects between instructions from the two logical processors. Selection points typically alternate between the two logical processors every clock cycle unless one logical processor is stalled.
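The alternation rule at a selection point can be sketched as follows. This is a simplified software model of the behavior described above, not the actual hardware logic; the function names and the per-cycle stall tuples are assumptions for illustration.

```python
def select_logical_processor(last_selected, stalled):
    """Choose which logical processor (0 or 1) a selection point serves.

    stalled is a pair of booleans, one per logical processor. The point
    alternates by default, stays with the other processor if the preferred
    one is stalled, and returns None if both are stalled.
    """
    preferred = 1 - last_selected
    if not stalled[preferred]:
        return preferred
    if not stalled[last_selected]:
        return last_selected
    return None


def run_selection(stall_schedule):
    """Apply the rule cycle by cycle; returns the pick made in each cycle."""
    last = 1  # start so that logical processor 0 is served first
    picks = []
    for stalled in stall_schedule:
        choice = select_logical_processor(last, stalled)
        picks.append(choice)
        if choice is not None:
            last = choice
    return picks


# Logical processor 1 is stalled during the second and third cycles.
picks = run_selection([
    (False, False),
    (False, True),
    (False, True),
    (False, False),
])
```

While logical processor 1 is stalled, logical processor 0 receives every cycle rather than every other one; alternation resumes as soon as the stall clears.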


See related chart
