
Multithreading optimizes servers

John M. Borkenhagen, Steven Kunkel, Senior Technical Staff Members, IBM Server Group, Austin, Texas

1/9/2002 7:52 AM EST

New applications that servers must handle are often large and function-rich; they use many operating system services and access large databases. Not only does that make the instruction and data working sets large, but the workloads are also inherently multiuser and multitasking. The large working set and high frequency of task switches cause high cache-miss rates, and such applications may also have data that is frequently read-write shared.

The multiprocessor configurations typical of commercial servers can make the miss rates significantly higher, and branch-prediction rates can be poor because of the large instruction working set. Those characteristics all hurt the processor's performance. Moreover, current trends in application characteristics and languages are likely to make this worse. For example, object-oriented programming with C++ and Java uses virtual-function pointers that did not exist in the languages used in older applications. Virtual-function pointers lead to branches that can have very poor prediction rates. Dynamic memory allocation is also more frequent in these languages than in older ones, which leads to more allocation of memory from the heap. Memory from the heap is more scattered than memory from the stack, which can cause higher cache-miss rates.

Such considerations have led IBM engineers to enhance the PowerPC architecture used at the core of such systems as its pSeries and iSeries servers. The most significant enhancement is the use of coarse-grained multithreading, which enables the processor to execute useful instructions during cache misses. This provides a significant throughput increase while adding less than 5 percent to the chip area and having very little impact on cycle time. When compared with other performance-improvement techniques, multithreading yields an excellent ratio of performance gain to implementation cost.

Although multithreading is not a new idea, it had not previously been used in mainstream processors, nor in processors targeted at commercial server applications.

An important aspect of commercial servers is the ability to run previously compiled applications without changes. To minimize the impact on software, it was decided that the multiple threads would appear to software as multiple processors. As a result, only the small part of the operating system that deals with task dispatching and interrupts had to be modified.

The multiuser, multitasking nature of commercial server workloads provides an abundance of natural thread-level parallelism, which keeps the multiple threads in the hardware occupied without requiring applications or the operating system to be further parallelized. Because all commercial servers are already multiprocessors, making the multiple threads per processor look like multiple processors to the software did not require any change in the applications and required very little change in the operating system.

Of course, a single task could be parallelized into multiple threads to increase the performance of that single task.

Another problem that affects software is performance scalability on a multiprocessor system. It is more difficult for software to scale well on a large number of "processors." To minimize potential application scalability problems, the number of threads per processor is kept small.

In commercial servers, system throughput is the primary measure of performance, but single-task execution speed must also be competitive. Several decisions were made to ensure that the performance of a single task would be acceptable. Most significantly, the area on the chip devoted to multithreading had to be small, as did the cycle-time impact.

To keep the area impact small, two threads are implemented. While more threads yield more throughput, performance analysis showed that two threads achieve most of the performance gain and that the benefit from each additional thread decreases. Also to keep area small, little more than the architected state of the task is duplicated. That is, the general-purpose registers, floating-point registers and most special-purpose registers are duplicated, but little else. Other major facilities are shared between the two threads.
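The diminishing return from each additional thread can be illustrated with a simple textbook-style model. This is a sketch under stated assumptions, not the performance analysis the authors ran: it assumes each thread is stalled on a cache miss for a fixed fraction of the time, independently of the other threads, so the processor does useful work whenever at least one thread is ready. The function names and the 50 percent stall fraction are illustrative only.

```python
def utilization(stall_fraction: float, threads: int) -> float:
    """Fraction of cycles in which at least one thread is ready to run.

    Assumes (hypothetically) that each thread is stalled with probability
    `stall_fraction` and that the threads stall independently.
    """
    return 1.0 - stall_fraction ** threads


def marginal_gains(stall_fraction: float, max_threads: int) -> list[float]:
    """Extra utilization contributed by the 1st, 2nd, ... thread."""
    gains = []
    previous = 0.0
    for n in range(1, max_threads + 1):
        current = utilization(stall_fraction, n)
        gains.append(current - previous)
        previous = current
    return gains


if __name__ == "__main__":
    # Illustrative stall fraction; real workloads must be measured.
    for n, gain in enumerate(marginal_gains(0.5, 4), start=1):
        print(f"thread {n}: +{gain:.4f} utilization")
```

Under these assumptions the second thread adds half as much utilization as the first, and each further thread adds half as much again, which is the shape of the curve the paragraph above describes.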

Performance analysis also showed that there is only a small effect on the miss rates of the caches, particularly the Level 2 cache, which has the longest latency for a miss.

An important aspect of single-thread performance was that the single task also had to be able to consume all of the resources of the processor when needed. In fine-grained multithreading a different thread is executed every cycle. While fine-grained multithreading covers control and data dependencies quite well, the impact of cycle interleaving on single-task performance was deemed too large. As a result, the processor is designed to exploit coarse-grained multithreading. In such multithreading, a single thread, the foreground thread, executes until some long-latency event such as a cache miss occurs, causing execution to switch to the background thread.

If there are no such events, a single thread can consume all execution cycles. This minimizes the impact on single-task execution speed, making it performance-competitive with nonmultithreaded processors. Similar performance characteristics are achieved with simultaneous multithreading, but because the processor executes instructions in order, coarse-grained multithreading is the natural choice. In an out-of-order processor, simultaneous multithreading would be the choice.
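The coarse-grained scheme described above can be sketched as a small cycle-counting simulation. It is a toy model, not the actual hardware: instruction traces are strings in which `.` is an instruction that hits in the cache and `M` is one that misses, and the names `run_coarse_grained`, `miss_latency` and `switch_penalty` are assumptions made for illustration. The running thread keeps the processor until it misses; the core then switches to another ready thread, or idles if none is ready.

```python
def run_coarse_grained(traces, miss_latency, switch_penalty=0):
    """Return the cycles an in-order core needs to finish every trace.

    traces: one string per hardware thread; '.' hits, 'M' misses.
    A thread runs until it misses, then another ready thread takes over.
    """
    n = len(traces)
    pc = [0] * n          # index of the next instruction in each trace
    ready_at = [0] * n    # cycle at which each thread's miss data returns
    cycle = 0
    active = 0            # the foreground thread

    while any(pc[t] < len(traces[t]) for t in range(n)):
        unfinished = [t for t in range(n) if pc[t] < len(traces[t])]
        runnable = [t for t in unfinished if ready_at[t] <= cycle]
        if not runnable:
            # Every unfinished thread waits on memory: idle until one returns.
            cycle = min(ready_at[t] for t in unfinished)
            continue
        if active not in runnable:
            # Long-latency event (or thread finished): switch threads.
            active = runnable[0]
            cycle += switch_penalty
        op = traces[active][pc[active]]
        pc[active] += 1
        cycle += 1
        if op == "M":
            ready_at[active] = cycle + miss_latency

    # A trailing miss must still complete before the work is done.
    return max([cycle] + ready_at)


if __name__ == "__main__":
    task = "..M..M"
    print("two tasks back to back:", run_coarse_grained([task * 2], 10))
    print("two hardware threads:  ", run_coarse_grained([task, task], 10))
```

With this illustrative trace and a 10-cycle miss latency, running the two tasks back to back on one thread takes 52 cycles, while running them as two hardware threads takes 29, because each thread executes during the other's misses. A trace with no misses keeps the processor for its entire length, matching the single-task behavior described above.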

The use of coarse-grained multithreading enables a single thread to consume all execution cycles, but it does so only if that thread has no events that trigger a thread switch. To give the task (program) a degree of control over execution speed, multiple priority levels are implemented. Letting a task set its priority low or high allows it to consume either very few of the execution cycles or most of them, because the priority restricts which events trigger a thread switch.
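One way to picture priority levels that restrict switch events is a lookup from priority to the set of events that make the running thread give up the processor. The sketch below is purely illustrative: the article does not list the actual priority levels or events, so the `Priority` values and the event names (`l1_miss`, `l2_miss`, `tlb_miss`) are hypothetical.

```python
from enum import Enum


class Priority(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical mapping: the higher the running thread's priority, the
# fewer events are allowed to take the processor away from it.
SWITCH_EVENTS = {
    Priority.LOW: {"l1_miss", "l2_miss", "tlb_miss"},
    Priority.MEDIUM: {"l2_miss", "tlb_miss"},
    Priority.HIGH: {"l2_miss"},
}


def should_switch(priority: Priority, event: str) -> bool:
    """True if `event` makes a thread running at `priority` yield."""
    return event in SWITCH_EVENTS[priority]


if __name__ == "__main__":
    for level in Priority:
        print(level.name, sorted(SWITCH_EVENTS[level]))
```

In this sketch a high-priority thread yields only on the longest-latency event and so keeps most of the execution cycles, while a low-priority thread yields on every listed event and consumes few.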

The effect of multithreading on response time to the user is a concern, because a task appears to execute more slowly. While the implementation allows a single task to have competitive performance, it does so by allowing the lower-priority thread on the processor to have very few execution cycles, in which case the throughput increase is small. If all tasks use high priority, the purpose of priority is defeated and nothing is gained. Good user-level response time therefore cannot be maintained by using priority. In commercial servers the user-level response time is dominated by disk access time and network delays; these are not affected by multithreading. Performance analysis of the processor portion of user-level response time showed that response time improves with multithreading for most levels of utilization.

---

Richard J. Eickemeyer, senior engineer, IBM Server Group, Rochester, Minn.; and Ronald N. Kalla, senior engineer, IBM Server Group, Austin, Texas, contributed to this article.

See related chart




