Design Article
Using nextgen PCI Express switches to eliminate network I/O bottlenecks
Steve Moore, PLX Technology
2/6/2008 3:43 PM EST
For example, when a few bytes of Ethernet data get stuck behind large packets of FC data in the root complex, the latency that is introduced by this congestion will severely impact system response time and create bandwidth limitations (see Table 1 below).
Table 1. Ethernet latency bandwidth tradeoffs
The next generation of PCI Express (PCIe) switches has added many new features to mitigate the effects of processing competing data protocols, thereby improving overall system performance.
Advanced new features such as Read Pacing, enhanced port configuration flexibility, dynamic buffer memory allocation, and the deployment of PCIe Gen2 signaling are reducing I/O bottlenecks, providing dramatic improvements in system performance in server and storage controllers.
Performance Limited by "Endpoint Starvation"
When two or more endpoints are connected to a root complex through a
PCIe switch with unbalanced upstream and downstream link widths (and
hence unbalanced bandwidths), and the endpoints issue uneven numbers of
read requests, one endpoint inevitably dominates the bandwidth of the
root complex queue. The other endpoints suffer reduced performance as a
result. This is known as "endpoint starvation," and it can make the
system appear congested and unable to perform optimally.
Figure 1 below shows a typical root complex connected to two endpoints
through a PCIe switch. In this example, there is an x8 upstream port and
two x4 downstream ports. The FC HBA is a good example of an endpoint
that could dominate the bandwidth of the root complex queues.
Here, the FC HBA makes several 2KB read requests, which the root complex queues until its queues are full.
Figure 1. Endpoint starvation
While the queues are full, the Ethernet NIC makes two 1KB read requests. The NIC must wait for the root complex to service all of the FC HBA's read requests before its own requests are serviced. Thus the NIC is "starved."
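The starvation effect can be illustrated with a minimal Python sketch. It is not a model of any real root complex: it assumes a strictly first-in, first-out queue, measures waiting in KB of completion data rather than time, and takes "several" to mean four FC HBA reads. The function name `fifo_service_order` is hypothetical.

```python
from collections import deque

def fifo_service_order(requests):
    """Serve read requests strictly in arrival order.

    Returns a list of (endpoint, wait_kb) pairs, where wait_kb is the
    amount of completion data (in KB) returned to other requests before
    this one starts being serviced.
    """
    queue = deque(requests)
    served_kb = 0
    waits = []
    while queue:
        endpoint, size_kb = queue.popleft()
        waits.append((endpoint, served_kb))
        served_kb += size_kb
    return waits

# Assumed arrival order: four 2KB reads from the FC HBA fill the queue,
# then the Ethernet NIC issues its two 1KB reads.
arrivals = [("FC", 2)] * 4 + [("NIC", 1)] * 2
waits = fifo_service_order(arrivals)

# KB of FC data that must be returned before the NIC's first read is served.
nic_first_wait = next(wait for endpoint, wait in waits if endpoint == "NIC")
```

Under these assumptions, the NIC's small reads sit behind all of the FC HBA's completion data, and the wait grows with every additional FC request queued ahead of them.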
Read Pacing "Feeds" the Starving Endpoint
Endpoint starvation is solved, and the endpoint is "fed," with a new
PCIe switch feature called Read Pacing, which is available on the
latest Gen2 PCIe switches.
Read Pacing provides increased system performance with a more balanced allocation of bandwidth to the downstream ports of the switch. With Read Pacing, the switch can apply rules to prevent one port from overwhelming the completion bandwidth or buffering in the system.
Figure 2 below shows the same example, with an FC HBA and an Ethernet NIC on the downstream ports of a switch that aggregates traffic into a root complex. The FC HBA makes several 2KB read requests.
Figure 2. Read pacing eliminates endpoint starvation
With Read Pacing, the switch limits how many of the FC HBA's read requests are forwarded to the root complex at a time. Programmable registers in the switch set this limit.
As the Ethernet NIC makes its two 1KB read requests, the switch allows both read requests through, thus balancing the flow of data from both endpoints. As shown in Figure 2, a 2KB read for the FC HBA through the root complex is immediately followed by two 1KB reads for the Ethernet NIC, resulting in balanced traffic for each endpoint.
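The forwarding order described above can be sketched in Python. This is only an illustration under stated assumptions: the per-port limit is modeled as a simple cap on reads forwarded per turn, the ports are visited in a fixed order, and the limit values (one read for the FC HBA, two for the NIC) are chosen to match the example. The actual register semantics of a Read Pacing switch are not described in this article, and `paced_forward_order` is a hypothetical name.

```python
from collections import deque

def paced_forward_order(port_requests, max_reads_per_turn):
    """Forward pending reads port by port, capped per port per turn.

    port_requests: dict mapping port name to a list of read sizes in KB.
    max_reads_per_turn: dict mapping port name to its forwarding cap
        (a stand-in for the switch's programmable limit).
    Returns the order in which reads reach the root complex.
    """
    queues = {port: deque(reads) for port, reads in port_requests.items()}
    order = []
    while any(queues.values()):  # an empty deque is falsy
        for port, queue in queues.items():
            for _ in range(max_reads_per_turn[port]):
                if not queue:
                    break
                order.append((port, queue.popleft()))
    return order

# Assumed workload: four 2KB FC HBA reads and two 1KB NIC reads pending.
pending = {"FC": [2, 2, 2, 2], "NIC": [1, 1]}
limits = {"FC": 1, "NIC": 2}  # hypothetical pacing limits
order = paced_forward_order(pending, limits)
```

With these limits, one 2KB FC read is followed immediately by both 1KB NIC reads, and the remaining FC reads then proceed without further interruption.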
Read Pacing allows the Ethernet NIC to be serviced more frequently without impacting the bandwidth of the FC HBA. Hence, endpoint starvation is eliminated with Read Pacing. The chart below compares the performance achieved with and without Read Pacing in a real-world system, where the FC HBA issues sixteen 4KB read requests ahead of a single 1KB read request from the Ethernet NIC.
Increase Performance by Optimizing Buffer Size Dynamically
Early PCIe switch architectures provided each port with a fixed amount
of buffer RAM. Figure 3 below compares the fixed buffer allocation
typical of older switch designs with the new Dynamic Allocation scheme
found in the latest Gen2 switches.
Figure 3. Dynamic allocation leads to more buffers
In this example, a six-port switch is designed with a total of 30
packet buffers, with five buffer segments available on each port. If
only four ports are used, then the buffers allocated to the two unused
ports are wasted.
Since a larger buffer will translate into better performance, it would be nice if that unused memory could be used to increase the size of the buffers on the four ports that are being used.
In the latest Gen2 switches, it is possible to do just that. This
feature is known as Dynamic Buffer Allocation, where a shared memory
pool is available to any port, and the size of the buffer is allocated
dynamically depending on the number of ports in use.
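The arithmetic behind the six-port example can be worked through in a short Python sketch. It assumes the shared pool is split evenly across the active ports, which is a simplification chosen for illustration; a real switch may divide the pool differently. Both function names are hypothetical.

```python
def static_allocation(total_buffers, total_ports, active_ports):
    """Fixed scheme: every port owns an equal share, used or not.

    Returns (buffers_per_port, buffers_wasted_on_unused_ports).
    """
    per_port = total_buffers // total_ports
    wasted = per_port * (total_ports - active_ports)
    return per_port, wasted

def dynamic_allocation(total_buffers, active_ports):
    """Shared-pool scheme, assuming an even split across active ports.

    Returns (buffers_per_active_port, buffers_left_in_pool).
    """
    return divmod(total_buffers, active_ports)

# Six-port switch with 30 packet buffers, only four ports in use.
static_per_port, static_wasted = static_allocation(30, 6, 4)
dynamic_per_port, dynamic_spare = dynamic_allocation(30, 4)
```

In the fixed scheme, each port gets 5 buffers and 10 sit idle on the two unused ports. With an even split of the shared pool, each active port gets 7 buffers and 2 remain available in the pool.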







