Design Article
Implementing a Heap-Based WRR Scheduling Mechanism: Part 2
Sailesh Kumar and Prosun Chatterjee, Paxonet Communication
8/28/2003 5:39 AM EDT
The weighted round-robin (WRR) scheduling mechanism has become a mainstay technology for today's networking switch and router designs. To meet the growing performance demands of high-speed networking equipment, a heap-based WRR architecture has been developed that that can perform true WRR based selection at very high speeds for large number of queues with very high disparity in weights.
This is the second installment in our look at the proposed WRR architecture. In Part 1 of this paper, we provided an overview of the typical WRR architecture and discussed the proposed implementation. Now, in Part 2, we'll further the discussion by looking at some limitations with the architecture as well as solutions to these limitations. We'll also examine jitter and fairness performance as well as how this architecture stacked up against another WRR method.
Let's kick off the discussion by the limitations designers will encounter with the proposed heap-based WRR algorithm.
Limitation 1: Node Swapping Clashes
As Part 1 of this article pointed out, the proposed heap-based WRR queuing architecture works by comparing and swapping nodes. There are, however, some limitations in the seemingly straightforward comparison and swapping of nodes. One of them is due to the clashes that might occur while making any node swapping decisions. Another limitation may arise due to the complexity involved in the computation of ΣWi (active) to be used in step 4. Let's look at each of these two issues, starting with the clashing problem.
A sub-section of the tree data structure with three nodes and 5 branches is shown in Figure 2.

Consider a case in which, the comparison results indicate that the node 1 has to be moved up while node 2 and 3 also needs to be moved up. To make the situation even more complex, the lower comparators may indicate that the nodes 2 and 3 are supposed to be pushed down. In such a scenario, swapping logic across only single branch will create clashes, multiple drivers, and might result in spurious swapping.
To resolve the issue of clashing swapping results, a solution can be to block some of the comparators and swappers while others are active. Some other solutions include:
1. A possible solution is to shutdown the alternate layer comparison and swapping logic. However, even this doesn't solve the problem completely. Cases might appear when Ci's of both of the lower nodes are greater than Ci of the upper node and therefore both the lower nodes need to be pushed up. In order to resolve the clash, an additional comparator for comparing the lower two nodes is required. Thus, the hardware overhead will be (N-1)/2 additional comparators. In addition, since the layers are activated alternately, the sorting time will double up to O(2Log2N). This implementation is referred to as Layered WRR implementation. We will show the performance primitives of this implementation in a later section.
2. Another solution is to come up with a sequence in which the branches are activated so that it will resolve any possible clash. The order in which comparators and swappers are enabled and disabled is depicted in Figure 3.

We have devised equations that depict the sequence in which the comparators are made active. The diagram above shows that this algorithm eliminates any clash but on the other hand, the sorting time triples up to O(3Log2N). This scheme doesn't have any hardware overhead.
The implementation shown in Figure 3 is referred to as non-clashing sequence WRR implementation. Performance primitives of this implementation are also given in later sections and are compared with the layered WRR implementation.
Limitation 2: Computational Complexity
Another limitation that might arise is due to the fact that after each selection, the urgency counter of the selected node has to be decremented by a factor equal to ΣWi (active). However, the computation of ΣWi (active) might take more than a single time interval depending on the width of weight and number if queues. Therefore, this computation, which has to be carried out after each selection, may become the bottleneck and its computation time will become the actual selection interval.
A possible solution to this problem is to implement pipelined arithmetic components, which will compute the result in every time interval after an initial latency. However, the pipelined results might become incorrect after any change in the queue activity status.
The proposed WRR architecture is tolerant to any change in the queue activity status at the cost of a little compromise in the fairness. Furthermore, simulation results prove that the compromise in fairness is small and is a function of variance in queue activity.
To achieve these performance levels, a pipelined arithmetic addition scheme is implemented that uses the Wi's and the queue activity status to calculate the ΣWi (active). The implementation is shown in Figure 4.

Figure 4 shows an m-deep pipelined adder. Each pipeline stage performs an N/m input addition of Wi's. A total of m different weights (Wi's) are kept at each of the pipeline stages. Since the implementation is pipelined, a change in queue activity status will cause improper computation results for a time equal to the pipeline depth. An improper computation can violate the axiom 1 that the ΣCi is constant and equals the initial sum. One might think that after large number of such violations, the mean Ci (or ΣCi) will deviate from initial value and the urgency counters will start under flowing or overflowing. But we have seen in the simulations as well as we shall prove that the ΣCi deviation from the initial value will always be bounded and will tend towards the initial mean ΣCi.
Consider a case when all the queues are active and suddenly all the queues whose weights are placed at the very first pipeline stage of the adder becomes inactive. Thus, the ΣCi will decrement by a value X (where X is the sum of weights of all the queues that have just become inactive) for an interval equal to pipeline depth m. This will happen because the now inactive queues will not see any gain in their urgency counters (placed in the heap binary tree), while the currently selected queue's urgency counter will be decremented by additional ΣWi (now inactive) for m time intervals. Thus, the ΣCi has violated the axiom 1 and has digressed from the initial value.
However, if seen carefully, it can be found that when these inactive queues again becomes active, they will contribute to an equal gain in the ΣCi for same time interval m. This is because in the pipelined addition implementation, each of the queues weights always lies at the same stage.
Thus, if a queue i at stage x in the adder (where 0
However, since the change is there in ΣCi instead of individual Ci's, the probability of the individual Ci's getting over (under) flown is less likely. In simulations, we have never seen any of the counters getting saturated at the lower or upper bounds. But, to be on the safe side, an increase of 1-bit in each of the Ci will eliminate this problem. In addition it is recommended to put larger Wi's at the end of the pipelines so that the impact of change in queue activity status is minimal.
There can be some other possible solutions to this problem, like implementing an equal m-deep queue activity status pipeline. The activity status lying in the pipeline will be used in the heap tree, and thus using a same stale status vector in both the binary tree as well as addition pipeline. This scheme will make ΣCi always constant and hence will eliminate any need to increase the size of urgency counters.
However, both of the above schemes will have an impact on the fairness, because after each change in the queue activity status, the selected queue's urgency counter is decremented by a spurious ΣWi (active). Thus, the toll on the fairness is proportional to the variance in the queue activity. We have shown the compromise in the fairness as a function of variance in queue activity in subsequent sections.
Performance Primitive 1: Fairness
The variance in the queue activity status is benchmarked with respect to the events of a queue getting inactive. The event of getting active is not taken into account, because over a large interval of time, the total events of activity and inactivity are approximately same. If Ti is the time when a queue becomes inactive for the ith time then, the variance in queue activity status is:
The benchmark of fairness is an ideal implementation of WRR (without any "additional operator" latency) with similar configuration and queue excitation vector. The normalized fairness for all the N queues is expressed in the following equation
We have found that normalized fairness f is independent of the variance in queue weights and is approximately inversely proportional to the variance in queue activity status σ. However the relationship heavily depends on the value of m in the addition operator pipeline. The waveform of our simulation results depicting the f as a function of σ is shown in Figure 5. The result shown below is computed after running the model for 10,000 iterations, with 64 queues and m equal to 8. In general we have observed that the fairness also increases with total number of iterations.
In any urgency counter-based WRR implementation, fairness is not a concern for lesser variance in the queue activity status. In realistic situations, even a variance of 0.2 is considered high (in our benchmark) and we have seen that fairness is hardly affected at such levels of variance.
Another performance primitive namely jitter is highly implementation specific and the aim should be to minimize the jitter. The faster the sorting algorithm works, the lesser the jitter will be. Realistically, for faster selections, the sorting has to very fast and various pipelined sorting algorithm has been proposed.1,2,3 However, the change is the queue activity disrupts the normal operation of all these pipelined approaches. The algorithm that is most resilient to high variance in queue activity status and emerges faster from changes in the queue activity status generally has lesser jitter. In general a faster sorting algorithm is invariably less prone to be affected by high variance in queue activity status.
Performance Primitive 2: Jitter
Unlike the existing work, we do not approximate the background traffic with a known and simple distribution. Rather we assume that the traffic from a particular queue is directly proportional to the number of selections the queue has. In addition, we have modeled a threshold based random distribution for making an inactive queue active. A queue is inactivated after it has been selected a random number of times. Initially all the queues are assumed to be inactive. The variance in the queue activity status is benchmarked as explained above in fairness computation.
Our mathematical model of jitter is normalized and it always lies between 0 and 1. If Ti is the time stamp when a queue is selected ith time, then the normalized jitter for this queue can be expressed as:
Also, whenever, a queue becomes active, fresh jitter computation is initiated and the new jitter is normalized with the previously calculated jitter. In our simulations, we have calculated the jitter for each of the queues individually and then drawn waveforms of its relationship to the queue's weight. The waveforms are shown in the Figures 6 and 7.
In Figure 7, the jitter has been shown only for the small variances in the queue activity. This is because, for higher variances, the queues become inactive just after few selections. This causes improper computation of jitter mostly related to the computation overhead at the beginning. Therefore, in order to enable sustained jitter computation for large number of selections, the variances are kept low. In realistic situations also, generally variances are very low. In cases where the queue activity variances are high, a relatively higher jitter is also acceptable.
In general, we have observed the following phenomena:
WRR Performance Comparison
The figures mentioned in the table are approximate and depicts only the comparative performance of the two WRR implementations.
A software implementation of heap-based WRR algorithm might conceptually indicate that the complexity has decreased considerably. However, when viewed from the real computational time perspective, the generic microprocessor based systems perform a single comparison in one machine cycle time. Thus in real, the time taken to sort a heap comes out to be equivalent to time taken by system to perform NLog2N comparisons. However, when implemented in hardware, the algorithm works considerably faster, as all the comparators operate in parallel. Since there are N comparators, the algorithm complexity decreases by a factor of N. The current scenario also illustrates the fact that custom ASICs targeted for specific applications enhances the performance considerably as compared to a generic microprocessor based solution.
In addition, the jitter seen in the WRR model is tolerable even for large number of queues and high queue activity variance. The fairness is also guaranteed since any urgency counter based implementation inherently is fair to all the queues. Our model has shown that the fairness is minimally affected when we implement pipelined schemes for urgency counters processing. Even though the implementation results in considerably higher gate counts, an extremely fast operation that can support scheduling at very high bandwidths justifies it.
Wrap Up
Editor's Note: To view Part 1 of this article, click here.
References
About the Authors
Prosun Chatterjee is a team leader at Paxonet Communications. Prosun can be reached at prosun@pune.paxonet.com.
The heap-based WRR implementation described in this article has proven to be fair to all the queues for a large variance in the queue activity status and is almost independent of the queue weights. Inherently any urgency counter based WRR implementation is absolutely fair because it works on the history of the queue activity and selection order. If at any time, a queue suffers momentarily then its urgency counter increases by the amount it has suffered. Afterwards, that queue is compensated with more selections. However, as explained above, because of the addition operation pipeline the fairness gets compromised a little bit. We have provided simulation results of fairness as a function of variance.

where N = the number of inactivity events.

where Si(idea)l = no of ideal selections of i queue, Si(real) = no of selections of ith queue in our WRR.

Many authors have addressed the problem of benchmarking jitter. 3,6,7,8,9,10,15. The jitter calculation in this paper departs from many of the existing work in number of directions. First, we consider the case where the individual queues are periodic. Next, we assume that for each selection of queue an equal amount of data is sent.

where Di = Ti-Ti-1


We have modeled both of the above-mentioned WRR algorithms and synthesized them. The pipelined addition was used in both the models. The comparative performances of the algorithms are shown in Table 1.
In this two-part paper, we have proposed an architecture, which performs true WRR based selection at very high speeds for large number of queues with very high disparity in weights. Though many such schemes are already in existence, the paper discusses some of the bottlenecks in actual implementation, and then addresses them. A number of different possible schemes have been proposed and comparative performances are given. The scheme proposed here is useful for very high-speed applications. Lower speeds may use alternative schemes that yield lower gate counts. For applications that need scheduling to be performed seamlessly at speeds as high as 160 Gbit/s and above, the current scheme can be extremely useful.
Sailesh Kumar is a senior design engineer at Paxonet Communications (India) Ltd. Sailesh can be reached at sailesh@pune.paxonet.com.



