Design Article

Timing Closure Using Layout Based Design Process

Chung-Kuan Cheng

5/4/2001 12:00 AM EDT

 
ABOUT THE AUTHOR

Chung-Kuan Cheng received a Ph.D. degree in electrical engineering and computer sciences from University of California, Berkeley in 1984. From 1984 to 1986 he was a senior CAD engineer at Advanced Micro Devices. In 1986, he joined the University of California, San Diego, where he is a Professor in the Computer Science and Engineering Department, an Adjunct Professor in the Electrical and Computer Engineering Department. He served as a chief scientist at Mentor Graphics in 1999. Chung-Kuan Cheng can be contacted at kuan@cs.ucsd.edu.
 

A successful timing closure design process starts with a clear timing specification. Convergent design iterations require tightly coupled performance-driven synthesis, circuit extraction, and timing analysis. The coupling between different design processes rests on physical layout information.

With the advent of deep submicron technologies, the dominance of circuit performance has shifted from logic to interconnect. Therefore, designers have to shift the design paradigm from a conventional logic-dominated to an interconnect-dominated design process. While the integrated circuit is confined to a two-dimensional area, the non-overlapping physical constraint forces component spreading. Thus, physical layout determines component interconnect. In order to achieve a desired timing-closure result, designers have to adopt physical layout as the core of the whole design process The goal is to achieve timing closure under the physical-layout constraints of geometry and routing congestion. The objective is to obtain fast but consistent high-quality designs incorporating many constraints.

One fundamental issue of timing closure is the modeling of physical overcrowding. The problem involves, among other factors, the representation and the handling of layout issues. These issues include placement congestion, overlapping of arbitrary-shaped components, routing congestion due to power/ground, clock distribution, signal interconnect, prefixed wires over components, and forbidden regions of engineering concerns. While a clean and universal mathematical model of physical constraint remains open, we tend to formulate the layout problem using multiple constraints with sophisticated details that complicates the implementation.

We need to consider multiple constraints with a unified objective function for a timing-closure design process. This is essential because many constraints are mutually conflicting if we view and handle their effects only on the surface. For example, to ease the routing congestion of a local area, we tend to distribute components out of the area to leave more room for routing. However, for multi-layer routing technology, eliminating components does not save much on routing area. The spreading of components actually increases the wire length and demands more of routing space. The resultant effect can have a negative impact on the goals of the original design.

In fact, the timing can become much worse. Consequently, we need an intelligent operation that identifies both the component to move out and the component to move in to improve the design.

Architecture Design
At architecture design level, the design space is huge. Designers engineer the circuits through various tasks such as repartitioning the hierarchy, choosing functional blocks, interface considerations, and testing methods. Physical-layout information is necessary to reflect the effects of these choices in terms of timing and layout feasibility. In addition, designers document timing specification at this stage for performance evaluation.

Throughout the design process, the design keeps on changing. Some logic descriptions are not complete in the beginning. Some descriptions are composed of functional blocks with exact physical layout—these blocks are called hard blocks. One hard block may correspond to many different implementations. We therefore have to make the best choice among a set of configurations of block width, height, and performance.

Some logic descriptions correspond to a netlist of standard gates. These logic functions will be implemented by standard-cell placement and routing—soft blocks. In floorplanning, soft blocks are allocated space for standard-cell layout. We allocate the area using estimation based on the most updated knowledge of the blocks as the design progresses.

To achieve a timing-convergent operation, designers need to obtain feedback regarding their choice of architecture. As the design changes, the design tool revises the budget of area and timing according to physical constraints. In other words, component physical dimensions, interconnect widths, interconnect spacing, and clock and power/ground distribution are as important as the design’s functional description. Right from the early architectural design stage, we need to generate physical layout with global routing, interconnect optimization, RLC extraction, and timing analysis, all with incremental-change capability to dependably derive the design’s clock-cycle period.

Designers have to document a clear specification of timing requirements that describe the desired clock cycle time, number of cycles, false paths, and multiple cycle paths. Although a designer will revise a timing specification throughout the design process, it is important to propose a blueprint as early as possible. The specification constitutes a guideline of the timing-convergent process. Adding a timing requirement as an afterthought tends to be error prone and takes even more effort to remedy a problem.

Floorplanning
Floorplanning exploits the design hierarchy to reduce layout complexity. Given a circuit of functional blocks, a designer partitions chip area for allocation of hard and soft blocks. For each block arrangement, you need to organize global wires, including clocks, power/ground, and buses. Fundamentally, the handling of physical shape and area relies on elegant physical expressions.

Floorplanning searches for a physical-layout solution at a high level of the design hierarchy. Circuit size increases along the path of Moore's law and circuits with millions of transistors are becoming a normal occurrence. On the other hand, the design-issue complexity also grows exponentially due deep-submicron process technology. In order to divide and conquer the complexity, the design is handled in a hierarchy fashion.

Interconnect architecture planning is important at the top level of the hierarchical design. Because of the scale of system-on-a-chip (SoC) designs, floorplan variations have a significant effect on the timing of interconnect between functional blocks. The design of clocking strategy, communication protocols, number of buses, bus organizations, and functional partitioning along with the buses is strongly affected by the design’s physical structure. Designers thus need to search at the floorplanning stage for the best choice under the constraints of timing, routability, and power dissipation. The floorplanning routes the topology and defines the spacing of clock, power/ground, and bus interconnect lines. Therefore, the impact of clock skew, IR drop, and signal delay is judged at the floorplanning stage of a design. One conservative approach even fixes the wires at this stage This approach ensures the timing of the critical nets with consideration of all details, including crosstalk.

For timing-driven floorplanning, we have encountered enormous obstacles trying to handle arbitrarily sized blocks with artificial virtual pins. The problem is more than assembling puzzles since along with shape and area considerations there are other constraints, such as interconnect and timing. To simplify the block arrangement, one practice is to limit the layout into a slicing structure. For a slicing structure, we assume that the topology of the floorplan can be defined by a recursive slicing of the chip area with straight horizontal or vertical lines. For example, the whole chip is first divided into two rectangles by a vertical straight line. Then, each rectangle is divided into two rectangles by a horizontal line, and so on. The simplification allows the use of elegant algorithms with efficient execution time.

On the other hand, in order to achieve leading-edge competitive designs, a non-slicing structure is the way to compact the chip and optimize performance. The non-slicing structure covers the topology that cannot be divided by straight lines recursively as you can do with the slicing structure. One important issue of a non-slicing structure is its representation. A conventional method represents the physical relation between adjacent blocks with horizontal and vertical constraint graphs. However, the manipulation of the constraint graphs complicates the design’s implementation.

Recently, Murata et al introduced a sequence-pair representation that describes the floorplan with two sequences of block labels. This method simplifies the representation, but results in a high-complexity translation between the representation and the floorplan. Furthermore, the representation intrinsically covers a huge redundancy. Many representations correspond to the same floorplan. Following up the sequence-pair introduction, Guo et al devised a rooted ordered-tree representation, which is concise and efficient in terms of redundancy and transformation. This method demonstrates performance that is competitive with the quality of manual layout in terms of area and wire length. With all these developments, floorplanning becomes an excellent research topic to explore.

Logic Optimization
Logic synthesis optimizes the timing with local logic transformation. For a subcircuit under logic transformation, interconnect is partitioned into global and local signals. The global signal connects the subcircuit with external components, while the local signal connects components within the subcircuit. Extracting from interconnect-routing data, we derive the RLC values and thus the timing window of global signals. The wire-load model of local signals is estimated according to the layout distribution of the subcircuit. For logic optimization, the current state of the logic-synthesis algorithm locally reconfigures the netlist. The global timing information is thus valid at this stage.

Buffer insertion and gate sizing are two effective approaches to improve interconnect delay. However, buffer insertion or deletion changes the netlist. The design system needs to account for this insertion/deletion effect in the design’s hierarchy.

We can also take advantage of equivalent pins to improve the timing. This operation does not involve gate changes; instead, the operation identifies the set of pins that can be swapped with the equivalent logic function. Usually, the technique of testing is adopted to find the cluster of equivalent pins. In large systems, we observed many AND, OR, or XOR logic functions with a large number of inputs termed as super-gates. We implement these functions with trees of smaller gates. Note that the super-gate structure may be obscure due to a logic transformation, such as DeMorgan's rule. Once we identify the super-gate, we can swap the pins to improve the timing.

The set of equivalent pins of super-gates covers not only the primary inputs of the super-gates but also the internal pins. The swapping involving the internal pins changes the tree structure of the super-gate to improve circuit performance.

Placement
Placement is the backbone of the entire design system in the sense that it supplies basic layout information to rest of the design processes. Placement quality strongly influences the quality of the design. Optimization operations use the features of timing-driven placement, incremental placement, and artificial-constraint handling throughout the design process. Potentially, macrocell and datapath placement can further improve layout quality.

Placement Quality
For the choice of placement approaches, quality comes before other features. Placement quality greatly influences timing and routability. Interconnect length cannot be shorter than the physical distance between the two components. Thus, it is not wise to push for execution-time efficiency and sacrifice design quality. On the other hand, the placement algorithm needs to be fast enough to respond to the request of other operations so that the placement does not become the bottleneck of the design process.

Timing-Driven Placement
Timing-driven placement optimizes the timing with a routability constraint. Timing is determined by the longest and shortest path delays between chip pins and registers, and registers-to-register. One mathematical approach is to adopt the mechanism of the Lagrange multiplier (shadow price) of the constraints. The subcircuits that violate the timing constraints are assigned heavier weights. The weights of timing together with the weights of routability and area congestion provide a placement metric to search for a better solution. The net weights are revised dynamically to reflect the overall effect on the timing constraint as placement changes. When the iteration converges, the derived solution is optimal.

The success of timing-driven placement relies on the quality of the placement and the accuracy of the timing model. The quality of the placement is crucial to the speed of the circuit. If the placement cannot allocate the space and interconnect well, there are no space resources left for other optimizations to manipulate. This leaves little hope to recover circuit speed by other processes.

In terms of the accuracy of a timing model, if the timing model “misses” design-information details, such as routing topology, capacitance coupling, false paths, or the effect of perturbation on neighboring objects, the timing-driven effort may prove useless. On the other hand, because the timing-driven placement perturbs placement at the expense of wire length and routability, longer wire length and higher routing congestion is also detrimental to circuit speed. A study shows that area cost vs. speed for interconnect optimization is a function of diminishing return. As the signal delay decreases, it costs more routing area and buffers to gain lesser signal speed. After a certain point, it costs a lot of resources for very little gain. Furthermore, the added buffers and routing area may require incremental placement to adjust the layout. Suppose the delay change due to placement perturbation is larger than the estimated gain. For this case, the extra routing area and buffers may not help the signal but instead cause congestion and delay of other signals.

Incremental Placement
Incremental placement is necessary not only for engineering changes by designers but also for the optimization process. During logic optimization, clock and power/ground distribution, and interconnect optimization, as we insert buffers, size gates, and adjust routing space, area congestion and routability change. We need to revise the placement to accommodate these changes. Although designers want to minimize the perturbation in incremental placement, it is important to keep the timing-driven objective function as a guide. The incremental placement has to tie strongly to the optimization process. If the overall effect of the optimization after incremental placement is worse than before the placement, we should recover the design prior to the incremental placement and pursue other optimization operations.

Artificial Constraints
It is important to keep the true object function and communicate among the optimization processes. However, there are cases, due to the lack of an established mechanism, where people adopt artificial constraints and hope the constraints can help achieve certain design goals. For example, region constraint is used to keep a set of components within a specified area. If this set of components is connected with signals critical to the clock cycle period, the region constraint helps in the sense that it enforces this set of critical signals within the dimension of the region. Corresponding to region constraint, a group constraint gathers the set of components within a certain distance but without specifying an exact region. Since the placement is a complicated task, a well-planned region constraint can usually help the tool to find better placement solutions. On the other hand, an ill-informed constraint could be detrimental because the region constraint does not represent an exact timing requirement. As the research and development of timing-driven placement progresses, we will see less dependency on such artificial constraints. However, it will take a long time for these constraints to disappear unless the design tools can exactly comprehend all of a designer’s requirements.

Macrocells and Datapaths
A mixed-mode placement of macrocells, datapaths, and standard cells remains a research problem. The problem involves placing blocks of arbitrary sizes and shapes, a huge number of standard cells, and embedded datapath structure with certain arrangements favored by the designers. In practice, we fix the macrocell before performing the standard-cell placement. However, as the number of macrocells increases, the complexity of the whole process increases. There is room for further circuit improvement if we can perform placement of all components simultaneously.

It is desirable to arrange a datapath block according to the flow of the data to optimize circuit speed. The datapath information may be passed from a module compiler or extracted from the netlist according to certain rules. A good placement tool can usually cluster components in a datapath according to its structure. However, the tool may not be able to organize the datapath into structured arrays because of interaction with other components outside the datapath. On the other hand, array assignment is not always the optimal solution for a datapath. First, the datapath netlist may not be regular due to an automatic logic-synthesis process. Second, the datapath may be placed in an odd shaped area with no room for an exact rectangular shape. Third, the datapath may need to communicate with sources and destinations spread in different places. An exact rectangle placement may be good for internal interconnect, but harmful for such external communication. Therefore, datapath placement remains an interesting research topic.

Interconnect Analysis and Optimization
For interconnect analysis and optimization, we measure the routability of the layout and optimize the topology and buffer insertion of the nets. We use a global router to estimate routability. The global router calls interconnect optimization to synthesize the nets while avoiding routing congestion. The inserted interconnect buffers serve as anchor points so that detailed routing can follow the resultant topology.

Global Routing
Global routing is a popular approach for routing-congestion analysis because of its fast turnaround time and very high correlation to actual routing results. In contrast, theoretical statistical analysis can also be an efficient method. However, its accuracy remains an open issue. Detailed routing provides the exact solution, but the details involved make incremental placement improvement extremely difficult.

The state-of-the-art approach to global routing is the multiple commodity flow algorithm. The routing of each net is conceived as a flow of each commodity. The cost of the routing is revised dynamically according to the congestion of the routing area. The router performs a rip-up and reroute iteration to minimize congestion. From the multi-commodity flow formulation, we can identify an upper bound of the solution under certain assumptions. In practice, an elegant implementation can find results approaching the bound.

The model of global-routing capacity is crucial to the accuracy of the routability estimation. A simple approach is to count the number of free tracks over a routing region as the routing capacity. However, not every interconnect wire is comprised of straight wire and take only a single track. A wire-segment jog or via placement to change layers affects at least two tracks of the routing resources. On the other hand, two small obstacles on two tracks may still allow a wire to go through, with detours.

According to current practice, if the congestion sparsely spreads over the chip, the detailed router will be able to complete the routing. However, if the congestion region is heavily congregated, the detailed router will have a difficult time trying to complete the routing.

Interconnect Optimization
Interconnect optimization involves the routing synthesis and buffer insertion. At the global routing stage, we search the tree topology. Buffer insertion dissects the tree topology into sets of smaller trees. For this situation, it helps to relay the topology defined by the global routing to the detailed routing.

The insertion of buffers has benefits in three different aspects. First, an added buffer on a branch shields the heavy load downstream on the branch from the rest of the tree. Second, buffers can be used to recover the slope of the signal’s transition edge and screen out the noise. Third, a designer can use buffers to boost driving power and reduce delay. A dynamic programming method can cover all situations and find an optimal solution with reasonably efficient execution time. In Interconnect Analysis and Synthesis the authors describe a Permutation Tree algorithm to combine topology and buffer-insertion searches. The method finds optimal solutions with constraints on the order of the tree topology. However, the complexity of the method is high for extra large nets.

Due to limited layout space, interconnect optimization has to balance routing congestion and interconnect performance. Using a primal and due formulation, we can set shadow prices for the congested area and nets on critical-timing paths. Interconnect optimization calls a rip-up and reroute procedure to reduce the unified cost function.

RLC Extraction
The RLC extraction calculates the electrical parameters of interconnect. At the global-routing stage, the exact coupling is unknown. One approach is to adopt a statistical model to predict coupling capacitance according to routing congestion. For each unit wire, we find the congestion on the layer of, the layers above, and the layers below the wire. We then derive the lateral, area, and fringe capacitances according to the coefficients attached with each layer using a formula containing a function of the congestion.

Timing Analysis
Timing analysis derives the signal delay and measures the timing-constraint violations. Basic capabilities of a timing-analysis tool include handling of false paths, multiple clock cycles, latches, and incremental analysis. For interconnect delay, conventional Elmore delay may not be accurate enough for uneven nets. When the tree structure of the net is lopsided, the Elmore delay model tends to be overly pessimistic on nodes near the driver due to shielding effect of downstream resistance. We need a high-order moment analysis for better accuracy. Recent work has studied the stability issue of the high-order moments including inductance. However, the application of delay analysis to measure quality of partially completed interconnect during the routing process remains an issue.

For a fast static analysis, gate delay is usually formulated as a function of input slew and output load capacitance. In order to account for the shielding effect of interconnect resistance, analysis uses a C-effective method to iterate between the exact RC tree of the interconnect and an effective load capacitance. The analysis tool then uses the effective load to pin down the delay model of gate delay. The tool then applies this delay model to the exact RC tree at the input. The derived delay of the RC tree is then fed back to adjust the effective load capacitance until the iteration converges.

To estimate delay due to the cross talk, interconnect coupling-capacitance is converted to self-capacitance with different weights for shortest and longest delay. The contribution of crosstalk to gate delay is complicated due to nonlinear models of active devices. The calculation of delay incorporating crosstalk of very large circuits remains a research issue.

Clock Optimization
Clock distribution complexity increases with system frequency. It is important to plan the clock signal distribution from the beginning of the design process. You need to view clock distribution on a global basis, including multiple phases, gated clocks, partial clock nets, and clock pins of macro cells. You need to budget distribution delay and clock skew, and architect the design’s clocking scheme.

Clock synthesis involves topology synthesis, routing, and buffer insertion. The topology synthesis places and clusters the registers, along with defining the level of the distribution and the topology of the clock distribution network. Buffer insertion involves finding the buffer sizes and buffer insertion locations. Clock routing makes connections according to the clock’s topology.

The goal of clock distribution is to achieve the design’s clock-propagation and clock-skew budgets with signal-delay variation as an objective function of the cost of routing space and buffer area. Delay variation is due to many factors, including wire-characteristics variation due to routing and fabrication, gate-response variation due to fabrication, thermal distribution, supply power changes, and noise. It is important to incorporate all these effects when deriving the clock distribution network.

The ultimate goal of clock distribution is to satisfy the design’s hold and setup time constraints. You can even allocate designated skew to maximize the tolerance of delay variations. Given a combinatorial block, the shortest signal delay from its input to its output defines the hold time constraint between the register at its output and the register at its input. The longest signal delay defines the setup time constraint. Hold and setup time constraints determine the lower and upper bounds of clock skew between the pair of input and output registers.

We define the designated skew by shifting the clock skew to the middle point between upper and lower bounds. The designated skew maximizes skew tolerance. However, the designated skew of one pair of registers affects the setup and hold time constraints of the registers adjacent to this pair. We can propagate the skew effect up to the chip pins or until the propagation paths close into loops. Suppose we formulate this problem as a directed graph, each node of the graph representing a register. The setup and hold time constraints between two registers are represented by a forward edge and a backward edge between two corresponding nodes, respectively. The shortest loop of the graph determines the optimal solution of the designated skew.

Clock distribution is strongly influenced by other synthesis processes such as logic optimization, interconnect optimization, and placement, since circuit performance determines setup and hold time constraints. Furthermore, register and buffer allocation compete with other components for placement resources. Consequently, clock synthesis involves more than the clock buffers and clock topology. You need to tightly couple clock synthesis with the rest of your synthesis processes.

Handling Coupling Noise for Signal Integrity
Coupling noise is a serious problem in deep-submicron designs. At the global routing stage, we can only predict the scale of coupling using routing congestion. You can define exact crosstalk coupling only after actual wire routing on the chip. A post-layout optimization step will try to perform gate sizing, buffer insertion, wire width and separation sizing, wire shielding, and rip-up and reroute within the layout.

Tool Management
Tool integration is a difficult task. The conventional approach of communicating between tools using files is too slow for tight tool integration. Adopting API becomes a viable approach for tools developed by different organizations. A tighter approach is to use the same data structure to communicate. In such cases, the core of the data is a superset of all design information. We need to design the software architecture so that the system can handle huge designs with limited memory space.

Throughout the design process, required information may not be complete. We need to organize the package so that each tool can operate correctly with incomplete information. For example, before global routing, you do not know a design's interconnect configuration. Interconnect timing analysis uses the best available knowledge to derive a proper model to calculate the delay using certain buffer insertions. Once the global routing is complete, timing analysis should take into account actual routing data to calculate delay.

Tool Adoption
The complexity of each individual tool is very high. For example, it takes several years for an EDA tool design team to polish a layout tool so that it is mature and stable. A tool that produces excellent results just on one test case may fail to give the same quality operating on the same design with just minor changes. While the design keeps changing, we need the same quality from the same tool throughout the life of the design project. As tool technology keeps progressing, the replacement of design tools also becomes a factor. Although designers use such tools to help do designs, tool usage along with design files form part of the design team's intellectual property. Consequently, it is important to embrace new tools gradually and still have viable alternative design strategies.

Acknowledgement
We acknowledge the grants from NSF 9987678 and the California MICRO program.




Please sign in to post comment

Navigate to related information

Datasheets.com Parts Search

185 million searchable parts
(please enter a part number or hit search to begin)

Feedback Form