Design Article
Timing Closure Using Layout Based Design Process
Chung-Kuan Cheng
5/4/2001 12:00 AM EDT
|
ABOUT THE AUTHOR
Chung-Kuan Cheng received a Ph.D. degree in
electrical engineering and computer sciences from University of
California, Berkeley in 1984. From 1984 to 1986 he was a senior CAD
engineer at Advanced Micro Devices. In 1986, he joined the
University of California, San Diego, where he is a Professor in the
Computer Science and Engineering Department, an Adjunct Professor
in the Electrical and Computer Engineering Department. He served as
a chief scientist at Mentor Graphics in 1999. Chung-Kuan Cheng can
be contacted at kuan@cs.ucsd.edu.
|
||
A successful timing closure design process starts with a clear timing specification. Convergent design iterations require tightly coupled performance-driven synthesis, circuit extraction, and timing analysis. The coupling between different design processes rests on physical layout information.
With the advent of deep submicron technologies
, the dominance of circuit
performance has shifted from logic to interconnect. Therefore,
designers have to shift the design paradigm from a conventional
logic-dominated to an interconnect-dominated design process. While
the integrated circuit is confined to a two-dimensional area, the
non-overlapping physical constraint forces component spreading.
Thus, physical layout determines component interconnect. In order
to achieve a desired timing-closure result, designers have to adopt
physical layout as the core of the whole design process 
The goal is to achieve timing
closure under the physical-layout constraints of geometry and
routing congestion. The objective is to obtain fast but consistent
high-quality designs incorporating many constraints.
One fundamental issue of timing closure is the modeling of physical overcrowding. The problem involves, among other factors, the representation and the handling of layout issues. These issues include placement congestion, overlapping of arbitrary-shaped components, routing congestion due to power/ground, clock distribution, signal interconnect, prefixed wires over components, and forbidden regions of engineering concerns. While a clean and universal mathematical model of physical constraint remains open, we tend to formulate the layout problem using multiple constraints with sophisticated details that complicates the implementation.
We need to consider multiple constraints with a unified objective function for a timing-closure design process. This is essential because many constraints are mutually conflicting if we view and handle their effects only on the surface. For example, to ease the routing congestion of a local area, we tend to distribute components out of the area to leave more room for routing. However, for multi-layer routing technology, eliminating components does not save much on routing area. The spreading of components actually increases the wire length and demands more of routing space. The resultant effect can have a negative impact on the goals of the original design.
In fact, the timing can become much worse. Consequently, we need an intelligent operation that identifies both the component to move out and the component to move in to improve the design.
Throughout the design process, the design keeps on changing. Some logic descriptions are not complete in the beginning. Some descriptions are composed of functional blocks with exact physical layoutthese blocks are called hard blocks. One hard block may correspond to many different implementations. We therefore have to make the best choice among a set of configurations of block width, height, and performance.
Some logic descriptions correspond to a netlist of standard gates. These logic functions will be implemented by standard-cell placement and routingsoft blocks. In floorplanning, soft blocks are allocated space for standard-cell layout. We allocate the area using estimation based on the most updated knowledge of the blocks as the design progresses.
To achieve a timing-convergent operation, designers need to
obtain feedback regarding their choice of architecture. As the
design changes, the design tool revises the budget of area and
timing according to physical constraints. In other words, component
physical dimensions, interconnect widths, interconnect spacing, and
clock and power/ground distribution are as important as the
design’s functional description
. Right from the early
architectural design stage, we need to generate physical layout
with global routing, interconnect optimization, RLC extraction, and
timing analysis, all with incremental-change capability to
dependably derive the design’s clock-cycle period.
Designers have to document a clear specification of timing requirements that describe the desired clock cycle time, number of cycles, false paths, and multiple cycle paths. Although a designer will revise a timing specification throughout the design process, it is important to propose a blueprint as early as possible. The specification constitutes a guideline of the timing-convergent process. Adding a timing requirement as an afterthought tends to be error prone and takes even more effort to remedy a problem.
Floorplanning searches for a physical-layout solution at a high level of the design hierarchy. Circuit size increases along the path of Moore's law and circuits with millions of transistors are becoming a normal occurrence. On the other hand, the design-issue complexity also grows exponentially due deep-submicron process technology. In order to divide and conquer the complexity, the design is handled in a hierarchy fashion.
Interconnect architecture planning is important at the top level
of the hierarchical design. Because of the scale of
system-on-a-chip (SoC) designs, floorplan variations have a
significant effect on the timing of interconnect between functional
blocks. The design of clocking strategy, communication protocols,
number of buses, bus organizations, and functional partitioning
along with the buses is strongly affected by the design’s
physical structure. Designers thus need to search at the
floorplanning stage for the best choice under the constraints of
timing, routability, and power dissipation. The floorplanning
routes the topology and defines the spacing of clock, power/ground,
and bus interconnect lines. Therefore, the impact of clock skew, IR
drop, and signal delay is judged at the floorplanning stage of a
design. One conservative approach even fixes the wires at this
stage
This approach ensures the timing
of the critical nets with consideration of all details, including
crosstalk.
For timing-driven floorplanning, we have encountered enormous obstacles trying to handle arbitrarily sized blocks with artificial virtual pins. The problem is more than assembling puzzles since along with shape and area considerations there are other constraints, such as interconnect and timing. To simplify the block arrangement, one practice is to limit the layout into a slicing structure. For a slicing structure, we assume that the topology of the floorplan can be defined by a recursive slicing of the chip area with straight horizontal or vertical lines. For example, the whole chip is first divided into two rectangles by a vertical straight line. Then, each rectangle is divided into two rectangles by a horizontal line, and so on. The simplification allows the use of elegant algorithms with efficient execution time.
On the other hand, in order to achieve leading-edge competitive designs, a non-slicing structure is the way to compact the chip and optimize performance. The non-slicing structure covers the topology that cannot be divided by straight lines recursively as you can do with the slicing structure. One important issue of a non-slicing structure is its representation. A conventional method represents the physical relation between adjacent blocks with horizontal and vertical constraint graphs. However, the manipulation of the constraint graphs complicates the design’s implementation.
Recently, Murata et al
introduced a sequence-pair
representation that describes the floorplan with two sequences of
block labels. This method simplifies the representation, but
results in a high-complexity translation between the representation
and the floorplan. Furthermore, the representation intrinsically
covers a huge redundancy. Many representations correspond to the
same floorplan. Following up the sequence-pair introduction, Guo et
al
devised a rooted ordered-tree
representation, which is concise and efficient in terms of
redundancy and transformation. This method demonstrates performance
that is competitive with the quality of manual layout in terms of
area and wire length. With all these developments, floorplanning
becomes an excellent research topic to explore.
Buffer insertion and gate sizing are two effective approaches to improve interconnect delay. However, buffer insertion or deletion changes the netlist. The design system needs to account for this insertion/deletion effect in the design’s hierarchy.
We can also take advantage of equivalent pins to improve the
timing
. This operation does not
involve gate changes; instead, the operation identifies the set of
pins that can be swapped with the equivalent logic function.
Usually, the technique of testing is adopted to find the cluster of
equivalent pins. In large systems, we observed many AND, OR, or XOR
logic functions with a large number of inputs termed as
super-gates. We implement these functions with trees of smaller
gates. Note that the super-gate structure may be obscure due to a
logic transformation, such as DeMorgan's rule. Once we identify the
super-gate, we can swap the pins to improve the timing.
The set of equivalent pins of super-gates covers not only the primary inputs of the super-gates but also the internal pins. The swapping involving the internal pins changes the tree structure of the super-gate to improve circuit performance.
Placement Quality
For the choice of placement approaches, quality comes before other
features. Placement quality greatly influences timing and
routability. Interconnect length cannot be shorter than the
physical distance between the two components. Thus, it is not wise
to push for execution-time efficiency and sacrifice design quality.
On the other hand, the placement algorithm needs to be fast enough
to respond to the request of other operations so that the placement
does not become the bottleneck of the design process.
Timing-Driven Placement
Timing-driven placement optimizes the timing with a routability
constraint. Timing is determined by the longest and shortest path
delays between chip pins and registers, and registers-to-register.
One mathematical approach is to adopt the mechanism of the Lagrange
multiplier (shadow price) of the constraints. The subcircuits that
violate the timing constraints are assigned heavier weights. The
weights of timing together with the weights of routability and area
congestion provide a placement metric to search for a better
solution. The net weights are revised dynamically to reflect the
overall effect on the timing constraint as placement changes. When
the iteration converges, the derived solution is optimal.
The success of timing-driven placement relies on the quality of the placement and the accuracy of the timing model. The quality of the placement is crucial to the speed of the circuit. If the placement cannot allocate the space and interconnect well, there are no space resources left for other optimizations to manipulate. This leaves little hope to recover circuit speed by other processes.
In terms of the accuracy of a timing model, if the timing model
“misses” design-information details, such as routing
topology, capacitance coupling, false paths, or the effect of
perturbation on neighboring objects, the timing-driven effort may
prove useless. On the other hand, because the timing-driven
placement perturbs placement at the expense of wire length and
routability, longer wire length and higher routing congestion is
also detrimental to circuit speed. A study
shows that area cost vs. speed
for interconnect optimization is a function of diminishing return.
As the signal delay decreases, it costs more routing area and
buffers to gain lesser signal speed. After a certain point, it
costs a lot of resources for very little gain. Furthermore, the
added buffers and routing area may require incremental placement to
adjust the layout. Suppose the delay change due to placement
perturbation is larger than the estimated gain. For this case, the
extra routing area and buffers may not help the signal but instead
cause congestion and delay of other signals.
Incremental Placement
Incremental placement is necessary not only for engineering changes
by designers but also for the optimization process. During logic
optimization, clock and power/ground distribution, and interconnect
optimization, as we insert buffers, size gates, and adjust routing
space, area congestion and routability change. We need to revise
the placement to accommodate these changes. Although designers want
to minimize the perturbation in incremental placement, it is
important to keep the timing-driven objective function as a guide.
The incremental placement has to tie strongly to the optimization
process. If the overall effect of the optimization after
incremental placement is worse than before the placement, we should
recover the design prior to the incremental placement and pursue
other optimization operations.
Artificial Constraints
It is important to keep the true object function and communicate
among the optimization processes. However, there are cases, due to
the lack of an established mechanism, where people adopt artificial
constraints and hope the constraints can help achieve certain
design goals. For example, region constraint is used to keep a set
of components within a specified area. If this set of components is
connected with signals critical to the clock cycle period, the
region constraint helps in the sense that it enforces this set of
critical signals within the dimension of the region. Corresponding
to region constraint, a group constraint gathers the set of
components within a certain distance but without specifying an
exact region. Since the placement is a complicated task, a
well-planned region constraint can usually help the tool to find
better placement solutions. On the other hand, an ill-informed
constraint could be detrimental because the region constraint does
not represent an exact timing requirement. As the research and
development of timing-driven placement progresses, we will see less
dependency on such artificial constraints. However, it will take a
long time for these constraints to disappear unless the design
tools can exactly comprehend all of a designer’s
requirements.
Macrocells and Datapaths
A mixed-mode placement of macrocells, datapaths, and standard cells
remains a research problem. The problem involves placing blocks of
arbitrary sizes and shapes, a huge number of standard cells, and
embedded datapath structure with certain arrangements favored by
the designers. In practice, we fix the macrocell before performing
the standard-cell placement. However, as the number of macrocells
increases, the complexity of the whole process increases. There is
room for further circuit improvement if we can perform placement of
all components simultaneously.
It is desirable to arrange a datapath block according to the flow of the data to optimize circuit speed. The datapath information may be passed from a module compiler or extracted from the netlist according to certain rules. A good placement tool can usually cluster components in a datapath according to its structure. However, the tool may not be able to organize the datapath into structured arrays because of interaction with other components outside the datapath. On the other hand, array assignment is not always the optimal solution for a datapath. First, the datapath netlist may not be regular due to an automatic logic-synthesis process. Second, the datapath may be placed in an odd shaped area with no room for an exact rectangular shape. Third, the datapath may need to communicate with sources and destinations spread in different places. An exact rectangle placement may be good for internal interconnect, but harmful for such external communication. Therefore, datapath placement remains an interesting research topic.
Global Routing
Global routing is a popular approach for routing-congestion
analysis because of its fast turnaround time and very high
correlation to actual routing results. In contrast, theoretical
statistical analysis can also be an efficient method. However, its
accuracy remains an open issue. Detailed routing provides the exact
solution, but the details involved make incremental placement
improvement extremely difficult.
The state-of-the-art approach to global routing is the multiple commodity flow algorithm. The routing of each net is conceived as a flow of each commodity. The cost of the routing is revised dynamically according to the congestion of the routing area. The router performs a rip-up and reroute iteration to minimize congestion. From the multi-commodity flow formulation, we can identify an upper bound of the solution under certain assumptions. In practice, an elegant implementation can find results approaching the bound.
The model of global-routing capacity is crucial to the accuracy of the routability estimation. A simple approach is to count the number of free tracks over a routing region as the routing capacity. However, not every interconnect wire is comprised of straight wire and take only a single track. A wire-segment jog or via placement to change layers affects at least two tracks of the routing resources. On the other hand, two small obstacles on two tracks may still allow a wire to go through, with detours.
According to current practice, if the congestion sparsely spreads over the chip, the detailed router will be able to complete the routing. However, if the congestion region is heavily congregated, the detailed router will have a difficult time trying to complete the routing.
Interconnect Optimization
Interconnect optimization involves the routing synthesis and buffer
insertion. At the global routing stage, we search the tree
topology. Buffer insertion dissects the tree topology into sets of
smaller trees. For this situation, it helps to relay the topology
defined by the global routing to the detailed routing.
The insertion of buffers has benefits in three different
aspects. First, an added buffer on a branch shields the heavy load
downstream on the branch from the rest of the tree. Second, buffers
can be used to recover the slope of the signal’s transition
edge and screen out the noise. Third, a designer can use buffers to
boost driving power and reduce delay. A dynamic programming method
can cover all situations and find an optimal solution with
reasonably efficient execution time. In Interconnect Analysis and
Synthesis
the authors describe a
Permutation Tree algorithm to combine topology and buffer-insertion
searches. The method finds optimal solutions with constraints on
the order of the tree topology. However, the complexity of the
method is high for extra large nets.
Due to limited layout space, interconnect optimization has to balance routing congestion and interconnect performance. Using a primal and due formulation, we can set shadow prices for the congested area and nets on critical-timing paths. Interconnect optimization calls a rip-up and reroute procedure to reduce the unified cost function.
. However, the application of
delay analysis to measure quality of partially completed
interconnect during the routing process remains an
issue.
For a fast static analysis, gate delay is usually formulated as
a function of input slew and output load capacitance. In order to
account for the shielding effect of interconnect resistance,
analysis uses a C-effective method to iterate between the exact RC
tree of the interconnect and an effective load capacitance
. The analysis tool then uses
the effective load to pin down the delay model of gate delay. The
tool then applies this delay model to the exact RC tree at the
input. The derived delay of the RC tree is then fed back to adjust
the effective load capacitance until the iteration converges.
To estimate delay due to the cross talk, interconnect coupling-capacitance is converted to self-capacitance with different weights for shortest and longest delay. The contribution of crosstalk to gate delay is complicated due to nonlinear models of active devices. The calculation of delay incorporating crosstalk of very large circuits remains a research issue.
Clock synthesis involves topology synthesis, routing, and buffer insertion. The topology synthesis places and clusters the registers, along with defining the level of the distribution and the topology of the clock distribution network. Buffer insertion involves finding the buffer sizes and buffer insertion locations. Clock routing makes connections according to the clock’s topology.
The goal of clock distribution is to achieve the design’s clock-propagation and clock-skew budgets with signal-delay variation as an objective function of the cost of routing space and buffer area. Delay variation is due to many factors, including wire-characteristics variation due to routing and fabrication, gate-response variation due to fabrication, thermal distribution, supply power changes, and noise. It is important to incorporate all these effects when deriving the clock distribution network.
The ultimate goal of clock distribution is to satisfy the design’s hold and setup time constraints. You can even allocate designated skew to maximize the tolerance of delay variations. Given a combinatorial block, the shortest signal delay from its input to its output defines the hold time constraint between the register at its output and the register at its input. The longest signal delay defines the setup time constraint. Hold and setup time constraints determine the lower and upper bounds of clock skew between the pair of input and output registers.
We define the designated skew by shifting the clock skew to the middle point between upper and lower bounds. The designated skew maximizes skew tolerance. However, the designated skew of one pair of registers affects the setup and hold time constraints of the registers adjacent to this pair. We can propagate the skew effect up to the chip pins or until the propagation paths close into loops. Suppose we formulate this problem as a directed graph, each node of the graph representing a register. The setup and hold time constraints between two registers are represented by a forward edge and a backward edge between two corresponding nodes, respectively. The shortest loop of the graph determines the optimal solution of the designated skew.
Clock distribution is strongly influenced by other synthesis processes such as logic optimization, interconnect optimization, and placement, since circuit performance determines setup and hold time constraints. Furthermore, register and buffer allocation compete with other components for placement resources. Consequently, clock synthesis involves more than the clock buffers and clock topology. You need to tightly couple clock synthesis with the rest of your synthesis processes.
Throughout the design process, required information may not be complete. We need to organize the package so that each tool can operate correctly with incomplete information. For example, before global routing, you do not know a design's interconnect configuration. Interconnect timing analysis uses the best available knowledge to derive a proper model to calculate the delay using certain buffer insertions. Once the global routing is complete, timing analysis should take into account actual routing data to calculate delay.
We acknowledge the grants from NSF 9987678 and the California MICRO program.



