Backplane health rests on fault finding
Tom DeLurio and George Hall
10/10/2003 1:01 PM EDT
Backplanes and motherboards responsible for delivering and distributing power to multiple card systems must be immune to individual card failures that could jeopardize reliable system operation. Although backplane designers take many precautions to avoid this mode of failure, particular attention must be paid to system card design in order to isolate failures to that card alone: Failures allowed to spread into adjacent cards or the backplane can easily bring an entire system down. A way must be found to cordon off faults at the source in order to maintain system uptime.
Additionally, a fault must generate an alert so repairs may be made. Intelligent power management must be designed on the system cards to control, monitor and report the health of the power subsystems. It must also include the means of recording faults and generating alerts.
The comprehensive design of a power system requires conversion of the -48-volt bus to multiple voltages typically required by communication equipment such as network processors, DSPs and miscellaneous ASICs. We'll review reliability concerns and then discuss the design of both power-management and power-conversion blocks, focusing on areas where difficulties often arise.
There are many sources and mechanisms of system failure; faults that cause it are classified by origin or duration. A fault characterized by origin may be caused by incorrect design, environmental factors, physical defects or incorrect use (such as operator error). The most common ones are incorrect usage and component mortality.
The duration of the fault can be permanent or transient. A permanent fault generates errors over a period of time coincident with the system's life span; a transient one generates errors for a period significantly shorter than the system's recovery requirements. If a transient fault occurs repeatedly, its detection is desirable. High-availability systems move through a typical sequence of events during system failure and recovery. This sequence consists of fault detection, diagnosis, confinement, retrying or masking, compensation, and repair and reintegration.
- Fault detection uses a combination of testing, monitoring and result comparison from redundant operations and occurs either off- or online.
- Diagnosis determines the cause and provides information about the failure location or properties or both.
- Confinement isolates the faulty component from the rest of the system and prevents further propagation of a fault and its effects.
- Retry and masking techniques ensure that only correct information gets passed on in spite of a failed component.
- Fault compensation occurs when the system provides additional responses to compensate for the output of the faulty component.
- Repair and reintegration of the failed component into the system without interruption completes the sequence.
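The recovery sequence listed above can be pictured as an ordered set of stages that a card-management routine walks through. The following is only a minimal sketch: the names `Stage`, `next_stage` and `run_sequence` are hypothetical, and the handlers are placeholders supplied by the caller rather than anything described in this article.

```python
from enum import Enum


class Stage(Enum):
    # Stage names follow the sequence described in the text.
    DETECTION = 1
    DIAGNOSIS = 2
    CONFINEMENT = 3
    RETRY_OR_MASK = 4
    COMPENSATION = 5
    REPAIR_AND_REINTEGRATION = 6


def next_stage(stage):
    """Return the stage that follows `stage`, or None after the last one."""
    members = list(Stage)
    idx = members.index(stage)
    return members[idx + 1] if idx + 1 < len(members) else None


def run_sequence(handlers):
    """Call the handler for each stage in order and return the stages visited.

    `handlers` maps a Stage to a zero-argument callable (hypothetical hooks);
    stages without a handler are simply passed through.
    """
    visited = []
    stage = Stage.DETECTION
    while stage is not None:
        handlers.get(stage, lambda: None)()
        visited.append(stage)
        stage = next_stage(stage)
    return visited
```

A real system would branch on the outcome of each stage (for example, skipping repair when a retry succeeds); the linear walk here only shows the ordering.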
A new approach for increasing reliability is to manage all parts of the system's power chain and standardize on a consistent, modular architecture. In this manner, all the cards and platforms a manufacturer designs and produces can benefit from the data derived from each failure. This approach is now possible using integrated circuits that combine hardware and software solutions and output data on an industry-standard bus. The stringent reliability requirements call for devices that monitor voltage, current and temperature on the individual cards and that manage soft-start, hot swap, reset control, supply sequencing/tracking and voltage control. The same devices must handle status monitoring and reporting, fault diagnostic recording, environmental monitoring and active control of the dc outputs.
This confines problems to an individual card, which can then be safely disabled and replaced before failure without causing system downtime. Further, analysis of failures can be used to refine the card design. An example of such a power-management chain is a system card using a point-of-load (POL) architecture. POL is displacing the traditional distributed-power architecture, in which power is distributed across the board from four or more isolated stepdown (buck) dc/dc converters. Instead, the POL architecture uses a single -48-V isolated dc/dc converter that is hot-swapped into a -48-V supply and outputs a quasiregulated intermediate voltage (+5, +8 or +12 V). The intermediate voltage is then bused to single or multiple nonisolated POL dc/dc converters, switching regulators or low-dropout regulators to regulate and control the supply voltage at the load.
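To make the two-stage POL chain concrete, the sketch below estimates the power a card draws from the -48-V bus for a given set of loads. All numbers are assumed for illustration (the load wattages and both efficiency figures are not from this article), and the function name `input_power` is hypothetical.

```python
def input_power(load_watts, pol_efficiency, bus_converter_efficiency):
    """Estimate power drawn from the -48-V bus for a set of POL loads.

    Each load is fed by a nonisolated POL stage, and all POL stages are fed
    by a single isolated bus converter, so the two efficiencies multiply.
    """
    intermediate = sum(w / pol_efficiency for w in load_watts)
    return intermediate / bus_converter_efficiency


# Assumed example values, not taken from the article.
loads = [10.0, 5.0, 2.5]  # watts delivered at each point of load
p_in = input_power(loads, pol_efficiency=0.90, bus_converter_efficiency=0.92)
bus_amps = p_in / 48.0    # approximate current magnitude on the -48-V feed
```

The point of the arithmetic is that losses compound across the isolated converter and the POL stages, which is one reason the current into the card is worth monitoring at the hot-swap stage.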
This is a new concept being introduced in products designed for systems such as blade servers and AdvancedTCA platforms that require efficient power management to help increase the reliability of data communications equipment. Increasing the number of cards with the same architecture also increases the statistical sample size, allowing identification and elimination of failures with very low rates of occurrence.
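The sample-size argument can be checked with a short calculation. Assuming, purely for illustration, that each card fails independently with the same small probability over some observation period, the chance of seeing at least one such failure grows quickly with the number of cards deployed. The function name and the numbers below are hypothetical.

```python
def prob_at_least_one(p_fail, n_cards):
    """Probability that at least one of n_cards shows the failure.

    Assumes independent cards with identical per-card failure probability
    p_fail over the observation period (an idealization).
    """
    return 1.0 - (1.0 - p_fail) ** n_cards


# Assumed rare failure mode: 1 in 10,000 cards per observation period.
few = prob_at_least_one(1e-4, 100)
many = prob_at_least_one(1e-4, 50_000)
```

With only a hundred cards the failure mode would usually go unobserved; with tens of thousands it is very likely to appear at least once and can then be diagnosed and designed out.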
Perhaps the primary challenge facing communications engineers is to maintain system operation during system card hot-swapping. The hot-swap function has historically focused on letting individual system cards be inserted and removed under power, without powering down the system, to allow for easy servicing. It must now also prevent a malfunctioning card from disrupting other system cards.
Hot-swap controller
Any board or circuit connected to the -48-V supply must not cause any disturbance to the bus. A hot-swap controller (SMH4804) is used to:
- Permit live card insertion by soft-starting the -48-V live insertion current.
- Shut down the -48-V power on the native board when an overcurrent or other fault jeopardizes the bus or native card.
- Permit orderly power-on/-off sequencing of the dc/dc converters. This includes primary-to-secondary voltage isolation using optoisolators or other nongalvanic devices with the required primary-to-secondary breakdown voltage rating.
The most basic implementation of the hot-swap function must provide card insertion detection and -48-V soft-start current limiting. It should also provide advanced fault-detection functionality capable of monitoring primary-side voltage for over-/undervoltage conditions as well as the current into the system card. A hot-swap controller for data communications applications requiring isolated dc/dc converters must also address the increasing power requirements and complexity of POL power systems. Programmable analog technology is used so board designers can have flexibility with no need for excessive external components, which affect reliability. Also, most traditional fault-sensing techniques are prone to inadvertent activation during unusual events, such as initial powering of the card or insertion of other cards in the rack.
The new devices ignore spurious events, reacting only to actual faults. These components also allow a system card to be inserted into a live backplane and eliminate any possible disruption of system operation. Upon card insertion, the hot-swap controller monitors the input voltage, ensuring that it is within its valid range, and checks the pin detect inputs for proper card insertion. Programmable delay times ensure that power is not applied during contact bounce.
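The insertion checks described above (input voltage in range, pin detects seated, and a delay to ride out contact bounce) can be modeled in a few lines. This is a behavioral sketch only: the class `InsertionMonitor`, its threshold values and the sample-count delay are assumptions for illustration, not the behavior or register settings of any particular controller.

```python
from dataclasses import dataclass, field


@dataclass
class InsertionMonitor:
    uv_limit: float = 36.0   # assumed undervoltage limit, volts (magnitude)
    ov_limit: float = 72.0   # assumed overvoltage limit, volts (magnitude)
    delay_samples: int = 5   # assumed debounce delay, in consecutive samples
    _stable: int = field(default=0, init=False)

    def sample(self, vin, pins_seated):
        """Feed one reading; return True once power may be applied.

        The stable counter restarts whenever the pins bounce open or the
        input voltage leaves its valid window.
        """
        ok = bool(pins_seated) and self.uv_limit <= abs(vin) <= self.ov_limit
        self._stable = self._stable + 1 if ok else 0
        return self._stable >= self.delay_samples
```

For example, with `delay_samples=3`, three consecutive good readings are needed before `sample` returns True, and a single bounce in between restarts the count.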
Detection, turnoffs
The hot-swap controller block turns off the card if a fault is detected on the -48-V side, such as overcurrent or loss of regulation of the primary supply voltage. A forced-shutdown input allows the card to be turned off if a fault is detected on the secondary side. In either case, the fault is isolated from the rest of the system to prevent it from propagating to other system cards. Overcurrent or circuit breaker functions include selectable quick-trip current values and duty cycles. A programmable nonvolatile circuit breaker can be used to prevent power from being reapplied to a card that has previously had an overcurrent fault.
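The circuit-breaker behavior can be sketched as follows. The model is an assumption-laden simplification: it treats the quick-trip level as an immediate shutdown, replaces the selectable duty cycle with a simple count of tolerated overcurrent samples, and uses a plain attribute (`latched`) to stand in for the nonvolatile fault bit. The class name and all parameters are hypothetical.

```python
class CircuitBreaker:
    def __init__(self, trip_amps, quick_trip_amps, max_over_samples,
                 latched=False):
        self.trip_amps = trip_amps                # sustained overcurrent level
        self.quick_trip_amps = quick_trip_amps    # immediate shutdown level
        self.max_over_samples = max_over_samples  # tolerated overcurrent time
        self.latched = latched  # stands in for a nonvolatile fault bit
        self._over = 0

    def sample(self, amps):
        """Feed one current reading; return True if power may stay on."""
        if self.latched:
            return False  # a previous fault blocks reapplying power
        if amps >= self.quick_trip_amps:
            self.latched = True
            return False
        if amps >= self.trip_amps:
            self._over += 1
            if self._over >= self.max_over_samples:
                self.latched = True
                return False
        else:
            self._over = 0
        return True
```

Constructing the breaker with `latched=True` mimics a card that is reinserted after an earlier overcurrent fault: power stays off until the latch is deliberately cleared.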
In the example, the hot-swap device controls one converter, enabling it after a programmed time period. A sequence timer input can be enabled, allowing forced shutdown of the -48-V switched source in the event a fault is detected on the secondary side of the dc/dc converter.
Communication between the secondary side and the hot-swap controller is essential in power-managed designs. The device is programmed through a standardized I2C interface, which allows the designer to optimize the various parameters. This standardized interface allows the device to be programmed in-system, eliminating the need for external programming. Programmable design tools provide customizable options for varying power-management requirements of system cards.
Tom DeLurio is director of applications engineering and George Hall is staff applications engineer at Summit Microelectronics Inc. (San Jose, Calif.).


See related chart
