Design Article
Converged telecom/network data center needs redundancy options
Jason Bailis, Product Line Manager, Telecommunications and Embedded Group, Intel Corp., New Jersey
5/20/2002 8:55 AM EDT
Several forces are at play in the move toward using cost-effective, commercial off-the-shelf (COTS) components in the new highly available telecom/datacom and networking system servers, routers and switches for the converged voice over IP.
COTS-based systems have been shown to drive down total system costs through economies of scale and increased interoperability. Nevertheless, the COTS approach introduces some interesting quandaries as well, since solutions still need to match the system availability, quality, and performance characteristics of legacy proprietary installations.
This dichotomy is challenging, but can be accommodated in a fairly straightforward way through the addition of sufficient redundancy.
But determining the precise effect that redundancy has on a system's aggregate availability is often not trivial. Availability calculations must account for the redundancy architecture employed (i.e., parallel vs. series), the success rate of failover to redundant components, the Mean Time to Repair (MTTR) of failed components, and the like, and they can become a laborious and error-prone endeavor.
The type and amount of component redundancy determine the downtime characteristics of the system. There are eight primary high availability architectures in use today, each with a different method of supplying the necessary redundancy: clustering, hardware fault tolerance, peripheral hot swap and redundancy, redundant system slot, cluster-in-a-box, integrated peripheral, packet switched backplane, and network routing.
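As a rough illustration of why these calculations get tedious, the following Python sketch composes component availabilities in series and in parallel and folds in a failover success rate. The function names, the independence assumption, and the sample MTBF/MTTR figures are illustrative assumptions, not values from any particular system.

```python
from math import prod


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one repairable component."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series(availabilities) -> float:
    """Every component must be up, so the availabilities multiply."""
    return prod(availabilities)


def parallel(availabilities) -> float:
    """At least one component must be up (failures assumed independent)."""
    return 1.0 - prod(1.0 - a for a in availabilities)


def redundant_pair(a: float, failover_success: float) -> float:
    """Active/standby pair of identical components.

    Simplified model: the pair is up if the active unit is up, or if it is
    down, the failover succeeds, and the standby unit is up.
    """
    return a + (1.0 - a) * failover_success * a


if __name__ == "__main__":
    # Hypothetical component: 9,990 h between failures, 10 h to repair.
    a = availability(9990, 10)
    print("single component :", a)
    print("two in series    :", series([a, a]))
    print("two in parallel  :", parallel([a, a]))
    print("pair, 95% failover success:", redundant_pair(a, 0.95))
```

With a perfect failover (success rate of 1.0) the pair model reduces to the plain parallel formula; any lower success rate pulls the result back toward the availability of a single component.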
Some of these architectures, such as clustering, are familiar to server system developers, while others are more common to developers who have worked mainly in the network and telecom/datacom services environments.
Clustering is a very common technique and probably the one most familiar to users and developers of Internet data centers or telecom service centers. In a clustering architecture, entire computers or systems are duplicated, so when a system in a cluster fails, the operations on that system are transferred to a spare system. The number of spare systems provisioned may vary from 2N, where there is a spare system for each provisioned system, to N+1, where there is a single spare for N systems.
The spare systems may be deployed in an active/standby mode, where the spare standby is in a ready-to-go, but currently idle, state. A more ambitious configuration is active/active, where all systems, including standbys, are in sync with each other's activities and dynamic load sharing may be possible. Active/active configurations are more difficult to implement, but they provide a financial payback if load sharing can be achieved: when all systems are operating, the total system capacity is maximized and hardware is not just sitting idle waiting for a failure.
The advantages of clustering are that it works with any PC-based system, accommodates the currently dominant PCI form factor, and uses standard network connections to keep systems informed of each other's status and state. Most importantly, clustering accommodates geographic diversity. In the case of a disaster such as a flood, fire, earthquake, or even terrorism, clustering allows for continued service availability.
The disadvantages of clustering, as typically cited, include the duplication of costly peripherals (such as media processing boards and Time Division Multiplex (TDM) connections) and the relatively long failover times (on the order of seconds, as opposed to milliseconds for some other approaches). The logistics of resynchronizing clusters after a system failure are also a drawback of this architecture: clusters often have to be taken offline in order to restore the necessary redundant state.
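The trade-off between 2N and N+1 sparing can be sketched numerically. The Python below is a simplified model that assumes independent failures, identical systems, and perfect, instantaneous failover; the function names and sample figures are hypothetical.

```python
from math import comb


def k_of_n_availability(k: int, n: int, a: float) -> float:
    """Probability that at least k of n independent systems are up."""
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))


def n_plus_1(n_active: int, a: float) -> float:
    """N+1: one shared spare, so N of the N+1 systems must be up."""
    return k_of_n_availability(n_active, n_active + 1, a)


def two_n(n_active: int, a: float) -> float:
    """2N: each system has a dedicated spare, so every pair needs one unit up."""
    pair = 1 - (1 - a) ** 2
    return pair**n_active


def systems_deployed(n_active: int, scheme: str) -> int:
    """Total number of systems to purchase under each sparing scheme."""
    return 2 * n_active if scheme == "2N" else n_active + 1


if __name__ == "__main__":
    n, a = 4, 0.99  # hypothetical: four working systems, each 99% available
    print("N+1:", systems_deployed(n, "N+1"), "systems, availability", n_plus_1(n, a))
    print("2N :", systems_deployed(n, "2N"), "systems, availability", two_n(n, a))
```

Under these assumptions 2N buys higher availability at the cost of nearly doubling the hardware, while N+1 tolerates only one failure at a time across the whole cluster.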
Hardware fault tolerance is the replication of the CPU processing logic, with the replicas executing the same instructions simultaneously and in lockstep. The outputs from the replicated fault-tolerant CPUs are compared to determine if there is a difference in results. With only two processors, a mismatch shows that a fault has occurred but not which CPU is errant, so triple modular redundancy (TMR) is typically employed. TMR, which uses three processors, allows for more effective fault isolation. If the output from one CPU does not match the output from the other two, that CPU is considered errant and is removed from service, and online repair can then take place.
The primary advantage of this architecture is protection against hardware malfunction with transparency at the application level. In other words, if a hardware malfunction is detected on a set of components, those components can be replaced quickly and easily without any special failover logic in the application-level software.
However, this architecture does not safeguard against software failures. Because every CPU runs the same code in lockstep, a software fault brings down all CPUs simultaneously, and software is quite often the most fragile part of a system.
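The voting logic described above can be sketched in a few lines of Python. This is an illustrative software model only (real TMR voters are implemented in hardware), and the function names are hypothetical.

```python
from collections import Counter


def dmr_compare(out_a, out_b) -> bool:
    """Two-way comparison: detects a mismatch but cannot say which CPU is wrong."""
    return out_a == out_b


def tmr_vote(outputs):
    """Majority vote over exactly three lockstep outputs.

    Returns (voted_output, errant_index), where errant_index is None when all
    three agree. Raises RuntimeError when no two outputs agree.
    """
    if len(outputs) != 3:
        raise ValueError("TMR requires exactly three outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count == 3:
        return value, None
    if count == 2:
        # The one output that disagrees with the majority identifies the errant CPU.
        errant = next(i for i, out in enumerate(outputs) if out != value)
        return value, errant
    raise RuntimeError("no majority: all three outputs differ")


if __name__ == "__main__":
    print(tmr_vote([0x2A, 0x2A, 0x2A]))  # all CPUs agree
    print(tmr_vote([0x2A, 0x7F, 0x2A]))  # the CPU at index 1 is outvoted
```

Note that the voter masks only faults that make one CPU disagree with the others; a software bug that all three execute identically produces three matching, equally wrong outputs.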
Peripheral hot swap (PHS) allows the online repair, upgrade, or addition of peripherals in a CompactPCI (cPCI) chassis, without the need to power down the entire system. While peripheral hot swap is effective in reducing the MTTR, it alone does not protect against operational downtime or the time taken to procure a spare and dispatch a craftsperson to make the repairs.
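To see why hot swap alone is not sufficient, the Python sketch below splits MTTR into hands-on repair time and logistics time (procuring a spare, dispatching a craftsperson) and converts the result into expected downtime per year. The function name and all figures are hypothetical assumptions chosen only to show the shape of the effect.

```python
HOURS_PER_YEAR = 8760


def annual_downtime_hours(mtbf_hours: float, repair_hours: float,
                          logistics_hours: float = 0.0) -> float:
    """Expected downtime per year for one non-redundant component.

    MTTR is modeled as hands-on repair time plus logistics time.
    """
    mttr = repair_hours + logistics_hours
    availability = mtbf_hours / (mtbf_hours + mttr)
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    mtbf = 50_000  # hypothetical peripheral MTBF in hours
    # Cold swap: chassis powered down for a 2 h repair, plus 24 h of logistics.
    print("cold swap          :", annual_downtime_hours(mtbf, 2.0, 24.0))
    # Hot swap: 0.1 h board exchange, but the same 24 h of logistics.
    print("hot swap           :", annual_downtime_hours(mtbf, 0.1, 24.0))
    # Hot swap with an on-site spare and technician: logistics drop to zero.
    print("hot swap + on-site :", annual_downtime_hours(mtbf, 0.1, 0.0))
```

In this model the hands-on repair time is a small part of MTTR, so shortening it helps little unless the logistics time shrinks too or a redundant peripheral carries the load during the wait.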
Redundant system slot (RSS) systems provide redundant, hot-swappable single board computers (SBCs) in a cPCI system. Such a system builds on the capabilities of a PHS cPCI system by eliminating the SBC as a single point of failure.
Each SBC has a separate instance of the operating system and the application. The SBCs may run in an active/standby mode, where the active SBC controls both cPCI bus segments in the chassis. If the active SBC goes down, the standby SBC takes over the processing of the failed SBC and assumes control of both cPCI bus segments. In the active/active mode, both SBCs are active and each controls its own bus segment. If one SBC goes down, the surviving SBC takes control of the failed SBC's bus segment and operation of the system continues.
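The two RSS modes differ only in who owns each bus segment before a failure. The following Python sketch models that ownership hand-off; the class, the SBC names, and the segment names are hypothetical labels for illustration, not part of any cPCI API.

```python
class RssChassis:
    """Toy model of a redundant-system-slot chassis with two SBCs and two bus segments."""

    def __init__(self, mode: str):
        self.healthy = {"SBC-A": True, "SBC-B": True}
        if mode == "active/standby":
            # The active SBC controls both segments; the standby controls none.
            self.owner = {"segment-1": "SBC-A", "segment-2": "SBC-A"}
        elif mode == "active/active":
            # Each SBC controls its own segment.
            self.owner = {"segment-1": "SBC-A", "segment-2": "SBC-B"}
        else:
            raise ValueError(f"unknown mode: {mode}")

    def fail(self, sbc: str) -> None:
        """Mark an SBC as failed and hand its bus segments to the survivor."""
        self.healthy[sbc] = False
        survivors = [name for name, ok in self.healthy.items() if ok]
        if not survivors:
            raise RuntimeError("both SBCs are down: chassis out of service")
        for segment, current_owner in self.owner.items():
            if current_owner == sbc:
                self.owner[segment] = survivors[0]


if __name__ == "__main__":
    chassis = RssChassis("active/active")
    chassis.fail("SBC-B")
    print(chassis.owner)  # SBC-A now controls both segments
```

In either mode the end state after a single SBC failure is the same: the surviving SBC owns both segments, and the chassis has no remaining SBC redundancy until the failed board is swapped.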
In the cluster-in-a-box (CIB) configuration, there are two or more logical systems in a cPCI chassis. Each logical system is a self-sufficient whole computer that contains its own independent cPCI and H.110 buses, its own SBC, peripheral cards, operating system, and applications. It is similar to combining clustering and peripheral hot swap in the same solution. The primary advantage of cluster-in-a-box is the cPCI form factor that allows peripherals to be hot-swapped on failure, which can drastically reduce MTTR over clustered architectures.
The benefit of an integrated peripheral is that it is a complete (host + peripheral), standalone computer in a slot. Each peripheral and its host processor are independent of the other peripherals that reside in the same chassis. When a fault occurs, it is isolated to a single peripheral card, and only that peripheral and its host need to be restarted or replaced.
Packet backplane configurations introduce a redundant high-speed packet bus right into the backplane of the system for high-bandwidth traffic such as control, media, or data. Such a backplane can replace and/or supplement the cPCI bus or TDM bus, improving throughput as well as availability. Packet switched backplane (PSB), as defined by PICMG 2.16, overlays a packet-based Ethernet architecture onto the cPCI backplane.
Network routing is an effective high availability configuration that reduces service outages by rerouting calls to entirely different installations. Some of the techniques used to survive outages in the network include reserve capacity, system diversity, geographic diversity, size limits, dynamic routing, restoration switching, self-healing protection switching, and others.
The network routing architecture is deployed widely today via the Intelligent Network (IN) built on top of SS7. It is also gaining momentum in the Internet build-out as the Internet strives for higher and higher overall availability, on par with the public switched telephone network. This architecture has a lot of promise and is an area of continued research in the next generation network.


See related chart
