Configurable Platform-Based SoC Design Techniques, part 1
Bill Cordan, vice president, SoC services and Jon Udell, senior software engineer
3/12/2001 12:17 PM EST
Palmchip Corp.
www.palmchip.com
Introduction
System-on-chip (SoC) developments are under pressure from two sides. Market pressures such as increasing features, decreasing time-to-market, falling prices and narrowing market windows are combined with technology pressures including increasing complexity, gate count and clock speed. Along with reduced power requirements, these pressures are forcing designers to consider greater intellectual property (IP) reuse. However, block reuse is no longer enough to meet these challenges; more and more, designers are turning to platform-based SoC design methodologies to reduce time-to-silicon (a more measurable result than time-to-market). Application-specific SoC integrated platforms can speed development considerably, at the cost of reduced uniqueness and fewer value-added, market-differentiating features. To alleviate this problem and bring added value to SoC products, configuration of IP blocks and platforms will be needed.
Evolution of platform-based design
Initially, ASICs were used to replace glue logic. These were assembled manually as a schematic at the transistor or gate level with almost no reuse of previously used logic or functions. With the rise of RTL languages such as Verilog and VHDL, EDA tools emerged to simulate and synthesize logic from an RTL description to a gate-level netlist. Reusability emerged in cell-based libraries and portions of reusable HDL code. The ability to reuse HDL functional code from one design to the next led to the beginnings of a block-based design methodology. Blocks could be described in RTL, synthesized into gates and laid out in a physical implementation as virtual components (VC) referred to as soft, firm or hard cores.
The advancement of process technology, from 0.5 micron only a few years ago to nearly 0.13 micron today, has opened up a significant number of new applications that can be integrated onto a single chip. Complexities of under 200,000 gates are now moving to 1 million-plus gates, with 10 million gates in sight. With this increased complexity, it would be a challenge simply to maintain the 12- to 18-month design cycles of a few years ago. However, demand in consumer and communications products for new features and capabilities is driving market windows down; the upshot is that those 12- to 18-month design cycles are now approaching eight to 10 months, with derivative products requiring even shorter introduction times. Consumers are demanding more functionality in smaller packages at a lower price, which is leading to the requirement for full systems to be integrated onto a single chip, known as system-on-chip, or SoC.
The definition of SoC as stated by the market-research firm Dataquest Inc. ranges from an IC with an embedded processor, memory and a minimum of 100,000 gates of logic to a complex IC that integrates the major functional elements of a complete end product into a single chip or chip set. Typically, an SoC product contains at least one embedded programmable processor, on-chip memory and additional functional blocks with off-chip interfaces to memory and real-world communications, all framed by an SoC integration architecture or busing scheme. SoC designs include both hardware and software components that together implement the desired system functionality. Examples of SoC applications include cellular phones, PDAs, set-top boxes, portable consumer and Internet appliances, automotive engine controllers and network switches.
To meet the demands of SoC, reusability must encompass greater amounts of IP. Block-based reuse has yielded to subsystem reuse, and platform-based reuse is now emerging. Platform-based design offers high productivity through extensive intentional reuse of known verified VCs that have already been integrated into a base SoC integration platform. Using this platform, or application-specific SoC integration platform, follow-on derivative products are created by adding or replacing the blocks that implement the derivative feature sets.

To be effective, a platform-based methodology has to bring a significant element of plug-and-play to follow-on developments. That is, it is necessary to minimize any redesign of the base platform elements or add-on cores to achieve new follow-on products. To achieve this, platforms must have been built on a foundation of the following elements:
- VCs designed to a standard interface that interconnects cores to the platform
- An efficient SoC integration architecture serving as the backbone of the platform to which VCs are integrated into the system
- A method of configuring architectural components as well as VCs through parameterization that minimizes or eliminates the need for manual modification
- High-level or behavioral-level models to facilitate both system verification and hardware/software codesign
- A system-level hardware prototyping environment to evaluate, verify and develop VCs and embedded system software
Interfacing and buses
To connect blocks or VCs into any integration platform, the first requirement is to have those blocks meet a standard interface. The industry consortium known as Virtual Socket Interface Alliance (VSIA) was established in 1996 to develop or adopt standards for designing and integrating reusable IP blocks, which we now know as virtual components. It has proposed a standard VC interface (VCI) that allows peripheral blocks to connect to on-chip buses. It became apparent early in the working group's efforts that either selecting an existing bus or defining a new standard bus would not be a viable approach because of adherence to existing buses and legacy cores that tie to those buses. Therefore, the working group decided to define an interface rather than a bus that can be used as a point-to-point connector or as an interface to a bus interface module. As this discussion proceeds, it will be apparent that although a standard interface can be useful, it is not the full solution necessary to implement an effective integration platform. An SoC integration architecture should be able to accommodate cores that meet the popular bus interfaces such as VCI, ARM's AMBA and IBM's CoreConnect to facilitate plug and play of legacy cores. Some performance trade-offs may be necessary to allow connection of legacy interfaces; however, the time saved by that convenience may be acceptable and must be evaluated for each individual case.
VCI requires a wrapper to connect the VC to the VCI and additional logic to connect the VCI to the interface logic of the on-chip bus scheme being used. Fig. 2 illustrates this concept.

There are three complexity levels for VCI: Advanced VCI (AVCI), Basic VCI (BVCI) and Peripheral VCI (PVCI). Currently, the VSIA has only defined the basic and peripheral VCI standards; the AVCI will be published at a later date. More information on the VCI standard can be obtained from the VSIA Web site www.vsi.org.
It is important to remember that the VCI is an interface rather than a bus or integration architecture. Though it specifies protocols for the transfer of requests and responses between VCs, it does not address areas such as bus allocation schemes or competing for the bus. These are generally addressed by the bus or integration architecture and any effective platform-based development should standardize on an integration architecture that will effectively meet and expand upon the SoC application platform requirements.
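The wrapper arrangement described above can be pictured with a small behavioral model. This is only a sketch under stated assumptions: the Request and Response fields, the LegacyTimer core and its register offsets, and the address window are all hypothetical and are not taken from the VSIA specification. It shows a core with a native interface, a wrapper that presents a request/response interface, and bus-side logic that maps a bus address window onto that interface. Note that nothing here arbitrates between competing masters; that remains the job of the bus or integration architecture.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical, simplified request (not the VSIA signal list)."""
    address: int
    write: bool
    data: int = 0


@dataclass
class Response:
    data: int
    error: bool = False


class LegacyTimer:
    """A core with its own native register interface (names are made up)."""

    def __init__(self):
        self.regs = {0x0: 0, 0x4: 0}

    def poke(self, offset, value):
        self.regs[offset] = value

    def peek(self, offset):
        return self.regs[offset]


class TimerWrapper:
    """Wrapper: translates standard requests into the core's native calls."""

    def __init__(self, core):
        self.core = core

    def handle(self, req):
        if req.address not in self.core.regs:
            return Response(data=0, error=True)
        if req.write:
            self.core.poke(req.address, req.data)
            return Response(data=req.data)
        return Response(data=self.core.peek(req.address))


class BusInterfaceModule:
    """Bus-side logic: maps a bus address window onto the wrapped core."""

    def __init__(self, base, size, target):
        self.base, self.size, self.target = base, size, target

    def bus_access(self, address, write, data=0):
        if not (self.base <= address < self.base + self.size):
            return Response(data=0, error=True)
        return self.target.handle(Request(address - self.base, write, data))


bim = BusInterfaceModule(base=0x4000, size=0x8,
                         target=TimerWrapper(LegacyTimer()))
bim.bus_access(0x4004, write=True, data=123)
readback = bim.bus_access(0x4004, write=False)
```

Swapping the on-chip bus would mean replacing only BusInterfaceModule in this sketch; the core and its wrapper stay untouched, which is the reuse argument for a standard interface.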
One of the most common methods of communicating between VCs or blocks on-chip is through buses. As applications moved from system-on-board to system-on-chip there was a tendency to migrate board-level bus specifications and protocols to the chip-level application. However, that ignores the advantages offered by chip-level implementation and carries over some of the restrictions that were inherent on the board. At the board level, a key concern is minimizing the number of bus signals because pin and signal count translates directly into package and printed-circuit-board costs. A large number of device pins increases package footprint and reduces component density on the board. System-level buses must support add-in cards and pc board backplanes where connector size and cost are also directly related to signal count. This is why traditional system-level buses use shared tristate signaling and, often, multiplexed address and data signals.
On-chip signal routing consumes silicon area but does not affect the size or cost of packages, pc boards or connectors. Today's chip-fabrication technologies offer multiple layers of metal interconnect at little additional cost over the base fabrication process, which is not usually the case with pc board interconnect layers, which do significantly affect board cost. That being the case, multiplexed signals can be separated and expanded to unidirectional signals on chip without a major cost penalty. In addition, the capabilities and limitations of logic synthesis tools used in chip development directly affect design time and performance and must be taken into account. It is of little value to achieve the lowest possible routing overhead if design time balloons and the market window is missed. Synthesis tools find it difficult to deal with shared tristate signals with several drivers and receivers connected to the same trace. Static timing analysis tools can have difficulty with these signals as well. All of this takes time and effort without adding real value in terms of device functionality or features.
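The contrast between a shared tristate signal and separated unidirectional signals can be sketched behaviorally. The function names below are illustrative assumptions, not part of any tool or standard. The tristate model has two failure cases that must be verified away (contention and a floating bus), while the multiplexed model always produces a defined value.

```python
from typing import Optional, Sequence


def resolve_tristate(drivers: Sequence[Optional[int]]) -> Optional[int]:
    """Shared tristate wire: each driver outputs a value or None (high-Z).

    Returns the bus value, or None if the bus floats because no driver
    is enabled. Raises if more than one driver is enabled at once.
    """
    active = [value for value in drivers if value is not None]
    if len(active) > 1:
        raise RuntimeError("bus contention: more than one driver enabled")
    return active[0] if active else None


def resolve_mux(sources: Sequence[int], select: int) -> int:
    """Unidirectional alternative: every source has its own wires and a
    multiplexer picks one, so the result is always defined."""
    return sources[select]
```

The multiplexed form costs extra wires, which is cheap on chip, and in exchange removes the contention and floating cases entirely.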
Multisourced and -sinked buses add loading that can limit performance; moreover, the verification problems associated with bus loading can lead to a conservative design whose performance falls short of the inherent technology capabilities. Increased loading, bus contention and methods to prevent floating bus signals such as bus keepers not only have an impact on performance and turnaround delays but also will significantly affect chip power. And as SoC designs get larger, unnecessary power consumption affects performance, reliability and system power budgets.
The on-chip world has a significantly different set of design constraints and trade-offs compared with the board-level environment. A bus designed for use on pc boards will not provide the most efficient on-chip solution. What is needed is a completely new bus architecture optimized for SoCs. Key concerns are performance, design time reduction, ease of use, power consumption and silicon efficiency.
The SoC integration architecture
Any processor-driven SoC product requires a number of architectural functions. These include timers, DMA engines, interrupt controllers and memory controllers. In many cost-sensitive applications, a shared memory structure can be utilized to reduce memory component costs. An architecture is needed that addresses the memory needs of all devices without severely degrading the performance of any single device, yet offers the flexibility to support a variety of architectures and a wide range of applications. A proposed integration architecture should display the following attributes:
- Foundry, processor and technology independence
- Centered around shared memory
- Flexible to address a variety of SoC architectures
- Modular for a plug-and-play modification environment
- Easily synthesizable and works with standard design tools
Platform-based SoC design should not impose a burden when a design is directed to different foundries and fabrication process rules. If the product has to be recoded to support another library, one of the major benefits of platform-based design is lost: time-to-market. Processor independence allows derivative applications to embed a processor that best fits that application's requirements. A processor-centric architecture makes this difficult; a memory-centric architecture typically reduces the problem of embedding a new processor to that of replacing the processor local bus bridge, usually only a matter of a few hundred gates.
The flexibility of the architecture allows derivative platform designs to change the number and type of peripheral blocks as well as the type of processor supported, for example a Von Neumann vs. a Harvard-type processor. Modularity is key to making derivative changes efficiently and should provide a plug-and-play development environment so that derivative platforms can be spun off relatively quickly. Obviously, if the architectural components are not able to work efficiently with today's design tools and environment, efficient derivative designs will not be possible. This means avoiding common bus attributes such as tristating, dual-edge clocking of signals, bus keepers and complex signal protocols, all of which make efficient use of design tools difficult.
One SoC architecture that has been offered to meet these criteria is Palmchip's CoreFrame SoC Integration Architecture. It was designed from a blank sheet specifically to optimize it for SoC development and performance, rather than migrating a motherboard and bus model. As such, concerns such as routing and addressing that are important in motherboard design become irrelevant, while on-chip ones such as simplified design and interfacing can be optimized. The architecture does not use the traditional bidirectional bus concepts, which eliminates the need for tristate bidirectional bus drivers. This enhances performance and simplifies on-chip design and verification using standard ASIC design tools.
Communication takes place through "channels" rather than on generic buses. The channel hardware transparently handles address and speed differences among various IP modules, allowing virtually any core to be used by simply providing a channel interface socket, which handles protocol, clock domain, bursting and bandwidth matters. Cores plug in to sockets in the CoreFrame architectural model. The socket channel model is set up to keep to a basic ASIC development flow and tools, which simplifies connecting IP modules into the architecture. DMA communications, CPU instruction and data fetching take place on separate channels, allowing independent high-speed data movement without tying up the CPU bus. Each peripheral appears to software as a FIFO, a relatively simple interfacing standard that facilitates quick and easy construction of the system. The channel-based approach can accommodate multiple clock domains through synchronization FIFOs to allow speed matching without loss of throughput.
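The speed-matching role of a FIFO between a fast producer and a slower consumer can be illustrated with a tick-based model. This is a minimal sketch, not the CoreFrame implementation: ChannelFifo, transfer, the depth and the consumer period are hypothetical names and values, and a single shared tick counter stands in for two real clock domains.

```python
from collections import deque


class ChannelFifo:
    """Bounded FIFO; push() fails when full, which models backpressure."""

    def __init__(self, depth):
        self.depth = depth
        self.items = deque()

    def push(self, word):
        if len(self.items) >= self.depth:
            return False  # full: the producer must retry on a later tick
        self.items.append(word)
        return True

    def pop(self):
        return self.items.popleft() if self.items else None


def transfer(words, depth, consumer_period):
    """Producer offers one word per tick; consumer reads every Nth tick."""
    fifo = ChannelFifo(depth)
    pending = deque(words)
    received = []
    tick = 0
    while pending or fifo.items:
        if pending and fifo.push(pending[0]):
            pending.popleft()  # word accepted by the FIFO
        if tick % consumer_period == 0:
            word = fifo.pop()
            if word is not None:
                received.append(word)
        tick += 1
    return received, tick


received, ticks = transfer(list(range(10)), depth=4, consumer_period=2)
```

In this model the producer stalls only when the FIFO is full, and every word arrives in order; the sustained rate is set by the slower side.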

Channels are typically reserved for high-performance data transfers. Other communications can occur within the SoC platform architecture such as configuration and control of the SoC functional and architectural blocks; block status register reads; and low-speed peripheral transfers under CPU control. This is usually handled on a slower peripheral bus using a simple (that is, nonbursting) protocol. CoreFrame implements this requirement with its PalmBus, which is a single-master, multiple-slave bus where the master is the PalmBus controller that bridges the PalmBus to the CPU. This bus still needs to follow the same basic rules of unidirectional signaling and single-edge clocking to allow easy implementation within a standard ASIC design flow.
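A single-master, multiple-slave peripheral bus reduces largely to address decoding, which can be sketched as follows. The class names, the word-addressed register map and the base addresses are assumptions made for illustration; they do not describe the actual PalmBus signals or protocol.

```python
class RegisterBlock:
    """A slave: a small bank of word-addressed registers."""

    def __init__(self, size):
        self.regs = [0] * size

    def read(self, offset):
        return self.regs[offset]

    def write(self, offset, value):
        self.regs[offset] = value


class PeripheralBusController:
    """The only master: decodes each CPU access to exactly one slave.

    With a single master there is no arbitration, and each access is a
    single nonbursting read or write.
    """

    def __init__(self):
        self.slaves = []  # list of (base, size, block)

    def attach(self, base, block):
        self.slaves.append((base, len(block.regs), block))

    def _decode(self, address):
        for base, size, block in self.slaves:
            if base <= address < base + size:
                return block, address - base
        raise ValueError(f"no slave at address {address:#x}")

    def read(self, address):
        block, offset = self._decode(address)
        return block.read(offset)

    def write(self, address, value):
        block, offset = self._decode(address)
        block.write(offset, value)


bus = PeripheralBusController()
timer, uart = RegisterBlock(4), RegisterBlock(8)
bus.attach(0x100, timer)
bus.attach(0x200, uart)
bus.write(0x102, 0xAB)   # lands in timer register 2
bus.write(0x207, 0xCD)   # lands in uart register 7
```

Adding a peripheral to a derivative design is one more attach() call in this model, with no change to existing slaves.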

Architecture topologies
Processors are typically available in two architectural types: Von Neumann and Harvard. The Von Neumann architecture processor shares the same external buses for instruction fetches and data operations. The Harvard architecture processor utilizes separate buses for instruction fetches and data operations. Most DSP designs are based on the Harvard architecture, but a chosen SoC architecture should easily adapt to either one or to multiple processors of either or both types.
The most common application architecture embeds a single Von Neumann processor with application-specific peripherals. Typical applications may include games, organizers and network controllers and appliances. Because the Von Neumann processor uses the same bus for instruction and data operations, the processor's external bus is connected both to the PalmBus controller (for access to the peripheral blocks) and to a cache (or to an MChannel bridge if no cache is needed) for access to shared memory. Memory accesses may be either for data or for instructions.

When large amounts of time-critical data must be processed, a system can be implemented with a single Harvard architecture processor. Applications where such an implementation is advantageous include image processors and servo controllers. With the Harvard CPU, both of the processor's external buses have dedicated memory channels. The PalmBus controller, which handles configuration and control, is connected to the CPU data bus (since the processor never fetches instructions across the PalmBus). Both instruction and data buses are connected to a cache (or to an MChannel bridge if no cache is needed) to access shared memory. When compared to the Von Neumann implementation, an additional channel is added to the memory access controller for the CPU's second bus; no other changes are needed.
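The difference between the two single-processor topologies can be written down as connection lists. This is a rough sketch; the function names and dictionary layout are hypothetical, and only the connections stated in the text are modeled.

```python
def cpu_connections(kind, cached=True):
    """Return {processor bus: [blocks it connects to]} for one CPU."""
    memory_port = "cache" if cached else "MChannel bridge"
    if kind == "von_neumann":
        # One external bus carries instructions, data and register accesses.
        return {"external bus": ["PalmBus controller", memory_port]}
    if kind == "harvard":
        # The PalmBus controller hangs off the data bus only.
        return {
            "instruction bus": [memory_port],
            "data bus": ["PalmBus controller", memory_port],
        }
    raise ValueError(f"unknown processor type: {kind}")


def memory_channels(connections):
    """Count the channels the memory access controller must provide."""
    return sum(
        1
        for targets in connections.values()
        for target in targets
        if target != "PalmBus controller"
    )


von_neumann = cpu_connections("von_neumann")
harvard = cpu_connections("harvard", cached=False)
```

Counting memory-side connections gives one channel for the Von Neumann case and two for the Harvard case, matching the single additional channel mentioned above.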

Many systems require both time-critical data processing and a host controller. In these systems, a dual-processor implementation can be advantageous. A Von Neumann processor is used for control functions, since these processors are more compact than Harvard ones; a Harvard architecture processor is utilized for data processing. Dual-processor applications may include cellular phones, digital cameras and graphics processing. If peripherals used by the control processor are independent of those used by the data processor, separate configuration and control buses (PalmBus) can be implemented. Connections to the processors are identical to those described in the previous Von Neumann and Harvard implementations.

In the above example, either processor can access peripherals. Thus, two PalmBus controllers are implemented with arbitration to a single shared PalmBus. Access to shared peripherals is controlled in software, for example via semaphores.
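Software control of a shared peripheral through a semaphore can be sketched with two threads standing in for the two processors. This is an illustrative assumption only: SharedPeripheral, the burst contents and the use of Python threads are not from the article, and a real system would use whatever semaphore primitive its processors and operating software provide.

```python
import threading


class SharedPeripheral:
    """A peripheral that must be owned by one processor per transaction."""

    def __init__(self):
        self.semaphore = threading.Semaphore(1)  # binary semaphore
        self.log = []

    def transaction(self, owner, words):
        # Acquire before the first access, release after the last one.
        with self.semaphore:
            for word in words:
                self.log.append((owner, word))


def run(owner, peripheral, bursts):
    for burst in bursts:
        peripheral.transaction(owner, burst)


peripheral = SharedPeripheral()
bursts = [[0, 1, 2]] * 50  # fifty three-word transactions per processor
workers = [
    threading.Thread(target=run, args=(name, peripheral, bursts))
    for name in ("control CPU", "DSP")
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
```

Because each three-word transaction holds the semaphore from start to finish, the log never shows words from the two owners interleaved within a transaction, although whole transactions from the two owners may alternate in any order.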
Architectures for increasing bandwidth
The architectural implementations shown above are basic, fundamental architectures that perform well in most SoC applications because channel-based, memory-centric architectures are excellent at meeting low-power requirements without sacrificing throughput. However, their ability to handle applications that require increased bandwidth and multiple on-chip data transfers is limited. Applications such as networking, routers, hubs and switches often need greater bandwidth and often need to transfer data over simultaneous paths. At the same time, the demand for high-speed, multipath data transfer must not starve any embedded control or host processor of its instructions.
Channel-based platform architectures can support a switching fabric on-chip that allows multiple DMA devices (and processors) to communicate with multiple output channels simultaneously. These output channels can be connected to external memory, internal memory or non-DMA blocks. An on-chip switched fabric allows access to functions such as the memory controller connecting to external memory as well as routing on-chip resources such as on-chip memory and non-DMA devices to DMA, CPU or DSP devices.

The example shown above illustrates a more complex dual-processor system with an on-chip dual-port RAM. In this implementation, the CPU can execute from flash while simultaneously processing data from a DMA peripheral in the SDRAM. The DSP can at the same time process data from the dual-port RAM while another peripheral is transferring data to or from the RAM. Using a channel-based, socketed architecture, no changes to any blocks are necessary except for the insertion of a switched resource router needed for the processors and DMA peripherals to take best advantage of all available bandwidth.
In this limited set of examples, it can be seen that a flexible SoC integration architecture can easily accommodate system platform requirements without sacrificing shared resources or cost. In addition, such flexibility allows designs to be ported to a wide variety of applications without modification to the peripheral blocks or processor subsystems. Interconnect modifications necessary to meet system requirements should be minimal and preferably automated, making such an architecture an excellent platform upon which to base many different SoC designs.
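The bandwidth benefit of a switched fabric over a single shared path can be estimated with a simple scheduling model. This is a sketch under assumptions: the schedule function, the one-transfer-per-cycle granularity and the master and target names are hypothetical, and the transfer list is only one possible reading of the example above. A transfer is granted in a cycle if neither its master nor its target is already in use; with shared_bus=True only one transfer is granted per cycle.

```python
def schedule(requests, shared_bus=False):
    """Group (master, target) requests into cycles of granted transfers."""
    pending = list(requests)
    cycles = []
    while pending:
        busy_masters, busy_targets = set(), set()
        granted, deferred = [], []
        for master, target in pending:
            conflict = master in busy_masters or target in busy_targets
            if shared_bus and granted:
                conflict = True  # a shared bus carries one transfer at a time
            if conflict:
                deferred.append((master, target))
            else:
                granted.append((master, target))
                busy_masters.add(master)
                busy_targets.add(target)
        cycles.append(granted)
        pending = deferred
    return cycles


transfers = [
    ("CPU instruction port", "flash"),
    ("CPU data port", "SDRAM"),
    ("DSP", "dual-port RAM port A"),
    ("DMA peripheral", "dual-port RAM port B"),
]
fabric_cycles = schedule(transfers)
bus_cycles = schedule(transfers, shared_bus=True)
```

With four transfers aimed at four different targets, the fabric model grants them all in one cycle while the shared-bus model needs four; two masters contending for the same target are still serialized in either case.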



