Configurable CPU a boon in router design

Richard Norman, Chief Technology Officer, Hyperchip Inc., Montreal, Jim Turley, Vice President of Marketing, ARC Cores Inc., San Jose, Calif.

1/28/2000 4:14 PM EST

When engineers at Hyperchip Inc. started development on a network router that scales from 64 Gbits/second to more than 1 petabit/s, the traditional approaches of using either hard-coded logic or a standard microprocessor architecture would have created difficult problems.

Although Hyperchip's high-end designs generally require hard-coded logic for maximum performance, in complex areas the technique of freezing functionality into custom logic is both time-consuming and risky. Design requirements can change, forcing extensive rework or expensive respins of silicon. Even worse, new external standards can be established, requiring in-field hardware replacements to upgrade customer equipment.

But while the processor/software approaches allow last-minute design changes and even field upgrades, Hyperchip's extreme performance requirements exceed the embedded processing power per square millimeter available through traditional processor architectures. One of embedded design's most prevalent traditions is the use of microprocessors that were designed originally for RISC workstations. Some of the most popular 32-bit chips today were powering Unix workstations a few years ago. But just because these chips were successful in computers doesn't automatically mean they're suited for embedded applications.

Embedded design often calls for creating something new, different and unique. Yet the basic tradition of using general-purpose processors continues. Processor chips with poor code density and reduced features are now being pressed into service in embedded applications, and the programmer's job is to map the task at hand onto the fixed resources of a processor designed with extremely different trade-offs in mind.

In developing the core ASIC for the petabit router, Hyperchip determined that freezing all functionality into fixed logic would have cost several months of the design cycle, and would have exposed us to unacceptable design-change and standards-upgrade risks. But with Hyperchip's design calling for over 300 pipelines per ASIC operating in parallel, there was no way a standard microprocessor approach could be used either; using 300 of even the smallest traditional processor cores would have ballooned the ASIC die size to an unmanufacturable 40 mm on a side.
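As a rough check on the die-size figure above, the short sketch below works out what a 40-mm-square die implies per core. Only the 40-mm side and the count of roughly 300 pipelines come from the text; attributing the entire die to processor cores is a simplifying assumption, so the result is an upper bound rather than a measured core size.

```python
# Back-of-the-envelope check of the die-size figure quoted above.
# Assumption: the whole die is attributed to the processor cores, so the
# per-core number is an upper bound, not a measured value.

DIE_SIDE_MM = 40.0   # "40 mm on a side" (from the text)
CORE_COUNT = 300     # one traditional core per pipeline (from the text)

die_area_mm2 = DIE_SIDE_MM ** 2
area_per_core_mm2 = die_area_mm2 / CORE_COUNT

print(f"die area: {die_area_mm2:.0f} mm^2")
print(f"upper bound per core: {area_per_core_mm2:.2f} mm^2")
```

Even at about 5 mm² per core, 300 cores add up to 1,600 mm² of silicon, which is why the text calls the result unmanufacturable.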

Hyperchip therefore adopted a hybrid approach. The most common types of pipelines are the simplest, and they are immune to external changes. These pipelines were thus hard-coded with little effort and risk. The most complex pipelines were fewer in number and vulnerable to both internal and external changes, so a microprocessor-based approach was chosen for those.

Although the hybrid design was great in theory, a number of problems still had to be resolved to put it into practice. Each of the more sophisticated processors had to act as a coordinator for a large number of the simpler, hardwired accelerator pipelines, and traditional processors communicate with peripherals through a bus. But sharing a bus among the accelerators would have sapped their performance. Furthermore, even the smallest traditional processor was still several times too large.

What Hyperchip really needed was a configurable processor that could be adapted to our specific design, down to the register sizes, numbers and instruction-set mix.

Ideally, every aspect of the processor would be under our control: the programming model (including register set, condition codes and memory map), the instruction set and the peripheral interfaces. With these degrees of freedom, our designers could literally create the ideal processor for our task and our task alone.

A significant advantage of this kind of approach is that the traditional partition between hardware and software design becomes more of a transparent screen. Programmers can have their favorite instructions implemented (including ones they just thought up that morning), and hardware designers can adjust internal structures and peripheral interfaces to suit the task at hand. A few custom instructions and registers can eliminate the need for a lot of extra hardware. For example, an atomic queue management instruction can reorder packets in multiply-linked dynamic lists without interlock hardware, or quality-of-service algorithms can be implemented with only a few dedicated instructions instead of lengthy subroutines.

Once we'd started thinking outside of the box this way, it was hard to go back in. But since creating a sufficiently optimized processor on our own would have added another huge task to our already lengthy list, we looked at the commercially available configurable processor options, and chose the processor and tool suite from ARC Cores.

Almost unlimited freedom

Even with no previous ARC experience, our hardware and software engineers found they could easily create a custom 32-bit processor to suit our specific needs. The ARC verification and co-design tool suite provides almost unlimited freedom for customizing the processor. The fundamentals of the ARC design have been well seasoned and proven in dozens of ASIC designs, so the custom ARC-based processors are correct by construction, and there are even safeguards to ensure novices don't create an unworkable core that can't be built or programmed.

The ARC tool suite includes a hardware configuration manager (the ARChitect) that allowed our engineers to customize and design our own processors, as well as a complete software tool chain (including C/C++ compiler, assembler, linker, debugger and profiler) from commercial software providers. Every design decision an engineer makes modifies both the hardware (the processor itself) and the software (the tool chain) simultaneously, so there are no worries about "breaking" the compiler.

With the ARC approach, our engineers could customize all aspects of the processor: programming model, instruction set, registers and bus interfaces. A basic ARC processor has thirty-two 32-bit registers, but designers can add as many extra registers as they like. More interesting is the ability to add condition codes. In addition to the usual programmer's flags (zero, overflow, negative, carry), engineers can create 16 additional condition codes that can be anything they can imagine: conditions like "set flag if packet has priority greater than 128 or is the last packet in a frame."
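The custom condition code quoted above can be modeled in a few lines of software. This is a behavioral sketch only, not ARC's configuration format; the `Packet` fields and the name `urgent_flag` are hypothetical, and only the threshold of 128 and the "last packet in a frame" test come from the text.

```python
from dataclasses import dataclass

PRIORITY_THRESHOLD = 128  # "priority greater than 128" (from the text)


@dataclass
class Packet:
    priority: int        # hypothetical field name
    last_in_frame: bool  # hypothetical field name


def urgent_flag(pkt: Packet) -> bool:
    """Model of one custom condition code: true if the packet's priority
    exceeds the threshold or it is the last packet in its frame."""
    return pkt.priority > PRIORITY_THRESHOLD or pkt.last_in_frame


# A conditional operation then tests one flag instead of running a
# compare, a bit test and a logical OR in software.
for pkt in (Packet(200, False), Packet(128, False), Packet(5, True)):
    print(pkt, "->", "fast path" if urgent_flag(pkt) else "normal path")
```

In hardware the same predicate would be evaluated as part of the instruction that sets the flag, so the software cost of the test shrinks to a single conditional branch.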

Even more exciting was the ability to customize the instruction set. Every ARC implementation includes a basic set of operations (add, move, load/store, etc.) needed by every application. ARC provides a library of common, but optional, instructions such as byte-swap, normalize (count leading bits) and MAC (multiply-accumulate). ARC even offers designers the choice between a small multiplier (which conserves transistors) or a fast one (which improves performance), or even DSP-like saturating arithmetic and 24-bit or dual 16-bit MAC units.
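For readers less familiar with these optional operations, the sketch below gives plain software models of each. These are generic 32-bit definitions chosen for illustration; the function names are hypothetical, and ARC's actual instruction semantics (for example, exactly which leading bits normalize counts) may differ.

```python
MASK32 = 0xFFFFFFFF
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)


def byte_swap32(x: int) -> int:
    """Reverse the byte order of a 32-bit word (endianness conversion)."""
    return int.from_bytes((x & MASK32).to_bytes(4, "big"), "little")


def count_leading_zeros32(x: int) -> int:
    """Count zero bits above the most significant set bit of a 32-bit word.
    Assumption: a simple leading-zero count stands in for 'normalize'."""
    return 32 - (x & MASK32).bit_length()


def mac32(acc: int, a: int, b: int) -> int:
    """Multiply-accumulate, wrapping the result to 32 bits."""
    return (acc + a * b) & MASK32


def sat_add32(a: int, b: int) -> int:
    """Signed 32-bit add that clamps at the limits instead of wrapping."""
    return max(INT32_MIN, min(INT32_MAX, a + b))


print(hex(byte_swap32(0x11223344)))
print(count_leading_zeros32(1))
print(mac32(10, 3, 4))
print(sat_add32(INT32_MAX, 1) == INT32_MAX)
```

Each of these takes several instructions on a core without the extension, which is the trade-off a designer weighs against the added gates.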

Beyond that, our engineers had a free hand. New instructions are defined using either VHDL or Verilog; with just a few lines, engineers can define the operations of the new instruction, its op code and its mnemonic. A similar C language template defines how the instruction should be called. From then on, that instruction is included in the processor, the assembler, the C compiler, the debugger and the rest of the tools.
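The idea that one definition feeds both the hardware and the tools can be illustrated with a toy model. In the sketch below a single record holds an instruction's mnemonic, op code and operation, and both a stand-in assembler and a stand-in simulator read from it. Everything here is hypothetical: the `csadd` instruction, the op code values and the function names are invented for illustration and are not ARC's template format, which uses VHDL or Verilog.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class InstructionSpec:
    mnemonic: str                         # name seen by assembler and compiler
    opcode: int                           # encoding seen by the hardware
    operation: Callable[[int, int], int]  # behavior of the instruction


REGISTRY: Dict[str, InstructionSpec] = {}


def define_instruction(spec: InstructionSpec) -> None:
    """Single point of definition; rejects duplicate names and op codes."""
    if spec.mnemonic in REGISTRY:
        raise ValueError(f"mnemonic {spec.mnemonic!r} already defined")
    if any(s.opcode == spec.opcode for s in REGISTRY.values()):
        raise ValueError(f"op code {spec.opcode:#x} already in use")
    REGISTRY[spec.mnemonic] = spec


def assemble(mnemonic: str) -> int:
    """Toy assembler: look up the op code for a mnemonic."""
    return REGISTRY[mnemonic].opcode


def execute(opcode: int, a: int, b: int) -> int:
    """Toy simulator: run whichever operation owns this op code."""
    for spec in REGISTRY.values():
        if spec.opcode == opcode:
            return spec.operation(a, b)
    raise ValueError(f"undefined op code {opcode:#x}")


def ones_complement_add16(a: int, b: int) -> int:
    """16-bit add with end-around carry, as used in Internet checksums."""
    total = (a & 0xFFFF) + (b & 0xFFFF)
    return (total & 0xFFFF) + (total >> 16)


define_instruction(InstructionSpec("add", 0x00, lambda a, b: (a + b) & 0xFFFFFFFF))
define_instruction(InstructionSpec("csadd", 0x2A, ones_complement_add16))

print(hex(assemble("csadd")), execute(assemble("csadd"), 0xFFFF, 0x0001))
```

Because the assembler and the simulator consult the same registry, adding an instruction cannot leave one of them out of step with the other, which is the property the text describes.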

As programmers were designing a register file and instruction set, hardware designers focused on such things as bus structure. The basic ARC processor has four buses, each with completely configurable timing and protocol: one for instructions; one for load/store data; one for a back-side interface into the processor core (useful for debugging); and a fourth bus for anything the designer deems necessary. This bus is often used for high-bandwidth, data-streaming applications in networking and telecommunications.

No bottlenecks

In the key petabit router ASIC, Hyperchip's main goals were to tie the accelerators together without performance bottlenecks, and to minimize the processor size so that we could fit one per transmitter, or 32 per chip. While the ARC's four high-performance buses would have been sufficient for almost any design, Hyperchip needed unshared processor access to as many as 16 accelerators. The ARC supported this as well, allowing us to map the accelerators directly to its registers and eliminating the need for buses altogether. The elimination of buses further doubled I/O throughput, on average, and minimized area as well.
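Mapping accelerators to registers can be pictured with a small behavioral model. In this sketch, register numbers 0 to 31 are ordinary core registers and higher numbers read straight from an accelerator's output, with no shared bus to arbitrate. The class and the numbering scheme are assumptions made for illustration; the text gives only the 32-register baseline and the limit of 16 accelerators.

```python
from typing import Callable, List

MASK32 = 0xFFFFFFFF
CORE_REGISTERS = 32    # basic register count (from the text)
MAX_ACCELERATORS = 16  # "as many as 16 accelerators" (from the text)


class RegisterFile:
    """Hypothetical model: registers 32 and up are accelerator outputs."""

    def __init__(self, accelerators: List[Callable[[], int]]) -> None:
        if len(accelerators) > MAX_ACCELERATORS:
            raise ValueError("too many accelerators")
        self._core = [0] * CORE_REGISTERS
        self._accelerators = accelerators

    def read(self, index: int) -> int:
        if not 0 <= index < CORE_REGISTERS + len(self._accelerators):
            raise IndexError(f"no register {index}")
        if index < CORE_REGISTERS:
            return self._core[index]
        # Direct, unshared access: one accelerator per register number.
        return self._accelerators[index - CORE_REGISTERS]() & MASK32

    def write(self, index: int, value: int) -> None:
        if not 0 <= index < CORE_REGISTERS:
            raise IndexError("accelerator registers are read-only in this sketch")
        self._core[index] = value & MASK32


# Sixteen stand-in accelerators, each returning a fixed result.
accelerators = [lambda n=n: 100 + n for n in range(MAX_ACCELERATORS)]
regs = RegisterFile(accelerators)

# An ordinary register-to-register add now consumes accelerator results.
regs.write(1, regs.read(32) + regs.read(47))
print(regs.read(1))
```

From the programmer's side an accelerator result is just another operand, so no load instruction or bus transaction appears in the instruction stream.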

Minimizing the size was also mostly a matter of elimination. The uncustomized ARC processor's area was already only one-fourth that of the typical embedded RISC processor. By eliminating unneeded instructions, Hyperchip engineers reduced the core's already small instruction set from 35 to 16, for an additional 40 percent savings in die area and power consumption/heat generation. Mapping the ARC processor's registers to the physical hardware interfaces eliminated the need for a separate bus interface unit altogether, allowing Hyperchip to pack in 32 ARC cores per chip while leaving plenty of room for its proprietary routing accelerators. We felt like Luddites for using the ARC's fantastic flexibility more for stripping down the processor than for adding to it, but in a matter of days we were able to customize and integrate exactly the ARC processor we needed.
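The area figures above compound, and the short calculation below makes the combined effect explicit. It assumes the 40 percent savings is measured against the uncustomized ARC core, which the text implies but does not state outright; the variable names are only for illustration.

```python
from fractions import Fraction

# Figures taken from the text.
uncustomized_vs_typical_risc = Fraction(1, 4)  # "one-fourth" the area
trim_savings = Fraction(40, 100)               # "additional 40 percent savings"
instructions_before, instructions_after = 35, 16

# Assumption: the 40 percent applies to the uncustomized ARC core's area.
trimmed_vs_typical_risc = uncustomized_vs_typical_risc * (1 - trim_savings)
instructions_removed = instructions_before - instructions_after

print(f"trimmed core area vs. typical RISC: {float(trimmed_vs_typical_risc):.0%}")
print(f"instructions removed: {instructions_removed}")
```

Under that assumption the trimmed core occupies about 15 percent of the area of a typical embedded RISC processor.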




