600-MHz DSP handles more channels, prepped for 3G basestations
Tod Wolf, Dale Hocevar, Alan Gatherer, Texas Instruments, Dallas, Texas; Patrick Geremia, Armelle Laine, Texas Instruments, Nice, France
5/13/2002 7:22 AM EDT
Viterbi and turbo decoder coprocessors have been added to a 600-MHz DSP. This approach, to be presented at the Custom Integrated Circuits Conference, can support up to 350 voice channels and/or 28 data channels for baseband processing at 718 milliwatts. The following article contains excerpts from the paper, titled "A 600-MHz DSP for Baseband Processing in 3G Basestation," and is reprinted with permission from IEEE 2002 CICC.
The explosive growth in wireless cellular systems is expected to continue with the introduction of 3G systems that will integrate significant amounts of data communication with voice communication, all at higher user capacities than previous systems. Because of their increased computational requirements, 3G base stations are harder to build than 2G base stations. The added computation stems from more complex algorithms, higher data rates, and the desire for more channels per hardware module.
For the 3G base station architecture, a cost-effective and synergistic solution may lie in using a 600-MHz DSP with two coprocessors: a Viterbi decoder (VCP) and a turbo decoder (TCP). This solution is truly a system-on-a-chip. The concept is to use a coprocessor when there are regularized functions that can be realized with very high silicon efficiencies relative to the DSP. Another feature is to incorporate a high degree of flexibility into each coprocessor so that it can be used as a platform for multiple basestation solutions developed by multiple OEMs with differing requirements. This allows each DSP to handle a larger number of channels and/or to incorporate advanced algorithms.
The DSP can process eight 32-bit instructions per cycle or 4800 MIPS. The DSP core is based on the VelociTI very-long-instruction-word (VLIW) architecture used in the C62x core, but also includes an enhanced instruction set architecture. The DSP is 100% source code compatible and object code compatible with the C62x DSP core. It supports four 16-bit multiply-accumulates (MACs) per cycle and new specific instructions for imaging and communications algorithms.
The DSP uses a two-level cache-based architecture and has a powerful and diverse set of peripherals. The level 1 instruction memory is a 16-kbyte direct-mapped cache and the level 1 data memory is a 16-kbyte 2-way set-associative cache. The level 2 1024-kbyte memory/cache is shared between program and data space. The peripherals include two external memory interfaces (EMIF), an enhanced direct memory access controller, a host port interface, three multichannel buffered serial ports, three 32-bit general-purpose timers, and a PLL.
A Viterbi decoder is typically used to decode the convolutional codes used in wireless applications. The algorithm consists of two steps. The first computes the state (or path) metrics forward through the code's trellis. The second, known as traceback, uses the stored results from the first step and traverses backward through this data to construct the most likely transmitted codeword.
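The two steps can be illustrated with a small software model. The following is a minimal sketch, not the VCP's implementation: it assumes a rate-1/n code with hard-decision inputs, an encoder that starts in state 0, and a shift-register convention in which the newest input bit is the most significant bit. The names encode and viterbi_decode are hypothetical.

```python
def parity(x):
    """Return the XOR of all bits of x."""
    return bin(x).count("1") & 1

def encode(bits, polys, K):
    """Rate-1/n convolutional encoder (assumed convention: newest bit is the MSB)."""
    state, out = 0, []
    for b in bits:
        reg = (b << (K - 1)) | state
        out.extend(parity(reg & g) for g in polys)
        state = reg >> 1
    return out

def viterbi_decode(received, polys, K):
    """Hard-decision Viterbi decoder: forward metric pass, then traceback."""
    n = len(polys)
    n_states = 1 << (K - 1)
    mask = n_states - 1
    neg_inf = float("-inf")
    sm = [0.0] + [neg_inf] * (n_states - 1)   # encoder assumed to start in state 0
    decisions = []

    # Step 1: update the state metrics forward through the trellis.
    for t in range(len(received) // n):
        r = received[t * n:(t + 1) * n]
        new_sm = [neg_inf] * n_states
        dec = [0] * n_states
        for nxt in range(n_states):
            for d in (0, 1):                  # d is the LSB of the previous state
                reg = (nxt << 1) | d
                prev = reg & mask
                # Branch metric: number of received bits that agree with this branch.
                bm = sum(1 for g, rb in zip(polys, r) if parity(reg & g) == rb)
                cand = sm[prev] + bm
                if cand > new_sm[nxt]:
                    new_sm[nxt], dec[nxt] = cand, d
        sm = new_sm
        decisions.append(dec)

    # Step 2: traceback over the stored decisions, starting from the best final state.
    state = max(range(n_states), key=lambda s: sm[s])
    bits = []
    for dec in reversed(decisions):
        bits.append(state >> (K - 2))         # the input bit is the MSB of the state
        state = ((state << 1) | dec[state]) & mask
    bits.reverse()
    return bits
```

With the constraint-length-3 polynomials 0b111 and 0b101, a frame that ends in two zero tail bits and carries a single flipped channel bit decodes back to the original message in this model.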
The state metric unit performs the forward computational part of the Viterbi algorithm, which consists of performing the add-compare-select (ACS) operations on the states moving forward over the trellis. The ACS operations are typically done by working on pairs of states that form the butterfly structures of the trellis. The operations consist of adding the previous state metrics (SM) to the respective branch metrics (BM) and, at each next-state node, selecting the maximum value for the new SM and saving the decision bit that denotes which branch was chosen. The set of branch metrics for a stage in the trellis is derived from the received data, usually at one time point. A trellis stage is a column of butterflies in the trellis.
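A single trellis stage of ACS butterflies might be modeled as follows. This is a sketch under assumptions: the state numbering (previous states 2j and 2j+1 feeding next states j and j + N/2) is one common convention and not necessarily the VCP's, and the names acs, acs_stage and branch_metric are hypothetical.

```python
def acs(sm0, bm0, sm1, bm1):
    """Add-compare-select: add each SM to its BM, keep the larger sum.

    Returns (new_sm, decision_bit); the decision bit records which branch won.
    """
    c0 = sm0 + bm0
    c1 = sm1 + bm1
    return (c1, 1) if c1 > c0 else (c0, 0)

def acs_stage(sm, branch_metric):
    """Update all state metrics for one trellis stage, one butterfly at a time.

    sm is the list of previous state metrics. branch_metric(prev, nxt) is a
    caller-supplied function (an assumption of this sketch) that returns the
    BM for the branch from state prev to state nxt.
    """
    n = len(sm)
    half = n // 2
    new_sm = [0] * n
    decisions = [0] * n
    for j in range(half):
        a, b = 2 * j, 2 * j + 1          # the butterfly's two previous states
        for nxt in (j, j + half):        # the butterfly's two next states
            new_sm[nxt], decisions[nxt] = acs(
                sm[a], branch_metric(a, nxt), sm[b], branch_metric(b, nxt))
    return new_sm, decisions
```

Each butterfly reads two old state metrics and writes two new ones, so a stage with N states needs N/2 butterfly operations.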
The VCP uses a cascade structure of four ACS units capable of operating on a radix-16 trellis. We use this cascade as a higher radix ACS unit coupled with a state metric memory to allow operation on trellises up to 256 states; this represents a different utilization of the cascade ACS structures than previous approaches. In addition, a unique form of register exchange is incorporated in the structure to achieve partial pretraceback over the active length of the cascade.
Typically, Viterbi decoders are designed to solve only one, or very few, code structures. Thus, very simple solutions exist for selecting BM for each pair of ACS operations. However, for a generalized decoder, one allowing arbitrary code polynomials, multiple code rates and multiple trellis sizes, this selection problem is non-trivial. This is compounded further when using a cascade structure of length four since then there are four simultaneous selection operations required, each with a unique order.
The problem of BM selection depends upon the code polynomials and the state indices of the trellis butterfly for the current ACS operation. A BM index for each ACS operation can be generated which selects from a small set of BM for each trellis stage. This BM index is simply the bits that result from applying to the encoder one of a butterfly's input state indices and setting the (fictitious) input bit to the encoder as 0 or 1 depending upon which of the two state indices is used.
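The index generation described above can be sketched in a few lines. This is an illustration rather than the hardware logic: it assumes a rate-1/n code, polynomials given as integers with the input bit in the most significant position, and soft inputs whose positive values favor a transmitted 1. The function names are hypothetical.

```python
def parity(x):
    """Return the XOR of all bits of x."""
    return bin(x).count("1") & 1

def bm_index(state, input_bit, polys, K):
    """Run the encoder on (state, hypothetical input bit) and pack its
    output bits into an integer; that integer is the BM index."""
    reg = (input_bit << (K - 1)) | state
    idx = 0
    for g in polys:
        idx = (idx << 1) | parity(reg & g)
    return idx

def branch_metric_set(soft):
    """Build the small set of BMs for one trellis stage from n soft values.

    Entry idx is the correlation between the received values and the
    encoder output pattern idx (assumed sign convention: positive means 1).
    """
    n = len(soft)
    table = []
    for idx in range(1 << n):
        total = 0
        for k, r in enumerate(soft):
            bit = (idx >> (n - 1 - k)) & 1
            total += r if bit else -r
        table.append(total)
    return table
```

For a rate-1/2 code the set holds only four values per stage, so the selection problem reduces to computing a 2-bit index for every branch.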
In the hardware implementation for this process, each BM selection unit selects and distributes BMs for two ACS units in the cascade data path. Each selection unit receives as input the state indices in the correct order for the first stage of the cascade and must transform these indices into the correct state indices for each of the two cascade states it serves.
To find the BM index from one such state index, the hardware essentially implements the encoder function with the user-supplied code coefficients and the hypothetical input bit. The set of BMs for each cascade stage is stored in holding registers; the BM is selected for each butterfly using the necessary BM index and is then routed to the correct ACS unit.
The traceback process involves moving backwards over the stored path decisions from the SM update process. The process starts with a particular state and constructs the path back to the start of the trellis by obtaining the related decision data from the traceback memory. The output bits are produced during these operations. The VCP, because of the pretraceback done in the cascade unit, can move backwards in steps of 4, 3, 2, or 1 trellis stages depending upon the constraint length and the phase. This allows the overall traceback process to operate effectively at a slower rate. To simplify hardware, all pretraceback nibbles are treated as four-bit items, though to reduce I/O frequency they are always stacked into a 32-bit memory word.
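The stacking of four-bit items into 32-bit words can be pictured with a short sketch. The bit ordering (first nibble in the least significant position) is an assumption made for illustration; the article does not specify the VCP's layout, and the function names are hypothetical.

```python
NIBBLES_PER_WORD = 8  # eight 4-bit items fit in one 32-bit word

def pack_nibbles(nibbles):
    """Stack 4-bit items into 32-bit words (assumed order: first nibble lowest)."""
    words = []
    for i in range(0, len(nibbles), NIBBLES_PER_WORD):
        word = 0
        for k, nib in enumerate(nibbles[i:i + NIBBLES_PER_WORD]):
            word |= (nib & 0xF) << (4 * k)
        words.append(word)
    return words

def unpack_nibble(words, index):
    """Fetch the nibble at position index from the packed words."""
    word = words[index // NIBBLES_PER_WORD]
    return (word >> (4 * (index % NIBBLES_PER_WORD))) & 0xF
```

Packing this way means one memory access serves eight traceback items, which is the I/O saving the paragraph above refers to.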
The top-level architecture of the VCP consists of three major units: state metric unit, traceback unit and DSP interface unit. The state metric unit can perform 600 x 10^6 ACS butterfly operations per second, and the VCP can decode at a rate of 4.7 Mbit/s. This is equivalent to well over 350 voice channels for 3G wireless systems. The VCP 64-bit interface to the DSP operates at 150 MHz, the majority of the gates at 150 MHz and the state metric unit at 300 MHz. The VCP has 52K gates and 57K bits of RAM.
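The quoted decode rate is consistent with simple arithmetic if one assumes a 256-state trellis (constraint length 9, the largest size mentioned for the VCP) and one decoded bit per trellis stage. The sketch below is a back-of-the-envelope check under those assumptions, not a figure taken from the paper; the function name is hypothetical.

```python
def decode_rate_bps(butterflies_per_sec, constraint_length):
    """Decoded bits per second, assuming one bit per trellis stage."""
    states = 1 << (constraint_length - 1)
    butterflies_per_stage = states // 2   # each butterfly updates two states
    return butterflies_per_sec / butterflies_per_stage

rate = decode_rate_bps(600e6, 9)          # 4687500.0, about 4.7 Mbit/s
per_channel = rate / 350                  # roughly 13.4 kbit/s per voice channel
```

Under the same assumptions, a shorter constraint length would need fewer butterflies per stage and would decode proportionally faster.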
The VCP allows any puncturing pattern and has parameterized methods for partitioning frames for traceback, so that frame size essentially does not matter and the convergence distance can be specified for partitioned frames. Thus, the VCP implementation can decode virtually any desired convolutional code found in the 2G, 2.5G, and 3G wireless standards.
Flexible control allows the TCP to be configured to work in several modes. In the conceptually simplest mode, the DSP loads an entire block of data to the TCP, and the TCP iteratively decodes the block. Each iteration consists of two MAP decodes. The first MAP decode processes non-interleaved data and the second processes interleaved data. The TCP controller is in charge of writing the correct systematic, parity and prior data to the MAP decoder. After a successful decode, the DSP retrieves the corrected data. The TCP uses the sliding window technique, which breaks the block into several smaller blocks. Each smaller block can be processed independently with the addition of a small prolog section. This type of architecture reduces the memory required to store the beta state metrics by 90% at a cost of 9% more cycles.
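The memory-versus-cycles trade-off of the sliding window technique can be shown with a small model. The block, window and prolog sizes below are hypothetical, chosen only so the ratios land near the quoted figures; they are not the TCP's actual parameters, and the cost model (one beta step per trellis position) is a simplification. The function names are hypothetical.

```python
def sliding_windows(block_len, window, prolog):
    """Split a block into windows for the backward (beta) recursion.

    Returns (start, end, prolog_end) tuples. Beta for a window is computed
    backward from prolog_end down to start; the steps between prolog_end
    and end only warm up the recursion and their results are discarded.
    """
    wins = []
    for start in range(0, block_len, window):
        end = min(start + window, block_len)
        prolog_end = min(end + prolog, block_len)  # last window needs no prolog
        wins.append((start, end, prolog_end))
    return wins

def sliding_window_cost(block_len, window, prolog):
    """Return (beta memory fraction, extra beta steps fraction)."""
    wins = sliding_windows(block_len, window, prolog)
    extra_steps = sum(prolog_end - end for _, end, prolog_end in wins)
    memory_fraction = window / block_len  # store one window of betas, not the block
    return memory_fraction, extra_steps / block_len

# Hypothetical sizes: 1000-position block, 100-position windows, 10-step prolog.
memory_fraction, extra_fraction = sliding_window_cost(1000, 100, 10)
```

With these illustrative sizes, beta storage drops to a tenth of the block while the prolog steps add about nine percent more beta computations.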
The MAP controller can configure the MAP decoder architecture to simultaneously perform alpha and beta updates as well as the output update from the extrinsic block. The alpha block processes the sliding window in a forward direction and the beta block processes the sliding window in a backward direction. As is usual in turbo decoders, the iterative beta calculation is performed first; the iterative alpha calculation then runs concurrently with the extrinsic calculation, which uses the latest alpha output together with the previously derived betas. Therefore, beta storage is needed but alpha storage is not. A pipelined architecture allows four beta blocks to be generated in parallel with four alpha and four extrinsic blocks. By this technique, eight independent sliding windows can be processed simultaneously, which gives the design a high throughput.
The final design decodes at a rate of 11.3 Mbit/s with eight iterations, which is enough to process 28 channels at 384 kbit/s. Though this is more than the capacity of most basestations, it allows the turbo decoding to occur with low latency, which is a desirable requirement in the overall system. The TCP can support both 3G wireless standards. It also supports code rates of 1/2, 1/3 or 1/4. The TCP uses a stopping criterion that generates the SNR of the extrinsics, compares the calculated SNR with a programmable threshold, and stops the decoder when the threshold has been reached. This algorithm saves the TCP both cycles and power.
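An early-stopping rule of this kind might look like the sketch below. The article does not give the TCP's SNR formula, so the estimate used here (squared mean of the extrinsic magnitudes divided by their variance) is an assumption, as are the function names and the run_iteration callback that stands in for one turbo iteration.

```python
def extrinsic_snr(extrinsics):
    """Estimate an SNR from extrinsic values (assumed formula: mean^2 / variance
    of the magnitudes). Returns infinity when all magnitudes are equal."""
    mags = [abs(x) for x in extrinsics]
    mean = sum(mags) / len(mags)
    var = sum((m - mean) ** 2 for m in mags) / len(mags)
    return float("inf") if var == 0 else mean * mean / var

def decode_with_early_stop(run_iteration, max_iters, threshold):
    """Run turbo iterations until the extrinsic SNR reaches the threshold.

    run_iteration is a hypothetical callable that performs one iteration and
    returns that iteration's extrinsic values. Returns the iterations used.
    """
    for i in range(1, max_iters + 1):
        if extrinsic_snr(run_iteration()) >= threshold:
            return i
    return max_iters
```

Blocks received at high channel quality tend to reach the threshold after a few iterations, which is where the cycle and power savings come from.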



