Design Article

FPGA Synthesis of Custom DSP Blocks Using Distributed Arithmetic

M. Martinez-Peiró,R. Gadea, R.Colom, F. Ballester, and V. Herrero

6/8/2001 12:00 AM EDT

 

 
ABOUT THE AUTHORS

M. Martinez-Peiró, R. Gadea, R. Colom, F. Ballester, and V. Herrero are teachers on Digital Applications at the Technical University of Valencia (Spain). They are currently working on video, custom DSP, HDL simulators, digital filtering, and image lifting.
 

The high density of currently available FPGAs allows designers to develop single-chip FPGA-based DSP blocks. The combination of DSP technology and FPGA implementation is sometimes known as a Custom Digital Signal Processor (CDSP). A CDSP lets you select the precision, inner data width, and word length for a particular application. In this paper, we explore CDSP synthesis using Distributed Arithmetic (DA). Initial studies of DA started thirty years ago. Peled and Liu, Croisier et al. , and Zohar enunciated the theoretical basis of DA and how to use it to design digital filters. Several DA reviews of DA are available. This type of arithmetic takes advantage of technologies rich in memory elements such as RAM-based FPGAs. In these devices, designers can use lookup-tables and adders to compute the inner products of a mathematical computation.

This paper describes a methodology to implement FPGA-based FIR filters. The authors use a VHDL description to design a parameterized, full-parallel, FIR filter macro-model and have validated the design by compiling several versions in Altera 10K50-3 FPGAs. The use of DA for FIR filter design is reviewed and the basic cells used to build the macro-model is described. In addition, the results of the FPGA implementation of the FIR filter are shown.


Distributed Arithmetic
In FIR filtering, you compute inner products with the expression:

where At are the fixed coefficients, T is the number of filter-taps, and xt are the input data words (Equation 2).

where N is the bit-number of the data, xt0 the sign bit, and xt,n is bit "n" of xt.

You can rewrite Equation 1 as:

which suggest the base of the distributed arithmetic. You can pre-calculate the following terms in Equations 4 and 5. Each xt,n takes the value '0' or '1' and n can take one of the 2(N-1) possible values, all of which are possible sums of coefficients At.

    and    

Equations 4 and 5 produce a new expression shown as Equation 6:

You can expand Equation 6 into individual product terms where x1,0 is the bit 0 of the first sample, x2,0 is the bit 0 of the second sample, and so on.

Each term in brackets in Equation 7 is stored in ROM, RAM, or a Look-Up Table (LUT). The 2(N-1) different values represented by these terms are addressed by xt,n. Table 1 shows how the term storage is accomplished.


Address  
xt,N-1 ... xt,2 xt,1 xt,0 Content
0 ... 0 0 0 0
0 ... 0 0 1 A1
0 ... 0 1 0 A2
0 ... 0 1 1 A2 + A1
1 ... 1 1 1 AT... + A2 + A1

Table 1:  A ROM, RAM, or LUT stores distributed arithmetic pre-calculated terms

By accumulating the N values of n with appropriate weighting factors, you can implement digital filters using distributed arithmetic by using registers, memory elements, and a scaling accumulator. Figure 1 represents a serial implementation of a digital FIR filter based on the expression of Equation 6.

Figure 1:   A full-serial FIR filter based on distributed arithmetic

The scaling accumulator adds each n between 1 and n-1, and subtracts the 0 term, which corresponds to the MSB of the input samples. You can also implement his structure in a mixed format (Figure 2), or in a full-parallel fashion (Figure 3).

Figure 2: Mixed-format version of a DA-based FIR filter

Figure 3: Full-parallel version of a DA-based FIR filter

The main advantage of FIR filters is their linear-phase response. These filters achieve this response with symmetrical or anti-symmetrical structure, which also offer hardware reductions. You can therefore express Equation 7 as:

Equation 8 reduces the number of multiplications to T/2 or (T-1)/2 for even or odd filters respectively. If you need an anti-symmetrical FIR filter, change the addition in the parenthesis to a subtraction.


DA-Based Altera-Oriented Design Methodology
Each logic cell in Altera FPGAs has a four-input LUT. With the logic cell, you can perform any Boolean function of four variables. The most efficient way in terms of area and speed to implement filters in these FPGAs is to limit the number of variables of the n function to four. Note that you need three four-input LUTs to implement a five-variable function. Due to this requirement, the partitioning of the sum of T terms into K sums of four variables reduces required memory to store n from 2T/2 words (T/2 due to symmetry considerations) to:

For example, for a 16-tap FIR filter (T=16) the number of storage elements you need for a direct implementation of n is 28. If the sum is divided by K=2, needing 4n sums (8n=4n+4n), the storage you need becomes 24+24 = 25 words.


Number of Taps Number of n Number of Taps Number of n
7, 8 4n 17, 18 24n+1n
9, 10 4n+1n 21, 22 24n+3n
11, 12 4n+2n 43, 44 54n+2n
15, 16 24n 63, 64 84n

Table 2:  Distributed Arithmetic Blocks reduce FPGA resource requirements for implementing filter functions

This algorithm can be easily computed in a recursive manner:

for i in 4 down to 1

(T/2) mod i = number of(in) ;

(T/2)=(T/2)-i* number of(in) ;

where mod is the arithmetic operator "module" and number of(in is the number of type in blocks.

Figure 4:   Eight-Tap Linear-Phase FIR filter

Figure 4 shows a structure of an eight-tap linear-phase FIR filter. You delete the shadowed register and adder for an odd FIR filter. To design filters with a number of taps higher than eight, for example a 12-tap FIR filter, the structure comprises two blocks. One block represents an eight-tap filter (with only four-input LUTs) and the other a three-tap filter (with only two-input LUTs).

Figure 5:   12-tap FIR filter pipeline implementation

Figure 6:43-tap FIR filter

Figure 7:81-tap FIR filter

Pipeline and Parameter Description
Pipeline implementation increases the speed of the filter model. However, FPGA speed is limited by the device's routing delays. The pipeline has been done by registering each cell, shown by the dark lines in Figure 5. This option does not increment the size of functional area, since each logic cell is composed of a LUT and a register. In the pipeline versions, the latency you get is 5+log2(number of block), where 5 is the latency of the block.

You can use VHDL to describe each of the above blocks in a parameterized macro-module. With the model, you can vary several FIR-filter parameters including the number of taps, the kind of structure (symmetrical or anti-symmetrical) and the use of pipelining. The filter implementation, based on company recommendations, uses Altera's lpm_add_sub function.


FPGA Implementation
We have implemented several filters in an Altera FPGA. The chip chosen for the implementations is the EPF10K50VRC240-3, which comprises 3184 flip-flops, 2880 Logic Elements and 20,480 RAM bits. Filter implementation uses default placement and routing compilation options. Furthermore, the implementation used carry-chain lines to improve the speed of the whole adders.

Figure 8:   Maximum frequency of the DA FIR filters

Figure 8 shows the speed of both pipelined and non-pipelined versions of the filters. Note that the pipelined-circuit frequencies are practically constant, close to 40MHz, even with increasing number of filter taps. It effect is caused by the path between two pipelined registers, that can reach a high interconnection delay. In non-pipelined filters, the maximum frequency is similar for all versions, around 14MHz, and speed is essentially tap-independent.

Figure 9:   Occupation of the DA FIR filters

Figure 9 shows the occupation for both pipelined and non-pipelined filters. The horizontal lines represent the maximum capacity of each chip in the FLEX10K family (from 10K10 to 10K50). The graph shows that you can fit up to a 13-tap FIR filter (non-pipelined) in a EFP10K10. Extrapolating the graph show that you can implement up to 45-tap FIR filter in an EPF10K50.

We have described a methodology to construct high-speed FIR filters in Altera FPGAs. You can use this methodology to design symmetric or anti-symmetric filter structures, and can also include pipelining. Use of distributed arithmetic contributes filters' high-speed. We have also built a parameterized macro-model using synthesizable VHDL. You can use the resulting filters in real-time applications up to 40 MHz. This upper-frequency is essentially independent of the number of filter taps. Results also show that you can implement a 45-tap FIR filter in an Altera EPF10K50 FPGA.





Please sign in to post comment

Navigate to related information

Datasheets.com Parts Search

185 million searchable parts
(please enter a part number or hit search to begin)

Feedback Form