Design Article
FFT Core for Adaptive Cruise Control Using FPGAs
Christober Rayappan
5/10/2005 12:00 AM EDT
A new type of speed control called Adaptive Cruise Control (ACC) (Figure 1) is being used on some new model vehicles. These systems allow you to set a following distanceor time intervalbetween your vehicle and the vehicle ahead and lets you set the maximum speed. When you come up to a slower moving vehicle in front of you, or if a vehicle changes into your lane, the system automatically closes the throttle and/or applies brake pressure to slow the vehicle and maintain the set following distance. All of the ACC systems available today are built around sensors that detect the vehicle ahead through the use of radar.
The frequency shift between the original sent signal and the received signal gives the distance and the relative velocity between the two vehicles. Capturing and processing this information in real time requires the use of math intensive digital signal processing (DSP) algorithms. Software processing cannot meet these performance requirements, and it often takes several conventional DSP processors to perform such high-speed tasks. The speed critical process (signal processing block) can be designed using FPGAs.
Signal Processing Block
To meet the performance requirements, the signal processing block is implemented using FPGAs. Radar works by bouncing electromagnetic energy off the target, recording the echo and making some useful observation from the data. Hilbert transform is used for the phase shift of the incoming signal. Decimating the filter input by four involves discarding three out of four samples. The matched filter can be decomposed into a sum of four sub-filters, each of which filters a particular phase of the input signal. Bit parallel structure can be used for implementing matched filters in FPGAs (Figure 2).
Velocity Analysis
Another important part of the ACC system is to determine the velocity of the objects. This is done in the frequency domain of the input signal. In order to detect the frequency shift between the original sent signal and the received signal, the DC offset is taken into account.
This is obtained by subtracting the mean of the signal from the signal itself. The FFT of both the sent and received signal is taken to see the spectrum of the two. The shift of the spectrum from the one to the other signal shows the Doppler shift. The peak of the received signal is subtracted from the peak of the sent signal to obtain the Doppler shift.
The velocity is calculated by (Equation 1). Note that because sampling of the signal is done at a given sampling rate, this equation needs to be modified a little by dividing the right hand side by the sampling frequency as well.
(1)
Solving this equation for the velocity (v), you can identify the velocity of the moving object.
From the above discussion it is clear that the FFT processing is performed to display the spectrum of the Doppler return, which translates directly into velocity, after considerations for geometry. The character of the spectrum (noisy, smooth, wide band, narrow band) is an indication of the vehicle. The processing required by the FFTs depends on the rates selected by the operator. The 25 KHz Doppler signal can be rate converted up to give finer frequency (velocity) resolution, and can be performed lapped (overlapping sliding buffers) to improve low velocity sensitivity of the system. All of these effects translate into an equivalent number of 1K complex FFTs per second. As the number of FFTs per second increase, the computational demand on the processor increases. At the same time, the demands of the processor for data increase as well.
The DSP Power of an FPGA
The following characteristics describe the DSP power of FPGAs:
- Possible to efficiently and cost effectively implement these DSP functions
- Increases integration possibilities and potentially reduces the number of discrete components
- Multi-billions MACs per second.
- Tremendous parallel processing capability
- High memory bandwidth
- High I/O bandwidth and flexible interfaces
- Supports 20-High speed signal and memory interface standards (for example, LVD, SSTL).
Since FFT needs lots of computational power and efficiency than the other signal processing blocks (Hilbert transform, Matched filter), this paper concentrates on developing the FFT core for ACC systems.
Detailed Design Description
The architecture of the FFT butterfly is depicted in Figure 5. From this it can be seen that the design contains the following components:
- Adders
- Multipliers
- Two's complement circuit
- Twiddle factor ROM
This paper will cover the design of the multiplier and adder with saturation in detail. Following this design entry, layout and simulation strategies are described. The reader is assumed to be familiar with various FFT algorithms and their classification



.
Design of the Butterfly
This function implements the Radix-2 decimation in frequency butterfly described by the following equations:
(2)
(3)
where a and b are the complex inputs to the butterfly, A and B are the complex outputs and
(4)
is the complex twiddle factor. The number of adders and multipliers in the design depends on the input (complex or real).
Figure 4
Assumptions made before the design process:
- Input is a two's complement number
- Input is a sign extended 8-bit number
- 8-bit adders and multipliers are used
- Input is a fractional number
- Fixed point representation is followed.
The Radix-2 butterfly requires six adders and four multipliers with some two's complement circuit to represent the negative numbers. The design block is given in Figure 5.
The carry look ahead adder is used for the design of adders.
Overflow Detection and Saturation in Fractional Addition
Let A=an-1,an-2,...a1a0 B=bn-1,...b1b0 be two n-bit integers. The resultant sum S=A+B is also an n-bit integer. Overflow occurs when the result S cannot be accommodated in an n-bit result. Thus, when two number are added, the detection of overflow is given by the expression
(5)
Where Cn-1 and Cn-2 are the carry values obtained when the most significant magnitude digit an-2 and bn-2 and the sign digits an-1 and bn-1 are added. In order to identify the sign of the overflow, two other functions can be defined
(6)
The saturated values for overflow detection are as follows:
- If D=1 and D+ = 1 then saturated value S=sn-1,sn-2..s1s0 is S=011...11
- If D=1 and D- = 1 then saturated value S=sn-1,sn-2..s1s0 is S=10...00
- If D=0 then S=sn-1,sn-2..s1s0
There is an implied binary point between the most significant bit and sign bit. The above equations can be implemented with multiplexer and some control circuitry, as shown in Figure 6a.
|
Design of Digital Multipliers
Multipliers are one of the most important components in digital signal processing applications. Most of the digital signal processors provide built-in hardware multipliers for high-speed operations. These multiplier units should be designed for high speed, low complexity in hardware implementation and have regular structure for ease in implementing in the hardware. There are various multiplier algorithms for implementing the same. The most popular multiplier algorithms widely used are Baugh-Wooley and Booth's algorithms. The Baugh-Wooley algorithm is used for partial product generation.
The Baugh-Wooley Algorithm
Charles Baugh and Bruce Wooley gave an algorithm for 2's complement multiplication
. The two's complement multiplication is converted to an equivalent parallel addition, in which each partial product bit is the AND of a multiplier bit and multiplicand bit, and signs of the entire partial product bit are positive. With this algorithm, all the subtractor in the hardware can be replaced by equivalent adders.
Let A=a7a6a5a4a3a2a1a0 B=b7b6b5b4b3b2b1b0 be the two 8-bit two's complement integers. The Baugh-Wooley algorithm of multiplication is given by figure 6b.
To obtain the final product, these partial products are added column wise.
Partial Product Addition
The majority of hardware for multipliers comes from the addition of partial products to get the final product output. This requires adders for adding the partial products. There are various algorithms for adding partial products in a multiplier, such as the Wallace tree and "overturned stairs" adder trees.
"Overturned Stairs" Adder Trees
Multioperand adders are widely needed for designing multipliers, FFT, digital filters, and inner product computers. Wallace trees are known for their optimal computation time, when adding multiple operands to two numbers using carry save adders (CSA). Wallace trees are very difficult to implement because of their complex wiring. The design of a Wallace tree is often too expensive to justify the speed obtained. The Overturned Stairs tree introduced by Zhi-Jian (Alex) Mou is implemented for multiplier design
. An important feature of OS trees is their recursive structure, which leads to a regular and compact design.
OS Tree Structure
The structure of the OS tree is determined by the number of words that need to be added. As shown in Figure 7, the width and height of the tree increases when the number of input (words) is increased. The tree height, nOS is given by
(7)
where NOS is the input words to the OS tree.
The width of the OS tree is given by NOS/3
When the input bits are high the tree height of a Wallace tree is less than that of an OS tree. For most common DSP applications, the word length is not particularly longfor instance, 12 to 24 bits. In these cases, there is no direct advantage to using Wallace trees
. The Wallace tree height is only one level lower compared to the OS tree
.
Some basic building blocks can be identified in the trees (Figure 7): body, root, connector, and branch. These building blocks can be used to form the general structure of the OS tree. The general structure can be easily derived. Assume that we have a tree of height nOS = j+1. A body of height j-1, a connector and a branch can construct the body of height j. The connector is constructed by two CSAs. The three outputs from the connector appear as inputs to the root, which is constructed by a single CSA. The branch height j-2 is formed by j-2 CSAs on top of each other with proper interconnections. Using these properties, an OS tree of arbitrary tree height can be constructed in an iterative way.
|
An 8-bit OS tree, which is designed for implementing FFT, is shown in Figure 8.
Underflow Detection/Overflow or Saturation in Fractional Multiplication
When two n-bit integers A&B are multiplied, the result is a 2n bit number. The final product value is always greater then or equal to the greatest number among A&B. In the final result, when the only n least significant number is retained, an overflow may occur. However, in two n-bit fractional numbers, the result is still a 2n bit number, but the magnitude of the resultant product is lower than the lowest number among the A&B. Hence there exists a possibility of underflow in fractional multiplication. Overflow in fractional multiplication occurs when the multiplication -1X-1 is performed.
In order to keep the word length constant as required in many applications, only the n most significant bits are retained out of 2n bits in fractional multiplication
.
The method used to retain 'n' most significant bits is either truncation or rounding. In truncation, the n least significant bits are truncated to retain n most significant bits. In rounding, the n least significant bits are truncated and for the remaining result after truncation a value 0.00...1 is added so that the final value is the value nearest to the original value. Thus for fractional multiplication, the least 'n' significant bits are retained. If the truncated number is a positive fraction, then add 0.0...1 to the retained value if truncated bits does not contain all zeros. If the truncated number is a negative fraction, then no correction factor is added and just 'n' most significant bits are related as the final value. The hardware circuit for the above concept can be implemented using multiplexer and some logic circuits (Figure 9).
|
Implementation of an 8-bit by 8-bit Multiplier
The 8-bit by 8-bit multiplier 2's complement Baugh-Wooley multiplier was implemented in the Xilinx XCV-1000-5bg 560 FPGA. This FPGA has an 1000K-gate capacity.
|
After place and route and back annotation, the maximum frequency of operation was limited to 50 MHz and resources utilized were 1% of the total available. For implementing a butterfly block of radix-2 FFT the resources utilized are 6% of the total available slices. The simulation results are shown in Figure 11 and Figure 12.
|
Implementation of a Radix-2 FFT
The 8-point Radix-2 FFT was implemented in the Xilinx XCV-1000-5bg 560 FPGA. The gate capacity of the FPGA is 1000k.
After place and route and back annotation, the number of slices consumed by the Radix-2 FFT is 2402 out of 12288, which is 19% of the total capacity. The maximum frequency of operation was limited to 18.485 MHz. The frequency operation is reduced due to more hardware, which is required to built the FFT. However if the hardware is optimized, frequency of operation can be increased.
|
.
Implementing a butterfly block of the Radix-2 FFT uses 6% of the total available slices in the Xilinx XCV-1000-5bg 560 FPGA (with a gate capacity of 1000K). The whole FFT block consumes 19% of the total capacity. The maximum frequency of operation was limited to 18.485 MHz. The power consumed by the butterfly unit is 540mw.
This butterfly design remains the same for any length of the Radix-2 FFT. The average error due to the quantization of input to 8 bits and due to the 8-bit length of the multiplier and adder is 35% compared to the Matlab output (32 bit precision). Therefore, it is necessary to increase the input bit size and the length of the multiplier and the adder, but the design remains the same. The matched filter which has multipliers and adders can be designed easily in FPGAs, since the basic building blocks are constructed. This increases the speed of operation, which plays a vital role in adaptive cruise control.
Christober Rayappan is working as a technical consultant at Infineon Technologies. His areas of intrest include finite word length effects in signal processing, spectral analysis, and coding theory. He holds a Masters degree in communication engineering from the Vellore Institute of Technology. He is currently involved in the development of DSP libraries for Infineon 16-bit microcontrollers. His email address is Rayappan.External@infineon.com.


