DSP convolutions offer pixel-by-pixel video filtering control

David Katz, Rick Gentile, Applications Engineers, Blackfin Applications Group, Analog Devices Inc., Norwood, Mass.

4/7/2003 11:57 AM EDT

Until recently, designers needing to perform video or image analysis in real time, a typical requirement of medical, industrial and military systems, had to resort to expensive specialized processors. With the advent of fixed-point, high-performance embedded media processors, however, it has become possible to do image processing economically in real time. To develop truly efficient algorithms, it is essential for designers to take advantage of the architectural features these processors provide.

This article discusses how digital image-filtering algorithms can leverage the multimedia-friendly features of an embedded media processor's architecture. The features and instruction set for Analog Devices' Blackfin DSP are used as a reference point, but the same concepts apply to many high-performance media processors as well.

Although the clock speeds of fixed-point processors now reach beyond 300 MHz, this speed increase alone doesn't guarantee the ability to accommodate real-time video filtering. Just as important are multimedia-geared architectural features and video-specific instructions. Any filtering operation takes a signal and runs it through a comb (or a sieve). In DSP terms, this means multiplying the digital number that represents a sample of the original signal by a digital number (or polynomial function) representing the filter operation, and then accumulating the result.

Most video applications need to deal with 8-bit data, since individual pixel components (whether RGB or YUV) are usually byte quantities. Therefore, 8-bit video arithmetic logic units and byte-based address generation can make a huge difference in manipulating pixels. This is a nontrivial point, because DSPs typically operate on 16-bit or 32-bit boundaries.

Consider for a moment the demands video filtering places on an image processor: For a VGA image (640 x 480 pixels/frame) at 30 frames/second, there are 9.2 Mpixels/s. Now suppose each pixel is filtered with a 3 x 3 kernel, which takes nine multiplies and eight accumulates. If those operations must be done serially, that's (9+8)*9.2 = 156 million instructions/s. If the accumulates are done in parallel with the multiplies, the load is reduced to 9*9.2 = 83 Mips.
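The arithmetic above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the function names are hypothetical, and it assumes one 8-bit component per pixel and one operation per instruction.

```python
def pixel_rate(width, height, fps):
    """Pixels per second, assuming one component per pixel."""
    return width * height * fps


def ops_per_second(rate, multiplies=9, accumulates=8, parallel_mac=False):
    """Operations per second for one pass of a filter kernel.

    With parallel_mac=True, each accumulate is assumed to share a cycle
    with its multiply, so only the multiplies are counted.
    """
    per_pixel = multiplies if parallel_mac else multiplies + accumulates
    return rate * per_pixel


vga = pixel_rate(640, 480, 30)                      # 9,216,000 pixels/s
serial = ops_per_second(vga)                        # 156,672,000 ops/s
parallel = ops_per_second(vga, parallel_mac=True)   # 82,944,000 ops/s
```

The exact serial figure is slightly above the 156 million quoted in the text because the text rounds the pixel rate to 9.2 Mpixels/s before multiplying.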

In addition to Mips, the image processor can make use of a flexible data register file. In traditional fixed-point DSPs, word sizes are usually fixed. However, there is an advantage to having data registers that can be treated as either a 32-bit word (for example, R0) or two 16-bit words (R0.L and R0.H, for the low and high halves, respectively). The utility of this structure will become apparent below.
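As a rough illustration of the split-register idea, the sketch below models a 32-bit word whose two 16-bit halves can be read independently. It is plain Python with hypothetical helper names, not Blackfin code, and it assumes unsigned values.

```python
MASK16 = 0xFFFF  # selects one 16-bit half


def pack_halves(high, low):
    """Combine two unsigned 16-bit values into one 32-bit word."""
    return ((high & MASK16) << 16) | (low & MASK16)


def low_half(word):
    """Model of reading the low half of a register (R0.L in the text)."""
    return word & MASK16


def high_half(word):
    """Model of reading the high half of a register (R0.H in the text)."""
    return (word >> 16) & MASK16


r0 = pack_halves(0x1234, 0xABCD)   # one 32-bit fetch carries two operands
```

Packing two operands into one word means a single 32-bit load can supply data for two 16-bit operations.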

Finally, dedicated single-cycle instructions can be very convenient for providing efficient multimedia-coding algorithms. A good example is a "sum of absolute differences" instruction, which can add up differences among several pixel sets simultaneously, indicating how much a scene has changed between frames.
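What a sum-of-absolute-differences operation computes can be sketched in scalar form as follows. The function name and the sample pixel values are made up for illustration; a hardware instruction would process several pixels per cycle rather than looping.

```python
def sum_abs_diff(block_a, block_b):
    """Sum of absolute differences between two equal-length pixel blocks."""
    if len(block_a) != len(block_b):
        raise ValueError("blocks must have the same number of pixels")
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


previous_frame = [10, 200, 35, 90]   # hypothetical 8-bit pixel values
current_frame = [12, 190, 35, 100]
change = sum_abs_diff(previous_frame, current_frame)   # 2 + 10 + 0 + 10 = 22
```

A result of zero means the two blocks are identical; larger values indicate more change between frames.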

Since a video stream is really an image sequence moving at a specified rate, image filters need to operate fast enough to keep up with the succession of input images. Thus, it is imperative that image filter kernels be optimized for execution in the lowest possible number of processor cycles. This can be illustrated by examining a simple image filter set based on two-dimensional convolution.

Convolution is one of the fundamental operations in image processing. In a two-dimensional convolution, the calculation performed for a given pixel is a weighted sum of the light-intensity values from pixels in its immediate neighborhood. Since the neighborhood of a mask is centered on a given pixel, the mask usually has odd dimensions. The mask size is typically small relative to the image, and a 3 x 3 mask is a common choice, because it is computationally reasonable on a per-pixel basis, but large enough to detect edges in an image.

As an example, the output of the convolution process for a pixel at row 20, column 10 in an image would be:

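Out(20,10) = A*In(19,9) + B*In(19,10) + C*In(19,11) + D*In(20,9) + E*In(20,10) + F*In(20,11) + G*In(21,9) + H*In(21,10) + I*In(21,11)

Here A through I are the nine coefficients of the 3 x 3 mask, read left to right and top to bottom, and In(r,c) is the input pixel at row r, column c.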

It is important to choose coefficients in a way that aids computation. For instance, scale factors that are powers of two (including fractions) are preferred, because multiplications can then be replaced by simple shift operations. The delta function is among the simplest image manipulations, passing the current pixel through without modification.
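The two points in this paragraph can be sketched briefly: scaling by a power of two with a shift instead of a multiply, and a delta mask that passes the center pixel through unchanged. The helper names and the sample neighborhood are illustrative assumptions, and the shift helper assumes non-negative integer pixel values.

```python
def scale_by_power_of_two(value, exponent):
    """Multiply by 2**exponent using shifts only.

    A negative exponent is a fractional scale factor (for example,
    exponent=-3 means 1/8) and truncates toward zero for value >= 0.
    """
    if exponent >= 0:
        return value << exponent
    return value >> -exponent


# Delta mask: only the center coefficient is nonzero.
DELTA_MASK = [[0, 0, 0],
              [0, 1, 0],
              [0, 0, 0]]


def weighted_sum(neighborhood, mask):
    """Weighted sum of a 3 x 3 pixel neighborhood with a 3 x 3 mask."""
    return sum(p * m
               for pixel_row, mask_row in zip(neighborhood, mask)
               for p, m in zip(pixel_row, mask_row))


sample = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
center = weighted_sum(sample, DELTA_MASK)   # 5, the center pixel
eighth = scale_by_power_of_two(200, -3)     # 200 / 8 = 25
```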

On the edge

With edge-detection masks, on the other hand, one mask detects vertical edges, while another detects horizontal edges. In the output of either mask, high values correspond to higher degrees of edge presence.

A smoothing filter can also be utilized with the same 3 x 3 kernel. It performs an average of the values of the eight surrounding pixels and places the result at the current pixel location. This has the result of "smoothing," or low-pass filtering, the image.

A filter operation known as an "unsharp masking" operator produces an edge-enhanced image by subtracting from the current pixel a smoothed version of itself (constructed by averaging the eight surrounding pixels).
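The edge-detection, smoothing and unsharp-masking filters described above can be sketched for a single 3 x 3 neighborhood as follows. The article does not list the mask coefficients, so the ones used here (unit column and row differences for the edge masks, a 1/8 weight on each neighbor for smoothing) are assumptions chosen for illustration. All function names are hypothetical.

```python
def neighbor_sum(n):
    """Sum of the eight pixels surrounding the center of a 3 x 3 block."""
    return sum(sum(row) for row in n) - n[1][1]


def smooth(n):
    """Average of the eight neighbors; the divide by 8 is a shift."""
    return neighbor_sum(n) >> 3


def unsharp(n):
    """Center pixel minus its smoothed version (edge enhancement)."""
    return n[1][1] - smooth(n)


def vertical_edge(n):
    """Right column minus left column; responds to vertical edges."""
    return sum(row[2] - row[0] for row in n)


def horizontal_edge(n):
    """Bottom row minus top row; responds to horizontal edges."""
    return sum(n[2]) - sum(n[0])


flat = [[50, 50, 50], [50, 50, 50], [50, 50, 50]]   # no edge present
step = [[0, 0, 100], [0, 0, 100], [0, 0, 100]]      # vertical edge
```

On the flat block every edge response is zero, while the step block produces a large vertical-edge value and no horizontal-edge response.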

Let's take a closer look at the two-dimensional convolution process. The high-level algorithm (the road map that applications programmers follow) can be described by the following steps:

1. Place the center of the mask over an element of the input matrix.

2. Multiply each pixel in the mask neighborhood by the corresponding filter mask element.

3. Sum each of the multiplies into a single result.

4. Place each sum in a location corresponding to the center of the mask in the output matrix.

After each output point is computed, the mask is moved to the right by one element. On the image edges, the algorithm wraps around to the first element in the next row. As a result, the usable section of the output matrix is reduced by one element along each edge of the image.
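The four steps above can be sketched as a straightforward reference implementation. It is a minimal sketch, not optimized code: the function name is hypothetical, the mask is applied without flipping (matching the Out(20,10) example earlier), and instead of modeling the wraparound at the image edges it simply leaves the one-pixel border of the output at zero.

```python
def convolve3x3(image, mask):
    """Apply a 3 x 3 mask to every interior pixel of a 2-D image.

    image is a list of equal-length rows; mask is a 3 x 3 list of
    coefficients. Border pixels of the output are left at zero.
    """
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):              # step 1: center the mask
        for c in range(1, cols - 1):
            acc = 0
            for i in range(3):
                for j in range(3):
                    # steps 2 and 3: multiply and accumulate
                    acc += image[r + i - 1][c + j - 1] * mask[i][j]
            out[r][c] = acc                   # step 4: store at the center
    return out


image = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]
delta = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

passthrough = convolve3x3(image, delta)   # interior equals the input
box_sum = convolve3x3(image, ones)        # interior holds 3 x 3 sums
```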

The "inner" loop of the DSP operation is where the multiply/accumulate (MAC) operations are performed. By aligning the input data properly, both MAC units on a processor like the Blackfin can be used to process two output points at a time. During the same cycle, multiple data fetches occur in parallel with the MAC operation. Not only are multiple arithmetic operations occurring each cycle, but load/store operations also take place in parallel to achieve even greater efficiency. (A detailed example of this process — the register operations for each cycle — can be found at www.planetanalog.com/features.)
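The idea of producing two output points per pass through the inner loop can be modeled as below. This is only a data-flow sketch in Python with hypothetical names: the two accumulators stand in for the two MAC units, but nothing here actually executes in parallel, and it does not represent the Blackfin register operations.

```python
def convolve_row_pairs(image, mask, r):
    """Compute the interior outputs of row r, two adjacent columns at a time."""
    cols = len(image[0])
    out = [0] * cols
    c = 1
    while c + 1 < cols - 1:                # columns c and c+1 are both interior
        acc0 = acc1 = 0
        for i in range(3):
            row = image[r + i - 1]
            for j in range(3):
                coeff = mask[i][j]          # one coefficient feeds both MACs
                acc0 += row[c + j - 1] * coeff   # stands in for MAC unit 0
                acc1 += row[c + j] * coeff       # stands in for MAC unit 1
        out[c], out[c + 1] = acc0, acc1
        c += 2
    if c < cols - 1:                       # one leftover interior column
        out[c] = sum(image[r + i - 1][c + j - 1] * mask[i][j]
                     for i in range(3) for j in range(3))
    return out


rows = [[1, 2, 3, 4, 5],
        [6, 7, 8, 9, 10],
        [11, 12, 13, 14, 15]]
all_ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
middle = convolve_row_pairs(rows, all_ones, 1)
```

Because adjacent output points share six of their nine input pixels, computing them together lets each loaded pixel and coefficient be used twice.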

As image filtering goes, two-dimensional convolution with a 3 x 3 mask is relatively straightforward to implement. However, selecting a processor designed for real-time image processing and understanding its architectural components can increase algorithm efficiency and reduce cycle time, in this case by a factor of four. This understanding, in turn, can provide a strong foundation for implementing more complex image-processing functionality on the same platform.

See related chart





