Design Article
Introduction to MPEG-4 Video Compression
Richard Quinnell
9/15/2004 12:00 AM EDT
Digital video currently follows the MPEG-2 standard, but improvements in image processing technology are set to move MPEG-4 to the forefront of video compression.
Millions of DVD disks, satellite receivers, and streaming media processors have utilized the video compression schemes defined in the Moving Picture Experts Group (MPEG) standards. Most existing content uses the MPEG-2 scheme, but that may soon change. The MPEG committee last year approved the MPEG-4 standard, which handles video images as objects that form only part of a rich multimedia experience. In addition to an enhanced experience, MPEG-4 provides the advanced video coding (AVC) compression process that cuts the bit rate by as much as 50% for the same image quality as MPEG-2.
While popular, the MPEG-2 standard has demonstrated some drawbacks in its implementation. Chief among these are a tendency toward block-like artifacts in low bit-rate implementations and drifting image quality when the encoder and decoder lose correlation. Further, many experts feel that the basic algorithms in MPEG-2 are reaching their limits in terms of compression efficiency. Meanwhile, demand for higher-resolution content over lower-bandwidth channels requires continual improvement in efficiency.
To meet these demands, the MPEG committee defined the MPEG-4 standard. The standard goes beyond video compression, however, to define a set of tools for encoding and transmitting multimedia content that can include video, text, sound, and animation as independent objects, allowing each media type to be handled independently for maximum efficiency. Further, the object-oriented approach allows the media to be combined as overlays and video windows of different sizes, shapes, and transparency without compromising any object's data. The standard also supports scalable image quality, digital rights management, and interactivity, and "future-proofs" itself by providing a mechanism for incorporating new encoder/decoder (codec) image compression schemes as technology improves.
Sprite coding is another of the image tools defined in the MPEG-4 standard. Sprites are text or graphics overlays that are static or nearly so. The ability to handle sprites separately from the final image allows the MPEG-4 coder to send the sprite image only once, not every image frame.
Scalability is a third tool that MPEG-4 provides, one that allows an encoded stream to match several different transmission channel bandwidths simultaneously or adapt to varying channel capacity. The standard allows the data stream to carry a baseline image along with image quality enhancements that the decoder can apply. If the channel bandwidth is low, only the baseline image need be transmitted. As channel capacity increases, the enhancements can also be sent to improve the decoded image quality.
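The layering idea can be sketched in a few lines of Python. This is only an illustration of the principle, not the MPEG-4 bitstream syntax: the function name, layer names, and bit rates are hypothetical values chosen for the example.

```python
from typing import List, Tuple

def select_layers(layers: List[Tuple[str, int]], channel_kbps: int) -> List[str]:
    """Pick the baseline plus as many enhancement layers as the channel can carry.

    `layers` is ordered baseline first; each entry is (name, bit rate in kbit/s).
    The names and rates used below are hypothetical illustration values.
    """
    chosen = []
    used = 0
    for name, rate in layers:
        if used + rate > channel_kbps:
            break  # each enhancement builds on the layers below it, so stop here
        chosen.append(name)
        used += rate
    return chosen

LAYERS = [("base", 300), ("enhancement-1", 400), ("enhancement-2", 800)]
print(select_layers(LAYERS, 350))   # ['base']
print(select_layers(LAYERS, 1000))  # ['base', 'enhancement-1']
```

The same encoded stream serves both channels; only the number of layers actually sent differs.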
These tools, including the image compression tools, do not define image processing algorithms, however; they define a method for handling images. The difference is that an algorithm completely specifies the detailed processing needed to produce the final image, while the MPEG-4 tools define only the sequence and functions of the processing steps. The actual implementation of these functions is up to the designer. Thus, MPEG-4 compliant encoders and decoders can vary in their resultant image quality and computational efficiency.
To promote compatibility among MPEG-4 encoder and decoder implementations, the committee defined a series of compression "profiles" that further define the processing steps. These profiles specify which image compression tools are to be used, the resolution levels for processes such as motion compensation and quantization, and the type of data compression coding to be employed. Profiles are further broken down into performance levels that place limits on image size and encoded data bandwidth.
To navigate the details of the MPEG-4 profiles, it is helpful to first have an understanding of the basic processing steps used in the video compression process. These basic steps, common to all the MPEG standards, are outlined in Figure 1. The steps include mode selection, temporal compression, spatial compression, and entropy encoding. Although this discussion focuses on the luminance (intensity) component of an image, the processing steps apply to both the luminance and chrominance (color) components of a YCrCb color coded signal. The chrominance component processing uses a lower resolution than applies to the luminance signal, reflecting the eye's relatively limited response to color.
Figure 1: The basic MPEG compression method utilizes motion compensation to remove temporal redundancy and the discrete cosine transform followed by quantization to remove spatial redundancy in video images.
The first step, mode selection, divides the incoming image frame into rectangular arrays of 16x16 pixels, called macroblocks. For each macroblock, the encoder then chooses whether to use intra-frame or inter-frame coding. Intra-frame coding uses only the information contained in the current video frame and produces a compressed result called an I-frame. Inter-frame coding can use information in as many as five other frames, occurring before or after the current frame. Compressed results that only use data from previous frames are called P-frames, while those that use data from both before and after the current frame are called B-frames.
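As a small illustration of the macroblock division, the following Python sketch lists the top-left corner of every 16x16 macroblock in a frame. The function name is hypothetical, and the sketch assumes the frame dimensions are exact multiples of 16.

```python
def macroblock_origins(width, height, size=16):
    """Return the (x, y) top-left corner of every macroblock in raster order.

    Assumes width and height are multiples of `size` (an assumption of this
    sketch; frames of other sizes would need padding first).
    """
    return [(x, y) for y in range(0, height, size) for x in range(0, width, size)]

origins = macroblock_origins(352, 288)  # a 352x288 frame
print(len(origins))   # 22 columns * 18 rows = 396 macroblocks
print(origins[:3])    # [(0, 0), (16, 0), (32, 0)]
```

The encoder makes its intra-frame or inter-frame decision once for each of these positions.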
If inter-frame coding has been chosen, the next step is temporal compression. This is accomplished by applying motion estimation and motion compensation to the macroblock. The encoder scans the macroblocks of stored reference frames to find a match. It then encodes the current macroblock as a vector describing the motion of the matching macroblock, rather than encoding the pixel values themselves. This results in a more compact representation of the macroblock.
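The search for a matching block can be sketched as an exhaustive comparison using the sum of absolute differences (SAD) as the match criterion. The standard does not mandate a search strategy or cost measure, so both are assumptions of this sketch, as are the function names and the small 4x4 block size used to keep the demonstration short.

```python
def sad(cur, ref, cx, cy, rx, ry, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (cx, cy) and the n x n block of `ref` at (rx, ry)."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(n) for i in range(n))

def best_motion_vector(cur, ref, cx, cy, n, search):
    """Try every offset within +/- `search` pixels and keep the lowest SAD."""
    h, w = len(ref), len(ref[0])
    best = (0, 0)
    best_cost = sad(cur, ref, cx, cy, cx, cy, n)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            if 0 <= rx <= w - n and 0 <= ry <= h - n:  # stay inside the frame
                cost = sad(cur, ref, cx, cy, rx, ry, n)
                if cost < best_cost:
                    best, best_cost = (dx, dy), cost
    return best, best_cost

# Synthetic test data: a patterned reference frame, and a current frame whose
# content is the reference moved 2 pixels right and 1 pixel down.
W = H = 12
ref = [[(7 * x + 13 * y) % 31 for x in range(W)] for y in range(H)]
cur = [[ref[y - 1][x - 2] if y >= 1 and x >= 2 else 0 for x in range(W)]
       for y in range(H)]

mv, cost = best_motion_vector(cur, ref, cx=4, cy=4, n=4, search=3)
print(mv, cost)  # (-2, -1) 0
```

The vector (-2, -1) tells the decoder where in the reference frame to fetch the block, so two small numbers stand in for sixteen pixel values.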
Following temporal compression, the spatial compression step applies a discrete cosine transform (DCT) to convert the image data into spatial frequency data. The encoder then reads the entries of the spatial frequency array in a zig-zag fashion (see Figure 2), which produces a serial data stream ordered from lowest to highest spatial frequency. The next step is quantization, which is where the actual compression occurs. The frequency components are quantized with a resolution that depends on the frequency. Because the eye is less sensitive to high spatial frequencies, those components can be adequately represented with fewer bits. In many cases, the quantization step truncates frequency components to a zero value. The encoder takes advantage of this by using run-length coding on the data, replacing strings of zero data values with a count of how many zeros follow.
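The chain of steps in this paragraph can be sketched in plain Python, in the order the text gives them: DCT, zig-zag scan, quantization, run-length coding. The quantizer step sizes and the "EOB" end-of-block marker are made-up illustration choices, not values from any MPEG profile, and the DCT is computed directly from its definition rather than with a fast algorithm.

```python
import math

N = 8  # block size used for the transform in this sketch

def dct_2d(block):
    """Orthonormal N x N type-II DCT, computed directly from its definition."""
    def c(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
    out = [[0.0] * N for _ in range(N)]
    for v in range(N):
        for u in range(N):
            s = 0.0
            for y in range(N):
                for x in range(N):
                    s += (block[y][x]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[v][u] = c(u) * c(v) * s
    return out

def zigzag_order(n=N):
    """(row, col) positions ordered from lowest to highest spatial frequency."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda p: (p[0] + p[1],
                                 p[0] if (p[0] + p[1]) % 2 else p[1]))

def quantize(coeffs, base_step=8, slope=2):
    """Coarser steps for later (higher-frequency) entries; steps are made up."""
    return [round(c / (base_step + slope * i)) for i, c in enumerate(coeffs)]

def run_length(values):
    """Encode as (zeros_before, value) pairs followed by an 'EOB' marker."""
    pairs, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    pairs.append("EOB")  # trailing zeros are implied by the end-of-block marker
    return pairs

def encode_block(block):
    coeffs = dct_2d(block)
    scanned = [coeffs[r][c] for r, c in zigzag_order()]
    return run_length(quantize(scanned))

flat = [[16] * N for _ in range(N)]           # a featureless block
ramp = [[10 * x for x in range(N)] for _ in range(N)]  # horizontal gradient
print(encode_block(flat))
print(encode_block(ramp))
```

A featureless block collapses to a single nonzero value, and even the gradient block needs only a handful of pairs instead of 64 numbers.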
Figure 2: The advanced video coding approach employs some prediction within a video frame as well as motion compensation from other frames to form predicted frames, then encodes the differences rather than the absolute values to increase compression efficiency.
Finally, the image data, motion vectors, and other information needed for image reconstruction are combined into a single stream and entropy encoded to obtain the final compressed data stream. Entropy coding maps data bit patterns into code words, replacing frequently occurring bit patterns with short code words. The result is fewer bits needed (on average) to represent the data.
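To illustrate the principle, the sketch below builds a Huffman code, a classic entropy code. It is used here purely as an example of assigning short code words to frequent symbols; it is not a claim about the specific code tables any MPEG profile uses, and the function name and sample data are hypothetical.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix code in which frequent symbols get shorter code words."""
    counts = Counter(symbols)
    if len(counts) == 1:                       # degenerate single-symbol input
        return {next(iter(counts)): "0"}
    # Heap entries: (count, unique tie-breaker, {symbol: code word so far}).
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)        # the two least frequent groups
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

data = "aaaaaaaabbbccd"                        # 'a' dominates the data
code = huffman_code(data)
total_bits = sum(len(code[s]) for s in data)
print(code)
print(total_bits, "bits versus", 2 * len(data), "with fixed 2-bit codes")
```

Here the most common symbol gets a one-bit code word, and the 14 symbols need 23 bits instead of the 28 that fixed-length two-bit codes would require.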
Within these various steps, MPEG-4 allows considerable variation. Depending on the profile, the encoder may or may not be able to use B-frames, may be limited in its choice of entropy encoding, or may be limited in its motion estimation accuracy. Thus, a detailed explanation of MPEG-4 image processing must focus on a specific profile.
In the first step, mode selection, the AVC scheme chooses intra-frame or inter-frame compression as with other MPEG schemes. One of the places it differs is how it encodes I-frames. Rather than encoding each macroblock in isolation, the AVC approach uses information from previously encoded blocks that lie nearby in the image frame. It predicts the current block using these neighboring pixels, using the prediction for the next step rather than the actual value.
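A minimal sketch of prediction from neighboring pixels follows. It uses one simple predictor, the average of the pixels directly above and to the left of the block; this is an assumption made for illustration and not the full set of prediction modes AVC defines. The function names and the 128 fallback value are likewise hypothetical, and the demo reads neighbors from the original frame, whereas a real encoder would read them from its reconstructed frame.

```python
def dc_predict(frame, x0, y0, n=4):
    """Predict an n x n block as the mean of the pixels directly above it
    and directly to its left (whichever of those exist)."""
    neighbors = []
    if y0 > 0:
        neighbors += [frame[y0 - 1][x0 + i] for i in range(n)]
    if x0 > 0:
        neighbors += [frame[y0 + j][x0 - 1] for j in range(n)]
    value = round(sum(neighbors) / len(neighbors)) if neighbors else 128
    return [[value] * n for _ in range(n)]

def residual(frame, pred, x0, y0):
    """Difference between the actual block and its prediction."""
    n = len(pred)
    return [[frame[y0 + j][x0 + i] - pred[j][i] for i in range(n)]
            for j in range(n)]

# An 8x8 frame that brightens slowly from top to bottom.
frame = [[100 + y for _ in range(8)] for y in range(8)]
pred = dc_predict(frame, 4, 4)
print(pred[0][0])                  # 104
print(residual(frame, pred, 4, 4))  # rows of 0, 1, 2, 3
```

The pixel values themselves are around 104 to 107, but the differences from the prediction are only 0 to 3, which is much cheaper to encode.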
In the inter-frame operation, AVC can use as many as five frames in its search for motion estimation. Further, it can work with sub-blocks of the 16x16 macroblock, choosing from 16x16, 8x16, 16x8, 8x8, 8x4, 4x8, or 4x4 sub-blocks. The motion estimation and compensation can be as accurate as 1/4-pixel for luminance (which corresponds to 1/8-pixel for the subsampled chrominance). By comparison, MPEG-2 used only the 16x16 macroblock, a maximum of two search frames (one prior and one later), and 1/2-pixel accuracy. As with the intra-frame step, the result is a predicted block rather than the actual block value.
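The cost side of the smaller sub-blocks is easy to quantify: each sub-block carries its own motion vector. The short sketch below, with hypothetical names, counts the vectors needed when a macroblock is tiled uniformly with each of the sizes listed above.

```python
PARTITIONS = [(16, 16), (8, 16), (16, 8), (8, 8), (8, 4), (4, 8), (4, 4)]

def vectors_per_macroblock(w, h, mb=16):
    """Number of w x h sub-blocks (and so motion vectors) tiling one macroblock."""
    return (mb // w) * (mb // h)

for w, h in PARTITIONS:
    print(f"{w}x{h}: {vectors_per_macroblock(w, h)} motion vector(s)")
```

A single 16x16 block needs one vector while sixteen 4x4 blocks need sixteen, so the encoder has to weigh a better match against the extra vectors it must transmit.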
The fact that all the blocks utilize some form of prediction is one of the key compression steps in the AVC profile. The scheme subtracts the predicted block values from the actual values before transformation to spatial frequencies. With only this error term being transformed, the result is likely to have fewer frequency components that will pass the quantization step, especially with precise sub-pixel motion estimation. Reordering the list of frequency components puts all the zero-value words together for run-length coding. Finally, both the DCT terms and other reconstruction data (motion vectors and such) merge into a bit stream for final entropy coding.
To ensure that the decoder is able to reconstruct the frame based on the prediction values provided, the intra-frame prediction must use as its reference the same macroblocks that the decoder would have available. Thus, the encoder includes a feedback path that reconstructs the macroblocks and supplies the result to the prediction algorithms. The intra-frame coding step uses unfiltered reference macroblocks. The motion estimation step uses full reference frames that have passed through a deblocking filter that helps remove compression artifacts from the image. This feedback path performs the same steps used in the decoder to reproduce the image for display, eliminating the drift experienced in MPEG-2 compression.
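Why the feedback path matters can be shown with a one-dimensional analogy: a sequence of samples coded as quantized differences from the previous sample. This is a simplified model rather than the AVC encoder itself, and the function names, sample values, and quantizer step are hypothetical. With `closed_loop=False` the encoder predicts from the original previous sample, which the decoder never sees; with `closed_loop=True` it predicts from its own reconstruction, exactly what the decoder will have.

```python
def quantize(value, step):
    """Round to the nearest multiple of `step` (this is where information is lost)."""
    return step * round(value / step)

def encode(samples, step, closed_loop):
    """Code each sample as a quantized difference from a reference value."""
    residuals, prev_orig, prev_recon = [], 0, 0
    for s in samples:
        reference = prev_recon if closed_loop else prev_orig
        r = quantize(s - reference, step)
        residuals.append(r)
        prev_orig = s
        prev_recon += r          # mirrors what the decoder will reconstruct
    return residuals

def decode(residuals):
    """The decoder only ever has its own previous reconstruction to build on."""
    out, prev = [], 0
    for r in residuals:
        prev += r
        out.append(prev)
    return out

def max_error(samples, step, closed_loop):
    recon = decode(encode(samples, step, closed_loop))
    return max(abs(a - b) for a, b in zip(samples, recon))

samples = [3, 6, 9, 12, 15, 18, 21, 24]
print(max_error(samples, step=5, closed_loop=False))  # 16: error accumulates
print(max_error(samples, step=5, closed_loop=True))   # 2: error stays bounded
```

Without feedback, each quantization error is added to the last and the reconstruction drifts steadily away from the source; with feedback, every new difference corrects for the error already made.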
The AVC approach is computationally more intense than other MPEG-4 profiles, but it has resulted in much greater compression efficiencies. Industry tests indicate that it uses only 50% of the bit rate of MPEG-2 and achieves a 40% savings over other MPEG-4 profiles for the same image size and perceived quality. As a result, it is likely to become the preferred approach for next-generation digital image distribution, including broadcast, cable, DVDs, and networks.
The MPEG standards will not rest there, however. The committee is actively working on MPEG-7, which provides a way of describing video image content for search and retrieval. Merged with MPEG-4, the standard will allow management and searches of multimedia archives based on content. Beyond that, the MPEG committee has envisioned the MPEG-21 multimedia framework. This framework describes how different elements of media creation, storage, distribution, and playback can combine to provide for interoperable delivery and consumption of multimedia content worldwide.



