Enhancing ARM-based embedded SoC performance in high-bandwidth human-interface applications
Dany Nativel and Jacko Wilbrink, Atmel
11/27/2006 1:00 PM EST
For example, digital cameras now have multi-million pixel sensors with huge bandwidth and memory requirements to process and store the vast amount of data. On the other hand, voice and music require less bandwidth. However, streaming content adds real-time constraints to the communications channel.
While chip vendors have addressed such processing challenges with high throughput CPU cores that have DSP extensions, they have not done enough to accommodate the massive amounts of data that must be transferred between the peripherals, memories, the CPU and any on-chip co-processors.
System developers who are evaluating microcontroller alternatives for their design, or cores for use in their SoC designs, should look beyond raw MIPS. They should instead verify the controller's ability to move massive amounts of data without gobbling up all the CPU cycles.
It is necessary to evaluate closely architectural alternatives that off-load data transfers between the peripheral and the memories with a peripheral DMA controller. Dedicated busses that service on- and off-chip memory, the CPU and any high bandwidth peripherals will also help eliminate the possibility of bus bottlenecks.
Finally, adding multiple external bus interfaces that allow the CPU and on-chip co-processors to process data from external memories simultaneously and in parallel makes it possible for a system developer to take advantage of the full processing potential of advanced cores such as the ARM926EJ-S.
The drawbacks of current SoC designs
Before the advent of data-centric applications, the limiting factor in
most applications was the ability of the CPU to process small amounts
of data quickly. Recent innovations in controller architectures,
particularly the addition of DSP extensions to the instruction set and
much faster clocks, have overcome the processing challenges.
Controllers, such as those based on ARM's 926EJ-S core, can execute a
huge processing load. Unfortunately, communications with on- and
off-chip memories have not kept pace.
Conventional 32-bit processors directly manage all communication links and data transfers. They first load data, received by a peripheral, to one of their internal registers and then store it from this previously loaded register to a scratchpad stored in on-chip SRAM or external SDRAM. The CPU must then process the data and copy it back, through an internal register, to another communication peripheral for transmission. This Load Register (LDR) Store Register (STR) scheme requires at least 80 clock cycles for each byte transferred.
An ARM9 processor, running at 200 MHz with an internal bus at 100 MHz, reaches its limit when a peripheral transfers data at about 20 Mbps, which is not enough to service an SPI or SSC, much less handle 100 Mbps Ethernet transfers (see Figure 1, below).
Figure 1. Traditional ARM Data Transfer Structure
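The 20 Mbps ceiling follows directly from the figures quoted above. The sketch below works through the arithmetic; it assumes the 80 cycles per byte are counted in CPU clock cycles, and the function names are illustrative rather than taken from any vendor tool.

```python
def ldr_str_throughput_bps(cpu_hz: float, cycles_per_byte: float = 80) -> float:
    """Peak data rate when the CPU copies every byte itself (LDR, then STR)."""
    bytes_per_second = cpu_hz / cycles_per_byte
    return bytes_per_second * 8  # convert bytes/s to bits/s


def implied_cycles_per_byte(cpu_hz: float, throughput_bps: float) -> float:
    """Cycles spent per byte, inferred from an observed throughput ceiling."""
    return cpu_hz * 8 / throughput_bps


if __name__ == "__main__":
    # 200 MHz core, 80 cycles per byte (assumed to be CPU cycles)
    print(ldr_str_throughput_bps(200e6) / 1e6, "Mbps")
    # A 4 Mbps ceiling on the same core would imply this many cycles per byte
    print(implied_cycles_per_byte(200e6, 4e6), "cycles per byte")
```

Under the same assumption, a 4 Mbps ceiling on a 200 MHz core would correspond to roughly 400 cycles per byte.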
If the memory management unit (MMU) and the instruction and data caches are disabled, the ARM9 controller is limited to only 4 Mbps, not enough to handle even a high-speed UART. The traditional solution to this severe bandwidth limitation has been to increase the processor clock frequency, also increasing both power consumption and heat dissipation. However, even the highest available frequency may not be sufficient to achieve the required bandwidth of today's applications.
Current applications may integrate high-bandwidth peripherals such
as 100 Mbps Ethernet, 12 Mbps USB, 50 Mbps SPI, a VGA LCD controller
and a 4+ megapixel camera interface. With the advent of these
high-speed peripherals, even a 1 GHz processor does not have enough
processing power to handle all the data transfers.
At 100 Mbps, the CPU does nothing but move data because there simply isn't any processing power left to do anything else. Thus, although processors can easily achieve the computational throughput to execute an application, they are not capable of moving data fast enough. The challenge is no longer computational; it's bandwidth.
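As a rough check on these claims, the sketch below estimates the CPU clock that would be consumed purely by LDR/STR copying for the serial peripherals listed above. It assumes the article's minimum of 80 CPU cycles per byte; the LCD and camera interfaces are left out because no data rate is quoted for them at this point, and the dictionary keys are illustrative labels.

```python
CYCLES_PER_BYTE = 80  # the article's minimum for one LDR/STR byte copy

# Peripheral data rates quoted in the article, in bits per second.
PERIPHERAL_RATES_BPS = {
    "ethernet": 100e6,
    "usb": 12e6,
    "spi": 50e6,
}


def required_cpu_hz(rate_bps: float, cycles_per_byte: float = CYCLES_PER_BYTE) -> float:
    """CPU clock consumed entirely by copying data at the given rate."""
    return rate_bps / 8 * cycles_per_byte


if __name__ == "__main__":
    # Ethernet alone saturates a 1 GHz core under this assumption.
    print(required_cpu_hz(PERIPHERAL_RATES_BPS["ethernet"]) / 1e9, "GHz")
    # All three serial peripherals running at their full rates together.
    total_bps = sum(PERIPHERAL_RATES_BPS.values())
    print(required_cpu_hz(total_bps) / 1e9, "GHz")
```

On these numbers, 100 Mbps Ethernet alone accounts for a full 1 GHz of copy cycles, which is why no headroom remains for the application.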
Manufacturers have tried to solve this problem by adding FIFOs to their on-chip peripherals. Unfortunately, FIFOs do not increase bandwidth; they just lower data transfer peaks by spreading the bus load over time. The archaic LDR/STR processor architecture still requires the CPU to execute each and every one of those byte transfers, robbing it of cycles needed for processing.
A new approach to processor architecture that includes the use of simple, silicon-efficient DMA (Direct Memory Access) inside the individual peripherals and the addition of dedicated busses between high-throughput elements on the chip provides a lower cost, lower power solution to this problem.
The care and feeding of peripheral DMA
The use of DMA is a natural evolution for embedded architectures that
have seen the number of on-chip peripherals and data transfer rates
growing exponentially. DMAs solve part of the problem by allowing
direct peripheral-to-memory transfers without any CPU intervention,
thus saving valuable CPU cycles. DMAs can transfer data using one-tenth
as much bus bandwidth as is required by the processor.
However, conventional DMA controllers are designed primarily for memory-to-memory transfers. Such DMAs offer advanced transfer modes like scatter-gather and linked lists that are very effective for memory-to-memory transfers but are not useful for peripheral-to-memory data transfers. Using them for peripherals adds unnecessary software overhead and complexity to the system design.
A better approach is to use an optimized peripheral DMA between the peripherals and the memory. Peripheral DMAs require 90% less silicon than memory-to-memory DMAs, which makes it cost-effective to implement dedicated DMA channels for each peripheral.
Moving the DMA channel configuration and control registers into the peripheral memory space greatly simplifies the peripheral drivers (Figure 2 below). The application developer needs only to configure the destination buffer in memory and specify the number of transfers. The software overhead is minimal.
Figure 2. Optimized Peripheral to Memory DMA to deal with bus bottlenecks
Each UART or SPI, for example, has two dedicated peripheral DMA controller (PDC) channels, one each for receiving and transmitting data. The user interface of a PDC channel is integrated in the memory space of each peripheral, and contains a 32-bit memory pointer register, a 16-bit transfer count register, a 32-bit register for the next memory pointer, and a 16-bit register for the next transfer count. The peripherals trigger PDC transfers using transmit and receive signals.
When the peripheral receives an external character, it sends a Receive Ready signal to the PDC which then requests access to the system bus. When access is granted, the PDC starts a read of the peripheral Receive Holding Register (RHR) and then triggers a write in the memory. After each transfer, the relevant PDC memory pointer is incremented and the number of transfers left is decremented. When the memory block size is reached, the next block transfer is automatically started or a signal is sent to the peripheral and the transfer stops. The same procedure is followed, in reverse, for transmit transfers.
When the first programmed data block has been transferred, the corresponding peripheral generates an end-of-transfer interrupt. The second block transfer starts automatically, so the ARM processor can process the first block in parallel. This removes the hard real-time constraint of updating the DMA memory pointers from within an interrupt handler, and sustains high-speed data transfers on any peripheral.
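The receive sequence described above can be modeled in a few lines. This is a toy behavioral sketch, not the actual register interface: the class and field names are hypothetical, and the only things taken from the text are the four registers (pointer, counter, next pointer, next counter), the increment and decrement after each transfer, and the automatic chaining to the next block.

```python
from dataclasses import dataclass


@dataclass
class PdcReceiveChannel:
    """Toy model of one receive channel; all names are hypothetical."""
    memory: bytearray
    pointer: int = 0        # 32-bit memory pointer register
    counter: int = 0        # 16-bit transfer count register
    next_pointer: int = 0   # 32-bit next memory pointer register
    next_counter: int = 0   # 16-bit next transfer count register
    end_of_transfer: bool = False  # flag the peripheral raises as an interrupt

    def receive_ready(self, rhr_value: int) -> bool:
        """Handle one Receive Ready request; return False if no buffer is set."""
        if self.counter == 0:
            return False  # transfer stopped: nothing is written to memory
        self.memory[self.pointer] = rhr_value & 0xFF  # read RHR, write memory
        self.pointer += 1
        self.counter -= 1
        if self.counter == 0:
            self.end_of_transfer = True
            if self.next_counter:
                # Chain to the next block without any CPU involvement.
                self.pointer, self.counter = self.next_pointer, self.next_counter
                self.next_pointer = self.next_counter = 0
        return True


if __name__ == "__main__":
    ram = bytearray(8)
    channel = PdcReceiveChannel(ram, pointer=0, counter=4,
                                next_pointer=4, next_counter=4)
    for character in b"ABCDEFGH":
        channel.receive_ready(character)
    print(ram)
```

In this model the CPU only has to refill the next pointer and next counter at some point while the second block is being received, rather than before the very next character arrives.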
It is possible, at any moment, to read the location in memory of the next transfer and the number of remaining transfers. The PDC has dedicated status registers which indicate if the transfer is enabled or disabled for each channel. Control bits enable reading of the pointer and counter registers safely without any risk of their changing between both reads.
The peripheral DMA frees the host CPU to focus on the computational tasks it was designed for without wasting cycles on data transfers. In fact, a peripheral DMA controller (PDC) can be configured to automatically transfer data between the peripherals and memories without any CPU intervention at all. Additionally, the PDC automatically adapts its addressing scheme according to the size of the data being transferred (byte, half word or word).
A PDC integrated in a 10-bit ADC that is configured to operate in 8-bit mode will generate byte transfers and automatically increment its address pointer by 1 after each transfer. In 10-bit mode the same PDC will transfer half words and increment its address pointer by 2.
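The addressing rule in the ADC example can be written out as a small sketch. The function names are illustrative, and the mapping of sample widths above 16 bits to word transfers is an assumption extrapolated from the byte, half word and word sizes mentioned above.

```python
def transfer_size_bytes(sample_bits: int) -> int:
    """Bus transfer size (and pointer increment) for a given sample width."""
    if not 1 <= sample_bits <= 32:
        raise ValueError("sample width must be between 1 and 32 bits")
    if sample_bits <= 8:
        return 1  # byte transfer
    if sample_bits <= 16:
        return 2  # half-word transfer
    return 4      # word transfer (assumed for anything wider than 16 bits)


def buffer_addresses(base: int, sample_bits: int, count: int) -> list:
    """Memory addresses written by the first `count` transfers."""
    step = transfer_size_bytes(sample_bits)
    return [base + i * step for i in range(count)]


if __name__ == "__main__":
    print([hex(a) for a in buffer_addresses(0x1000, 8, 3)])   # ADC in 8-bit mode
    print([hex(a) for a in buffer_addresses(0x1000, 10, 3)])  # ADC in 10-bit mode
```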
Effective use of a multi-layer bus structure
Another problem facing data-intensive applications is on-chip
bus bandwidth. When multiple DMA controllers and the processor push
massive amounts of data over a single bus, the bus can become
overloaded and slow down the entire system. A 32-bit bus clocked at 100
MHz has a maximum data rate of 3.2 billion bits per second (Gbps).
Although that sounds like a lot, in data-intensive applications, there may be so much data that the bus itself becomes a bottleneck. Such is the case with internet radio where audio quality is a direct function of the ability to receive and process streaming content in defined timeslots, or GPS navigation involving interactive vector graphics. This situation can be avoided by providing multiple, parallel on-chip busses and a small amount of on-chip scratchpad memory (see Figure 3, below).
Figure 3. Multiple layered bus structure
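The 3.2 Gbps figure quoted above is simply bus width multiplied by clock rate. The sketch below makes that explicit and shows how the peak scales with parallel busses; it assumes one full-width transfer per clock on every bus, with all busses busy at once, which is an upper bound rather than a sustained rate. The function names are illustrative.

```python
def bus_peak_bps(width_bits: int, clock_hz: float) -> float:
    """Theoretical peak of one bus: one full-width transfer per clock."""
    return width_bits * clock_hz


def aggregate_peak_bps(bus_count: int, width_bits: int = 32,
                       clock_hz: float = 100e6) -> float:
    """Upper bound when every parallel bus is transferring simultaneously."""
    return bus_count * bus_peak_bps(width_bits, clock_hz)


if __name__ == "__main__":
    print(bus_peak_bps(32, 100e6) / 1e9, "Gbps on a single 32-bit, 100 MHz bus")
    print(aggregate_peak_bps(5) / 1e9, "Gbps across five such busses")
```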
External Bus Interface
When an application shares external memory between the
processor and peripherals, the external bus interface limits the
bandwidth. The next step to increase bandwidth is to provide two
parallel external bus interfaces connected to the internal multi-layer
bus: one for system memory and one that supports a high-speed
peripheral or co-processor. In embedded applications with man-machine
interfaces, the required amount of memory is so huge that it is not
cost-effective to put it on the controller.
For example, a 24-bit color VGA panel requires a frame buffer of 900 KBytes. An LCD controller with this much SRAM would be prohibitively expensive, so the frame buffer must be stored in external RAM. The refresh rate is typically 60 frames per second. With a VGA (640x480 pixels) panel in 24-bit true-color mode, the CPU needs to fetch about 7.2 Mbits of data 60 times per second, or roughly 432 megabits per second (Mbps). A conventional 200 MHz ARM9 processor cannot possibly achieve this level of throughput.
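The frame-buffer numbers can be reproduced with a short calculation. In this sketch the function names are illustrative; the exact product comes out slightly above the 432 Mbps quoted above, which appears to follow from rounding 900 KBytes to 7.2 Mbits per frame.

```python
def frame_buffer_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Memory needed to hold one full frame."""
    return width * height * bits_per_pixel // 8


def refresh_rate_bps(width: int, height: int, bits_per_pixel: int,
                     frames_per_second: int) -> int:
    """Bits that must be fetched every second to keep the panel refreshed."""
    return width * height * bits_per_pixel * frames_per_second


if __name__ == "__main__":
    size = frame_buffer_bytes(640, 480, 24)
    print(size, "bytes =", size / 1024, "KBytes")
    print(refresh_rate_bps(640, 480, 24, 60) / 1e6, "Mbps")
```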
Bandwidth can be increased by adding a second external bus interface (EBI) and a 2-D (or other) graphics co-processor (see Figure 4, below). The second EBI is connected to a second external memory that serves as the LCD controller frame buffer. This memory is directly connected to the on-chip 2-D graphics co-processor, which offloads line draw, block transfer, polygon fill, and clipping functions from the CPU. The performance gain achieved from a second external bus interface is application dependent but can be expected to be in the range of 20 to 40%.
Figure 4. Dual External Bus Interfaces
This type of architecture is appropriate for data-intensive applications that have a graphical human-machine interface, such as networked medical monitoring equipment and GPS navigation systems.
This architectural approach off-loads data transfers between the peripherals and memories from the CPU by integrating 18 simple, silicon-efficient, single-cycle peripheral DMA controllers (PDC); five DMA controllers with burst mode support for the USB host, Ethernet MAC, camera interface, LCD controller and 2D graphics controller; and a memory-to-memory DMA controller with burst mode, scatter-gather and linked-list support.
While a conventional ARM9 is overwhelmed by a 20 Mbps data rate, an ARM9 with sufficient peripheral DMA can easily handle the data transfers with 88% of its MIPS available for application execution.
Multi-layer Bus plus Generous on-chip SRAM
Traditional 32-bit processors with a single 100 MHz bus have a maximum
on-chip transfer rate of just 3.2 Gbps to handle all instructions and
all data shifted back and forth between the on- and off-chip memories,
CPU and the peripherals.
Although it sounds like a lot, 3.2 Gbps may not be enough to support the massive amounts of data, intensive processing, and real time requirements of a system with an interactive human interface.
By implementing multiple dedicated busses between the peripherals, processor, data and instruction memories, plus ample on-chip scratchpad SRAM, streaming content can be received and processed in defined timeslots, avoiding bottlenecks that can occur in a single-bus architecture. The SRAM can be partly configured as tightly-coupled data and instruction memory (TCM). Multiple busses provide multiple parallel on-chip data transfer channels, ensuring that a single peripheral does not overwhelm the bus arbiter (see Figure 5, below).
Figure 5. A typical Peripheral DMA enhanced ARM with multiple buses.
A typical eleven-bus ARM9 (see Figure 3, earlier) would have seven busses dedicated to DMA: one each for the DMA controllers of the Ethernet MAC, USB host, camera interface, LCD controller and 2D-graphics co-processor, one for the 2-channel memory-to-memory DMA controller, and one for the 18-channel peripheral DMA controller (PDC).
Other busses might be dedicated to on- and off-chip memory. Two additional busses, one for data and one for instructions, can connect the processor with the tightly coupled memories. Finally, two busses can be used to connect the instruction and data cache controllers to the memories.
Once the memory address and block sizes are configured, the DMAs transfer data automatically. No additional programming is required. When two DMAs and/or the processor access the same memory, an arbiter controls the access using 1) round robin, 2) fixed or 3) default master arbitration schemes, as selected by the programmer.
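To illustrate how the first two arbitration schemes differ, here is a minimal sketch of a fixed-priority and a round-robin arbiter. It is a generic model, not the arbiter of any particular device: the names are hypothetical, master 0 is assumed to have the highest fixed priority, and the default master scheme is not modeled.

```python
from typing import Optional, Sequence


def fixed_priority(requests: Sequence[bool]) -> Optional[int]:
    """Grant the lowest-numbered requesting master (assumed highest priority)."""
    for master, requesting in enumerate(requests):
        if requesting:
            return master
    return None


class RoundRobinArbiter:
    """Grant requesting masters in rotation, starting after the last winner."""

    def __init__(self, master_count: int) -> None:
        self.master_count = master_count
        self.last_granted = master_count - 1  # so master 0 is checked first

    def grant(self, requests: Sequence[bool]) -> Optional[int]:
        for offset in range(1, self.master_count + 1):
            master = (self.last_granted + offset) % self.master_count
            if requests[master]:
                self.last_granted = master
                return master
        return None


if __name__ == "__main__":
    requests = [True, True, False]  # masters 0 and 1 both want the memory
    arbiter = RoundRobinArbiter(3)
    print([arbiter.grant(requests) for _ in range(4)])   # access alternates
    print([fixed_priority(requests) for _ in range(4)])  # master 0 always wins
```

With two masters requesting continuously, round robin shares the memory between them, while fixed priority starves the lower-priority master until the higher one stops requesting.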
The graphics in 2-D man-machine interfaces require nearly a MByte of external memory for the frame buffer alone, plus a 432 Mbps data rate just to refresh a 640 x 480 LCD in 24-bit true-color mode. The required bandwidth is out of reach for conventional ARM9s.
The use of two external buses readily solves this problem: one for the system memory and one for the human interface. The second EBI should have dedicated busses to both the on-chip 2-D graphics co-processor and the LCD controller. This second EBI eliminates the need for the LCD controller and CPU to share memory, and can increase available CPU MIPS by 20% to 40%.
Conclusion
Some ARM-based controller vendors are employing these techniques to
meet the growing need for realtime data stream processing with human
interfaces. A variety of ARM7 and ARM9-based microcontrollers are
available today that allow high data rates and maximum CPU throughput.
Many ARM9s have multiple dedicated busses for the CPU instruction and data cache controllers, as well as for all high-bandwidth peripherals. Depending on the number of on-chip peripherals, ARM9-based MCUs are available today that have between five and eleven independent 32-bit busses, and a maximum on-chip data rate of between 16 Gbps and 41.6 Gbps. Finally, ARM-based controllers with dual external bus interfaces (EBI) can support intensive graphics processing or large data buffers.
These architectural enhancements can enhance the ARM9's performance so that 20 Mbps data transfers that would overwhelm a conventional ARM9 can take place continuously with 88% of the processor's cycles available for application execution. Providing separate memories for the CPU and PDC can increase the processor's available MIPS to 100%.
The combination of an eleven-layer bus, dual EBIs and a peripheral DMA controller allows an ARM9 with an LCD controller to refresh a 640 by 480 VGA screen 60 times a second with 100% of CPU cycles still free for other functions!
This relatively simple, silicon-efficient addition of DMA, busses and external memory interfaces to the microcontroller architecture turns a processor that effectively has no MIPS for application execution into one that can transfer all the data and still have 200 MIPS remaining for application execution.
Jacko Wilbrink is an ARM marketing
manager, and Dany Nativel is the
technical product marketing manager for ARM-based MCUs at Atmel
Corp.








