News & Analysis

A Next Generation Multiple Processor Architecture for Real-Time DSP

Peter Warnes and Steve Bradshaw

4/1/1998 12:00 AM EST




The authors have had numerous years of experience designing real-time embedded systems and multiple-processor systems based largely on DSP architectures. We believe there are a number of issues that must be addressed when considering suitable multiple processor architectures. We particularly want to address those issues which can limit the performance and utilisation of DSPs within multiple processor architectures for real-time applications.

Having been committed to a single (TMS320C4x Comport based) architectural principle, and having successfully delivered multiple processor DSP systems in a wide variety of (inter-compatible) board level implementations for over six years, it is important to us that we establish "ideas" that will serve us well into the future with different and more powerful processors. It is our intention to introduce a series of modular TMS320C6x based products, offering similar but more powerful capabilities than our TMS320C4x systems, as a means of complementing what we already do.

We need to find an architecture that supports what we want to do today and what our customers will ask us to do tomorrow. It is important for our customers that anything we do will cope with future DSP technology and system requirements so that they can continue to benefit from the clear progression path that we can provide them.

To this end, we must first abstract away from the specific details of particular DSP processors, board level designs, and products, and focus on the real issue: what sort of architectural support is actually needed when attempting to utilise multiple DSP processors efficiently in real-time applications.

There is also a wealth of information within the computer industry, resulting from significant commercial use of, and experimentation with, a wide variety of multiple processor architectures for well over a quarter of a century. The DSP industry is a relatively new player on this block, perhaps only seriously committing to multiple processor architectures as a result of the introduction of some explicit technology support, in the form of the TMS320C40 processor by Texas Instruments.

However, blindly following multiple processor architectural techniques that have worked for general computing problems with no real-time operational constraints can be quite hazardous to a development project. Just because there is familiarity with particular architectures doesn't mean they are appropriate for real-time applications. It is essential that whatever architectures are utilised for supporting real-time DSP related applications are capable of accommodating the needs of these applications.

The whole philosophy of our business is based on deciding what you need to solve a problem, as a concept that is mentally separated from the process of deciding how you will solve that problem. Hence, once we have established what sort of architectural support is needed, we can then move forward to establishing the characteristics of a "preferred" architecture, and ultimately a development strategy that can turn this into reality.

It is even possible to establish a "preferred" processor model that can accommodate the needs of powerful multiple-processor architectures. The development strategy and resulting product implementations are beyond the scope of this paper. However, it is clear that a true test of the "preferred" architecture will not only be related to its performance and flexibility, but whether the principles are as easily applied with different DSP processors (and processor families).


Multi-Processor and/or Multi-I/O Systems

Multiprocessor systems are frequently discussed without consideration of the input and output devices required by the system. The whole purpose of having today's powerful DSP systems is to Process Digital Signals. These digital signals are derived from real world stimuli, and are possibly digitised versions of analog signals, or possibly digital I/Os. Either way there is no point in having a powerful DSP system unless it has a sensible strategy for getting the signals in and out of that system.

As an example, it is not unusual to hear someone who is considering the use of a DSP system provide a specification along the following lines: "I need 14 analog inputs that are digitised to 12 bits at 1 MHz, and I need to process them in real time". This is a very profitable business, as that specification can be satisfied with an empty box! Only when there is some form of output does the system begin to make sense, and the processing load become real.
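
To give a feel for the I/O side of such a specification, the raw input rate can be worked out directly. This is only a sketch of the arithmetic; the function name is ours, and it assumes the 1 MHz figure applies to each of the 14 channels and that samples are packed with no padding.

```python
def aggregate_input_rate(channels, bits_per_sample, sample_rate_hz):
    """Return the raw input data rate in bits per second."""
    return channels * bits_per_sample * sample_rate_hz


# Figures quoted in the example specification above.
rate_bits = aggregate_input_rate(channels=14, bits_per_sample=12,
                                 sample_rate_hz=1_000_000)
print(rate_bits / 1e6, "Mbit/s")       # 168.0 Mbit/s
print(rate_bits / 8 / 1e6, "Mbyte/s")  # 21.0 Mbyte/s if samples are packed
```

The number says nothing about the processing load, which is the point of the example: that only exists once an output is defined.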


Moral—The moral of this section is that the world's most powerful DSP system is useless without suitable, powerful I/O capabilities.

Whether the system has lots of processors and a small amount of I/O, or a small number of processors and lots of I/O, the system can always be considered as a system of nodes that require to communicate with each other in some way. Hence it can be a sensible strategy to consider the new concept of "nodes" where a node can be a processing node, an I/O node or even both.


Inter-Node Communication

Although it might not always be thought of in these terms, once you start to think of your system in terms of connected nodes, there is always communication between those nodes. The communication can consist of data flow, control flow or both. Just as we treat every node as a Processing node, an I/O node or both, we can think of all information flows as communications.

Figure 3: Multi-node system featuring point-to-point and point-to-multipoint (multi-cast) communications


The Justification for Efficient Communication

When we are dealing with "real world" data it does not magically appear correctly formatted at the node, as is often assumed when running academic benchmarks. It is the job of the communications system to format and present the data to the node in that way, so that it can be handled efficiently.

Usually "real world" data from I/O devices is a constant stream of data at some fixed rate, rather than a short high speed burst of data at the full bandwidth of the communications system. It is however not uncommon that data processing involves a block of consecutive data items in order to provide a single result. This is sometimes a swing buffer type processing that moves from one block to the next, or is sometimes a 'sliding window' of data, where when a new data sample is added, the oldest can be discarded. What is common to both of these techniques is that access to stored data samples is required. It is a requirement of the communications system to present the data to a node in a way that makes this possible.
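
The two access patterns described above can be sketched in a few lines. This is an illustrative model only: the class and function names are hypothetical, and a real system would fill the buffers in hardware rather than one sample at a time in software.

```python
from collections import deque


class SwingBuffer:
    """One block fills while the previously completed block is processed."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.filling = []

    def push(self, sample):
        """Add a sample; return a completed block, or None while still filling."""
        self.filling.append(sample)
        if len(self.filling) == self.block_size:
            ready, self.filling = self.filling, []  # swing to a fresh buffer
            return ready
        return None


def sliding_windows(samples, window_size):
    """Yield each full window; the oldest sample is discarded as a new one arrives."""
    window = deque(maxlen=window_size)
    for sample in samples:
        window.append(sample)
        if len(window) == window_size:
            yield list(window)
```

In both cases the consumer needs access to stored samples, not just the most recent one, which is the requirement placed on the communications system.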

Definitely in the swing buffer case, and probably in the sliding buffer case, it is inefficient for a node to be informed about the arrival or departure of each individual data item. It is preferable that the node is informed after a "block" has been transferred. As the block size is increased, the overhead of informing the node becomes negligible.
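
A rough model shows why the notification overhead fades as blocks grow. The cost figures below are arbitrary assumptions chosen for illustration (a fixed cost per notification and a unit cost per data item), not measurements of any real system.

```python
def notification_overhead(block_size, notify_cost=50.0, per_item_cost=1.0):
    """Fraction of total time spent on notification rather than on moving data.

    notify_cost and per_item_cost are hypothetical, in the same arbitrary units.
    """
    transfer_cost = block_size * per_item_cost
    return notify_cost / (notify_cost + transfer_cost)


for block_size in (1, 16, 256, 4096):
    print(block_size, round(notification_overhead(block_size), 4))
```

With these assumed costs, a one-item block spends almost all of its time on notification, while a block of several thousand items spends very little.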

It is possible though that the node requires access to a large amount of stored data, or a small amount, and the communications must not impose a particular block size on that node. Varying block sizes in this context usually offer a trade off between throughput and latency. The size of the storage available at a particular node will also need to be taken into consideration. Each system will have its own constraints for these parameters. So for the communication of "real world" data we can see that we need:

  • Block transfer of data
  • Notification of movement of a block
  • Block size must be variable by the application.

It can be seen that if these are the requirements for processing I/O data, then as the same data is pipelined through a system, the same requirements will apply. We can therefore assert that these requirements are general.

In addition to those requirements we can suggest that a processing node should not be constantly polling and copying data, so if possible some kind of DMA transfer would be desirable. This could be less important at an I/O node as the hardware would be dedicated to transferring the data.

Also from the point of view of a Processing node, the communications system should be simple to use. This means that its use is not just efficient but also that it fits the application requirements rather than imposing constraints on the application.

It has also been assumed all along but not explicitly stated that the communications system should be able to provide a high sustainable throughput at a viable system cost. Now we have explicitly stated those as requirements.

The communications system must also guarantee delivery of data; the application should not have to be concerned about whether the rest of the system is ready to accept data or not. This implies an operation that is self blocking when a receiver is no longer ready to receive.

Figure 4: Multi-node system highlighting blocking and non-blocking communications

It is however possible to imagine situations where the blocking is undesirable, particularly in view of the next requirement.

Figure 5: Blocking and non-blocking options for multi-cast communications

It is often desirable in a system that several nodes can act upon the same data set. This could be achieved by copying data and re-transmitting it, but if the design of the communications allowed this to happen without intervention by the node, it would be a highly desirable feature. If this feature were to be used to transmit data from one source to two sinks, should the transmission be blocked by either of the recipients, by one of them, or by neither? This can only be answered by intimately knowing the details of the application, so none of those schemes should be imposed by the communications system. This leads to having the blocking feature as an option that can be selected as required by the application.
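
One way to picture blocking as a per-recipient option is the sketch below. It is a software model with hypothetical names (`Sink`, `multicast`), and it assumes that a full non-blocking sink discards its oldest block so that what it holds stays current; other policies are equally possible.

```python
from collections import deque


class Sink:
    """A receiving node with a small queue and an application-chosen policy."""

    def __init__(self, capacity, blocking):
        self.queue = deque()
        self.capacity = capacity
        self.blocking = blocking

    def full(self):
        return len(self.queue) >= self.capacity


def multicast(block, sinks):
    """Try to deliver one block to every sink.

    Returns False (the sender must wait) if any blocking sink is full.
    A full non-blocking sink drops its oldest block instead of stalling the sender.
    """
    if any(sink.blocking and sink.full() for sink in sinks):
        return False
    for sink in sinks:
        if sink.full():  # only non-blocking sinks can be full at this point
            sink.queue.popleft()
        sink.queue.append(block)
    return True
```

Here the application, not the communications system, decides which recipients are allowed to hold up the source.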

From the perspective of a real-time system, it is also important that the communications system be able to guarantee bandwidth. Because the communications system does not know a priori what the data rate will be, bandwidth must be allocated statically, or at least pseudo-statically.

Figure 6: Multi-node system using shared memory as communication medium

It should also be the case that the guarantee is independent of the application software that is running on the "system".
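
Static allocation amounts to checking, before the system runs, that the fixed reservations fit within what a link can sustain. The sketch below is only an illustration of that check; the function name, the channel names and the units are assumptions.

```python
def reserve(link_capacity, reservations):
    """Check a static bandwidth plan and return the unallocated headroom.

    reservations maps a (hypothetical) channel name to its fixed share,
    in the same units as link_capacity. Raises ValueError if oversubscribed.
    """
    total = sum(reservations.values())
    if total > link_capacity:
        raise ValueError(
            f"requested {total} exceeds link capacity {link_capacity}"
        )
    return link_capacity - total


headroom = reserve(100, {"camera_in": 40, "display_out": 50})
```

Because the plan is fixed up front, the guarantee does not depend on what the application software happens to be doing at run time.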

Now we can make two lists of requirements, the first a "must have" list:

  • Block transfer of data with notification of its completion
  • Block size must be variable by the application
  • Guaranteed High sustainable throughput
  • The option of having Guaranteed delivery (blocking communications), if required by the application
  • Viable cost
  • Simple to use.

And a second list of "would like" items:

  • Use of DMA at processing nodes
  • Support for multi-casting.

Having produced this list of requirements we should validate them against some recognisable systems:


A Simple Communicating Multi-Node System
Most early experiences with multi-node DSP systems are with simple DSP cards, having a single DSP chip, plugged into a host computer. In this system the host computer is one node, and the DSP chip is another node.

Figure 7: Two-node system (single processor DSP board in host computer)

The host computer is probably used to provide input from the user, and output to the user either via the screen or the file system. This makes this node a Processing and I/O node. If the DSP has some kind of I/O it is usually one or more memory mapped A/Ds and one or more memory mapped D/As. This makes the second node also a Processing and I/O node. We can test our list against this system:

  • Block transfer of data with notification of its completion—There are many examples where this would be required: image capture and storage would be one, transient analysis is another. It is hard to think of an application where this would be a problem, especially in view of the next item.

  • Block size must be variable by the application—In the imaging case the block size could be chosen as a line, or a frame as required by the application. For the transient analysis case a block could be the size of the stored buffer of data. Some applications though may need low latency transfer of single words so it is necessary to let the application choose.

  • Guaranteed High sustainable throughput—Most applications would like high bandwidth, none would suffer because of it. The Guaranteed availability of bandwidth is necessary though, even if the necessary bandwidth is quite low.

  • Guaranteed delivery (blocking communications), if required by the application—Following the imaging example, if images were being sent from a framestore on the DSP node, to the host computer for storage, you would need to guarantee that the blocking occurred so that a whole frame was always received. You could imagine an application where the DSP node was always producing a data stream, and the host updated a display buffer at slow intervals. If the communication was blocked in this case there could be a "stale" buffer of data in the communications system, that is of no use, whereas a non-blocking system would guarantee that the next buffer received was "fresh".

  • Viable cost—It is hard to imagine anybody disagreeing with this one.

  • Simple to use—This is another that it is hard to argue with.

And the second list of "would like" items:

  • Use of DMA at processing nodes—It is clear that while the communication might be occurring over a slow interface, such as ISA, or even a relatively fast one like PCI, the data rates achievable on the bus are unlikely to allow the communication to be synchronised with the incoming data. It is therefore likely that the processing performed on either node will be overlapping with the communication. Hence it is desirable that a DMA like mode is used for the communications, so that the processors do not have to be concerned with keeping the communications happening.

  • Support for multi-casting—In a two node system this is hard to use.


A Multi-Processor System with Single Input and Single Output Nodes
This type of system could just as easily be a control system, a data storage system or many others. For the purposes of this illustration, let's assume a system that enhances images in real time. It has a camera providing input images, and a display for outputting processed images. Perhaps several processors are needed to process the images. Consider a four processor configuration, with each processor responsible for a quadrant of the image.

Figure 8: Multiple processor system with single input and output nodes

If we assume that the image input and output devices do not contain any processing capability, this system has two I/O nodes and four processing nodes.

We can test our list against this system:

  • Block transfer of data with notification of its completion—Almost all image processing algorithms require that multiple pixels from multiple lines are processed, so it makes sense to transfer multiple lines in a block to gain efficiency of the transfer.

  • Block size must be variable by the application—It is likely that the block size used in this system should be the frame size of the image. However, if these images exceeded the amount of storage available on the processing nodes, or the processing latency exceeds the duration of a frame, a sub-image would need to be used for the block size.

  • Guaranteed High sustainable throughput—Video cameras generate a lot of data. In order to support the enhancement of the video data in real time, the communications system must be capable of sustaining a high throughput. Otherwise the performance of the overall system will be severely restricted.

  • Guaranteed delivery (blocking communications), if required by the application—If the processing capability of the system was such that all images can be processed, the communications should be operated in a blocking mode so that data is not lost. If however the processing time could exceed the time taken for the next image to arrive, advantages can be seen of letting the data transmission continue unhindered, so that the next frame processed is current.

  • Viable cost—It is hard to imagine anybody disagreeing with this one.

  • Simple to use—Again this is hard to argue with.

And the second list of "would like" items:

  • Use of DMA at processing nodes—As the processing is going to be performed on either double buffered images, or image strips, there will be new data arriving while the processing is taking place. In this case the use of DMA to prepare the next buffer for processing is the only sensible way of using the system.

  • Support for multi-casting—In this system, the captured "raw" images are presented to the four processing nodes. If the whole image was passed to only one processor, and that processor was responsible for passing the data on to subsequent processors, a copy and re-send operation would be necessary before processing could begin. The processing on subsequent processors would be delayed until the data had been re-sent. This is a clear illustration of the need for multi-casting, so that all four processors can receive all of the data. This means that each processor receives more data than it needs, but if the communications places no overhead on the processor, and the bandwidth of the system is enough to transport all of the data, there is no negative effect. Indeed, the "quadrant processing" technique requires that the data sets overlap so that there are no edge effects when the quadrants are re-assembled for display. The display node could also be interested in receiving the original "raw" image so that it can be displayed for comparison purposes. This would mean that this system used broad-cast (where the data is sent to all nodes), which is a restricted case of multi-cast (where only certain nodes require the data set).
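
The overlapping quadrants mentioned in the multi-casting item can be sketched as follows. The function and its `overlap` parameter are hypothetical; how much overlap is needed depends on the enhancement algorithm, which the article does not specify.

```python
def quadrant_regions(width, height, overlap):
    """Return four (x0, y0, x1, y1) regions with end-exclusive bounds.

    Each quadrant is extended by `overlap` pixels past the centre lines so
    that neighbouring quadrants share a border strip.
    """
    cx, cy = width // 2, height // 2
    left_end = min(width, cx + overlap)
    right_start = max(0, cx - overlap)
    top_end = min(height, cy + overlap)
    bottom_start = max(0, cy - overlap)
    return [
        (0, 0, left_end, top_end),                     # top-left
        (right_start, 0, width, top_end),              # top-right
        (0, bottom_start, left_end, height),           # bottom-left
        (right_start, bottom_start, width, height),    # bottom-right
    ]
```

If every processor receives the whole frame by multi-cast, each one simply picks out its own region, and no node has to copy and forward pixels for its neighbours.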


A Single Processor System with Multiple I/O Nodes
This type of system could just as easily be a control system, a data storage system or many others, but for the purposes of this illustration let's assume a system that is monitoring safety levels of multiple inputs. It has several input nodes, all of which feed data to a common processing node.

Figure 9: Multiple I/O nodes connected to a single DSP node

This processing node could be comparing inputs, or testing each input for a spurious signal. This system has multiple I/O nodes and a single processing node.

We can test our list against this system:

  • Block transfer of data with notification of its completion—It is quite likely that a comparison should take place over a long data set before an alarm is raised, or that a pattern is searched for in a series of samples. This implies processing on a block by block basis.

  • Block size must be variable by the application—It can be seen that the actual block size that would be most efficient for this system is entirely dependent on the algorithm used in the processing.

  • Guaranteed High sustainable throughput—Most applications would like high bandwidth, none would suffer because of it. The Guaranteed availability of bandwidth is necessary though, even if the necessary bandwidth is quite low.

  • Guaranteed delivery (blocking communications), if required by the application—It is quite likely that a monitoring system would want to view every available data item in order to guarantee detection of the fault condition. This would require that the arrival of data is temporarily halted if there is a reason that data cannot be accepted. On the other hand the application might dictate that the data is only tested by taking a snapshot at regular intervals, in which case blocking would provide stale data that would have to be flushed before starting to process a new snapshot.

  • Viable cost—It is hard to imagine anybody disagreeing with this one.

  • Simple to use—Again this is hard to argue with.

And the second list of "would like" items:

  • Use of DMA at processing nodes—A continuous monitoring system would require that new data is being received while the current data is being processed. A snapshot processing system would not necessarily need this feature, but it probably could be used to enhance the system performance.

  • Support for multi-casting—In this system, there is no requirement for multi-casting as there is only one recipient of data.

Having established that the requirements we have identified seem to be valid for common system configurations, we can now start to examine some of the traditional techniques for inter-node communications in multi-node DSP systems.


Shared Memory
Most people's first experiences of multi-node DSP systems use shared memory as the interface between the nodes. This involves using an area of memory that is addressable by all nodes.

Figure 10: Multi-node system using shared memory as communication medium

As a hardware solution this is quite easy to implement, needing some memory chips and an arbitrator that ensures that accesses from the nodes do not clash. The arbitration can use several different techniques: for example, a fixed priority scheme that allows one of the nodes to have priority over the others, or a "round robin" scheme that guarantees that all nodes get a chance to access the memory.

From the system software point of view, there are some difficulties that need to be taken care of, but in a small system they are usually accepted by the programmer. When a communication is sent from one node to the other, first the node must locate an area of memory that is not currently in use by either node, requiring some kind of buffer allocation mechanism. Next the transmitting node must write the data to the memory. The receiving node needs to be informed that the buffer is filled and waiting. Although an interrupt would be the most efficient way of doing this, this is not always provided by shared memory designs. Even if an interrupt is provided, there also needs to be some kind of mailbox to allow the recipient to tell what to do with the contents of the buffer, and where the buffer is stored. This mailbox can be implemented in the shared memory, or a separate address location could be dedicated to this purpose. It can be seen that there is quite a large amount of software management that needs to take place to use this type of system, and it needs to be consistent on all nodes that share the resource.
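
The management steps described above can be modelled in software to show how much bookkeeping surrounds each transfer. This is a toy sketch with hypothetical names: it ignores arbitration and concurrency entirely and simply counts every touch of the shared region.

```python
class SharedMemory:
    """Toy model: a pool of buffers plus one mailbox per node."""

    def __init__(self, num_buffers, num_nodes):
        self.buffers = [None] * num_buffers
        self.in_use = [False] * num_buffers
        self.mailboxes = [[] for _ in range(num_nodes)]
        self.accesses = 0  # every read or write of shared memory is counted

    def send(self, dest, purpose, data):
        # 1. Locate a buffer that is not currently in use.
        for index, busy in enumerate(self.in_use):
            self.accesses += 1
            if not busy:
                break
        else:
            return False  # no free buffer: the sender must retry later
        self.in_use[index] = True
        self.accesses += 1
        # 2. Write the data into the buffer.
        self.buffers[index] = list(data)
        self.accesses += len(data)
        # 3. Post a mailbox entry saying where the buffer is and what it is for.
        self.mailboxes[dest].append((index, purpose))
        self.accesses += 1
        return True

    def receive(self, node):
        self.accesses += 1  # inspect the mailbox
        if not self.mailboxes[node]:
            return None
        index, purpose = self.mailboxes[node].pop(0)
        data = self.buffers[index]
        self.accesses += len(data)
        # 4. Release the buffer for re-use.
        self.in_use[index] = False
        self.accesses += 1
        return purpose, data
```

Even in this simplified model, moving a single word costs several extra accesses for allocation, mailbox handling and release, and every node must follow exactly the same protocol.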

From an efficiency point of view, there are a number of important issues to consider:

  1. The access speed of a buffer depends on the arbitration method, but will always be less than 1/nth of the maximum access speed of the memory devices (n = number of nodes). This is because in any communication both the transmitting and the receiving node require access to the memory in order to complete the communication, and some finite time will be 'lost' in order to arbitrate the accesses.

  2. For any one communication there are many additional accesses required in order to gain allocation of the buffer, flag that it is queued for use, find out its purpose and finally to acknowledge the action is complete and to release the buffer. This makes the use of small buffers very inefficient.

  3. The size of the buffers used in the communication is limited by the size of the shared memory resources, and the maximum number of possible concurrent communications.
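
The 1/n bound in the first item can be put into numbers. This is a sketch only: the arbitration loss is a made-up parameter, since the real figure depends on the arbiter design, and equal sharing between nodes is assumed.

```python
def per_node_bandwidth(memory_bandwidth, nodes, arbitration_loss=0.1):
    """Upper estimate of the bandwidth each node sees on a shared memory.

    arbitration_loss is a hypothetical fraction of memory time spent
    arbitrating rather than transferring data.
    """
    return memory_bandwidth * (1.0 - arbitration_loss) / nodes


for n in (2, 4, 8):
    print(n, per_node_bandwidth(100.0, n))
```

Whatever loss figure is assumed, the share per node shrinks in proportion to the number of nodes, which is why the approach deteriorates as systems grow.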

Let us look at how well it meets our needs for a communications system.

From the "must have" list:

  • Block transfer of data with notification of its completion / Block size must be variable by the application—These are both possible using the shared memory approach, given the correct management software for the system.

  • Guaranteed High sustainable throughput—This is usually acceptable for a small system with a few nodes, but it deteriorates rapidly as more nodes are added. The need to perform multiple accesses in order to "manage" the communications also severely reduces the performance. It can also give rise to situations where the resource is not guaranteed to be available.

  • Guaranteed delivery (blocking communications), if required by the application—This can be implemented given the correct management software.

  • Viable cost—This used to be true of shared memory systems, but as the required system speeds grow it becomes expensive to implement the arbitration. Some shared memory systems on the market today are adding complete extra processors to perform the arbitration. Surely this is a high percentage of the cost of the board, and who programs it? Do you need a second compiler and run time license for it? As more and more nodes are added, or larger communication blocks are used the size of the memory required grows rapidly. Usually the need for fast access and large capacity are mutually exclusive, and useful compromises are expensive.

  • Simple to use—Nobody can say that shared memory is easy to use. Even a small system with two nodes generates a long list of issues that must be handled when managing a communication; just think what it would be like with six or more processors. Imagine having a small error in the management software that occasionally allows an unread communication to get overwritten. How do you locate the cause of this error? If your system seems to run, can you guarantee that the management algorithm is error free in all timing cases?

And a second list of "would like" items:

  • Use of DMA at processing nodes—The main body of the communication could be transferred using DMA, but there is an obvious processing load on the node in order to arbitrate and manage the communications. This would require that the node stop its processing task in order to perform the management tasks required before it can start the new DMA and return to the processing.

  • Support for multi-casting—This is simple using the shared memory approach, probably its strongest point, but its implementation requires even more complex management of the shared memory.

Additional comments:

  1. Shared memory is only practical in a single board system, how could you share memory over multiple boards?

  2. How simple is it for an I/O node to exist in a shared memory system? There would need to be hardware arbitration for the buffers, thus fixing the algorithm used by the system, or an additional processor added to perform this task.


Bus Communication Systems
Sometimes architectures are considered where a shared Bus is used for the communications path. This could be separate boards communicating over a back plane like VME, or multiple nodes on a board communicating over an internal bus.

Figure 11: Multi-node system using shared memory as communication medium

Considering our requirements, the "must have" list:

  • Block transfer of data with notification of its completion / Block size must be variable by the application—These are usual means of bus communications, but require that the node has implemented it this way.

  • Guaranteed High sustainable throughput—This can be the case for a small system, or even in a larger system with a single communication. However, the shared nature of the bus resource means that the bus bandwidth is shared across the whole bussed system. This makes it hard to implement a pipeline of data flow through the system, and makes the transmission time of a given communication very dependent on the status of the rest of the system. As a result, it is possible that the bus resource may no longer be available to a large part of the system, while one part of the system "hogs the bus".

  • Guaranteed delivery (blocking communications), if required by the application—This is part of the bus specification. For example, VME is a fully asynchronous bus, which implements a blocking algorithm, but ISA simply assumes that all of the data it presents will be accepted by the receiving node. A proprietary bus design could choose which to use, but a very complex system would be needed to support the choice of blocking or non-blocking communication by the application.

  • Viable cost—A standard bus is usually inexpensive to implement, as standard interface components are often available. The design of a proprietary bus would have to consider the cost issues of each design decision.

  • Simple to use—A bussed system is quite easy to use, apart from the management of the uncertain delivery times.

And a second list of "would like" items:

  • Use of DMA at processing nodes—This could easily be implemented with most bus systems.

  • Support for multi-casting—Again this feature depends on the bus specification but it is definitely possible.

Additional comments:

  1. Bus based systems are quite common, especially where the nodes are each on a board, and the bus is on a back plane.

  2. Often people will struggle to convert the node interfaces to match a standard bus architecture, simply because it is a standard, despite the inefficiency of such a mapping.


Communication using FIFOs

Figure 12: FIFOs support transfers of data between connected nodes

Sometimes system designers realise that shared memory management algorithms are often attempting to implement a FIFO in software.

Figure 13: Nodes communicating via FIFO buffers "appear" as point-to-point communication channels

The FIFO flags give the information about the availability or non availability of space, or data to read, and the two ends of the FIFO can be read and written simultaneously. Some multi-node implementations have used this technique, where a system has a dedicated FIFO placed between two nodes that need to communicate. Each end is accessible by only one node, so the FIFO offers a point to point connection between the nodes.
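
The behaviour of such a FIFO link can be modelled in a few lines. This is a software sketch with a hypothetical class name; in hardware the full and empty flags are signals from the FIFO device, and the two ends operate concurrently.

```python
from collections import deque


class Fifo:
    """Point-to-point link: one node writes at one end, one node reads at the other."""

    def __init__(self, depth):
        self.depth = depth
        self._words = deque()

    @property
    def empty(self):
        """Flag seen by the reader: nothing to read."""
        return not self._words

    @property
    def full(self):
        """Flag seen by the writer: no space available."""
        return len(self._words) >= self.depth

    def write(self, word):
        """Accept a word unless full; the writer simply waits and retries."""
        if self.full:
            return False
        self._words.append(word)
        return True

    def read(self):
        """Return the oldest word, or None if the FIFO is empty."""
        if self.empty:
            return None
        return self._words.popleft()
```

Compared with the shared memory protocol, there is no buffer allocation, no mailbox and no release step; the flags carry all the state either side needs.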

Judging against the "must have" list:

  • Block transfer of data with notification of its completion / Block size must be variable by the application—These can be achieved if the interface between the node and the FIFO is properly implemented. If it is not, the block size could be governed by the FIFO size.

  • Guaranteed High sustainable throughput—The resources are guaranteed, and the bandwidth is assured by the fact that each connection is a dedicated point to point connection.

  • Guaranteed delivery (blocking communications), if required by the application—The FIFO flags enable this to be implemented as desired.

  • Viable cost—FIFOs are available in a wide variety of speeds and sizes, from a number of different manufacturers, all at low cost. The hardware interface to them is simple to use so is not expensive to manufacture.

  • Simple to use—This interface is probably the simplest interface you could have.

And from the list of "would like" items:

  • Use of DMA at processing nodes—DMA can be used to transfer data to and from the FIFOs, throttled in hardware when the FIFO flags indicate a full or empty FIFO. Alternatively, the flags can be used to trigger a DMA transfer of one complete FIFO's worth of data without the need to re-inspect the FIFO flags.

  • Support for multi-casting—This cannot be achieved without the node copying and re-transmitting the data.

Additional comments:

  1. Against the requirements that we started with this is the best fit yet, but...

  2. In a system of many nodes, which requires connectivity between all combinations of the nodes, the number of FIFOs becomes large.



    Figure 14: TMS320 processors that support the FIFO point-to-point communications model

  3. The FIFO approach is in fact the same approach that Texas Instruments used in their 'C4x DSPs to implement the Comports. This is the basis of the architecture of our 'C4x products.
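
The growth noted in the second comment is easy to quantify for the fully connected case. This is a sketch under stated assumptions: every pair of nodes is directly connected, and each FIFO carries data in one direction only unless `unidirectional` is set to False.

```python
def fifo_count(nodes, unidirectional=True):
    """Number of FIFOs needed to connect every pair of nodes directly."""
    pairs = nodes * (nodes - 1) // 2
    return 2 * pairs if unidirectional else pairs


for n in (2, 4, 8, 16):
    print(n, fifo_count(n))
```

The count grows with the square of the number of nodes, so full connectivity quickly becomes impractical and real systems connect only the nodes that need to talk.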

It is an interesting aside that many early 'C4x products used shared memory and Comports, whereas we use Comports as the only available inter-node communication, even with the host machine, which we have already discussed as being just another node. It is also interesting that we are one of the few vendors that recognise the concept of having a node in the system that provides I/O only, as is demonstrated by our GDIO range of I/O products.

Almost all of those early products on the market that used shared memory have since been converted to Comport-only products, perhaps because of customer pressure, or perhaps because when those companies started to use the Comports they realised what an elegant solution they were.

With our 'C4x products today, we see four common additional needs:

  1. It is possible to require more connections than are available on the processor silicon.

  2. When more connections are required than can be supported by the hardware, it is sometimes "desirable" to use intermediate nodes to forward data.

    N.B. This is an established practice in more "general purpose" multiple processor architectures. However, unless it is used conservatively, the guarantee of bandwidth availability either becomes a "grey area" or is no longer possible at all. Under such circumstances, care must therefore be taken to avoid violating the constraints of a real-time system implementation.

  3. Multi-cast is a requirement of some systems.

  4. The bandwidth is often too low for the application, especially when the node is required to forward data onto a subsequent node.

    Figure 15: Back of envelope calculations for possible high bandwidth applications

It is clear, however, that in the modular multi-node system business that we are in, the Comports are the most elegant solution available today.

We can now propose a possible alternative that has significant advantages over today's systems:

  • The architecture is independent of any particular processor type (unlike the 'C4x comport), so we can use it for our existing 'C4x product range and for our 'C6x product range, where there is nothing like the comport provided by the silicon manufacturer.
  • It can in fact be seen as an advantage that the processor type does not impose a particular communications strategy on us. It means that the philosophy used in the design of these systems today will be re-usable in the future, regardless of the processor format or I/O requirements.


A Novel New Architecture: Why Has Nobody Done It Before?

Figure 16: Pair of nodes communicating via FIFOs connected to "transparent" network

Taking on board what we have been discussing, it is clear that having a FIFO interface at each node is an elegant way of inexpensively and simply connecting to the communications system. What would make this even better would be if all communications could be routed through a single bi-directional FIFO, regardless of the number of nodes in the system and regardless of the ultimate source or destination of the communication. This requires that the other end of the FIFO can have its data routed to any other node in the system.

Figure 17: Pair of nodes communicating via FIFOs connected to ring

If we propose that the communications system takes the input and output of the FIFO and chains them together to form a ring, made of point to point connections, then there is a possibility to reach any other node in the system.

Taken simplistically, this means that each node has to handle the data for all other nodes in the system. This is clearly undesirable, so the communications system should be able to automatically re-route communications that are not destined for receipt by this node directly to its output connection. If the system is clever enough, the data could both be received at the node and be re-transmitted around the ring, providing support for multi-casting.
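
The bypass and multi-cast behaviour just described can be sketched as a simple walk around the ring. This is an illustrative model only: the function name, the integer node numbering, and the assumption that data travels in one direction are all hypothetical, and the authors do not disclose how their hardware actually routes data.

```python
def send_on_ring(num_nodes, source, destinations, payload):
    """Walk a payload around a uni-directional ring (hypothetical model).

    At each node the ring interface either delivers the payload (the node
    is a destination) or passes it straight through without involving the
    node. With several destinations the payload is delivered AND passed
    on, which is the multi-cast case.

    Returns (inboxes, hops): who received the payload, and how many
    point-to-point links it crossed before the last delivery.
    """
    valid = range(num_nodes)
    if source not in valid or any(d not in valid for d in destinations):
        raise ValueError("unknown node id")

    remaining = set(destinations) - {source}
    inboxes = {}
    hops = 0
    node = source
    while remaining:
        node = (node + 1) % num_nodes  # cross one link of the ring
        hops += 1
        if node in remaining:
            inboxes[node] = payload    # deliver into this node's FIFO
            remaining.remove(node)
        # otherwise: bypass, the node itself never sees the data
    return inboxes, hops


# Single destination: node 1 to node 3 on a six-node ring.
print(send_on_ring(6, 1, {3}, "block A"))
# Multi-cast: node 4 to nodes 0 and 2; nodes 5 and 1 are bypassed.
print(send_on_ring(6, 4, {0, 2}, "block B"))
```

The hop count returned here is the quantity that grows with the size of the ring.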

Figure 18: Group of nodes communicating via ring

An issue that needs to be carefully thought out is latency. Too many nodes in a system means that undesirable latency occurs for nodes that are a long way around the ring. In network language, the latency will be directly related to the diameter of the network, where the diameter is the maximum distance between any two nodes in the network.

Hence, there should be some means of reducing the diameter of the network. Probably the neatest solution for this problem is to have a maximum number of nodes in a ring, thus ensuring a fixed maximum latency around that ring, and to have several rings connected via a secondary ring. Then the maximum latency around each ring is deterministic, and more layers of rings can be added if the system becomes large enough to need it.
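
The effect of layering rings on the network diameter can be checked with a few lines of arithmetic. The sketch below rests on several assumptions that are not taken from the article: rings are uni-directional, hop count stands in for latency, each local ring gives up one position to a bridge onto the secondary ring, and the function names are hypothetical.

```python
def flat_ring_worst_hops(nodes):
    """Worst-case hops on one uni-directional ring of `nodes` positions."""
    return max(nodes - 1, 0)


def two_level_worst_hops(ring_size, num_rings):
    """Worst-case hops when `num_rings` local rings share a secondary ring.

    Assumed worst path: up to ring_size - 1 hops to reach the local
    bridge, up to num_rings - 1 hops around the secondary ring, then up
    to ring_size - 1 hops from the far bridge to the destination.
    """
    if num_rings <= 1:
        return flat_ring_worst_hops(ring_size)
    return 2 * (ring_size - 1) + (num_rings - 1)


# Four rings of six positions (five processing nodes plus one bridge each)
# versus a single flat ring holding the same twenty processing nodes.
ring_size, num_rings = 6, 4
processing_nodes = num_rings * (ring_size - 1)
print("flat ring:", flat_ring_worst_hops(processing_nodes), "hops")
print("two-level:", two_level_worst_hops(ring_size, num_rings), "hops")
```

Under these assumptions the flat ring's worst case grows linearly with the node count, while the layered system's worst case grows only with the number of rings once the ring size is fixed.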

If the design is not carefully thought out, clashes can occur between data that this node wishes to transmit and data that this node needs to pass on. There must be some mechanism for ensuring that these separate communications cannot cause deadlock. It is a requirement of a real-time system that they can both proceed at the same time. This allows us to achieve our need for guaranteed bandwidth availability for each data path in the system. Thus it is necessary that at any point on the ring, multiple communications can be occurring simultaneously.

In the case of the ring communication system, it must be possible to pre-allocate "channels" which will then be utilised by the system.

Figure 19: Support for separate concurrent "channels" across communication system

Such a "channel-circuit switched" model guarantees bandwidth (just like two people communicating via a telephone network—they are allocated a connection and guaranteed bandwidth).

The alternative would be a datagram or packet switching network. This definitely does NOT guarantee bandwidth. It actually creates competition for resources and is therefore susceptible to undesirable networking problems such as variable, non-deterministic performance and, in extreme cases, unexpected hot-spot contention. (N.B. Hot-spot contention can occur at any point in a packet switching network. When it does, a cascade of contention will extend across the entire system and grind it to a halt.) A good analogy for the packet switched model is the Internet, which operates by allowing IP packets to be submitted and received. Each IP packet includes an address field that allows it to propagate across the Internet. However, the speed of propagation depends on the load on the Internet at any particular time. Would you want to build a real-time application across the Internet? So why would you want to build one using packet switching? Obviously the key supporting issue is "real-time".

To the application, the communication system should simply be there and work deterministically, so that the application doesn't need to worry about it. Using a pseudo-static allocation of the communications resources at system design time enables the application to simply use the system, and be guaranteed the availability of bandwidth.
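
A pseudo-static, circuit-switched allocation of this kind can be sketched as a simple admission check made at design time. Everything below is an illustrative assumption rather than the authors' mechanism, which they do not disclose: the `RingChannelPlan` class is hypothetical, the ring is treated as uni-directional, and a connection is assumed to hold its channels on every link between source and destination.

```python
class RingChannelPlan:
    """Design-time channel bookkeeping for a ring (hypothetical model)."""

    def __init__(self, num_nodes, channels_per_link):
        self.num_nodes = num_nodes
        # free[i] counts unallocated channels on the link from node i
        # to node (i + 1) % num_nodes.
        self.free = [channels_per_link] * num_nodes

    def _links(self, src, dst):
        """List the links crossed when travelling from src round to dst."""
        links = []
        node = src
        while node != dst:
            links.append(node)
            node = (node + 1) % self.num_nodes
        return links

    def allocate(self, src, dst, channels=1):
        """Reserve `channels` on every link of the path, or reserve nothing.

        Returns True when the path is reserved. Once reserved, no later
        request can take that bandwidth away, which is the guarantee a
        circuit-switched model provides.
        """
        valid = range(self.num_nodes)
        if src not in valid or dst not in valid or src == dst:
            raise ValueError("bad endpoints")
        links = self._links(src, dst)
        if any(self.free[link] < channels for link in links):
            return False  # rejected up front, never degraded at run time
        for link in links:
            self.free[link] -= channels
        return True


plan = RingChannelPlan(num_nodes=6, channels_per_link=6)
print(plan.allocate(0, 3, channels=4))  # reserved on links 0, 1 and 2
print(plan.allocate(1, 2, channels=3))  # refused: only 2 left on link 1
print(plan.allocate(3, 0, channels=6))  # links 3, 4 and 5 are untouched
```

The point of the sketch is that an over-subscribed request fails when the system is configured, instead of showing up later as contention while the application is running.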

We are not going to tell our competitors how we have done it, but we have in development a system with the following features:

  • Synchronous FIFO connection to nodes, offering DMA support and interrupts from flags.

  • Six nodes on an "on board" ring, including the host interface, and the inter-board connection.

  • Up to 264 Mbytes/sec communication, allocated in 44 Mbytes/sec increments (progression path to 400 Mbytes/sec in 66.6 Mbytes/sec increments when FIFO speed allows).

  • Application level programming of routing paths, dynamically configurable during system operation.

  • Application level selection of blocking/non-blocking communication.

  • Application level selection of multi-cast groups.

  • No imposition of block size on the application.

  • "Inter board" ring, connected via 14-way connectors (one for ring input, one for ring output) that supports the same bandwidth as the "on board" ring.

  • Guaranteed maximum latency of three 32-bit words in the "on board" ring, plus three more for the off board ring.

  • Guaranteed deadlock free.

  • Static or dynamic routing of connections, thereafter no routing information is required.

Figure 20: Board-level building block template for ring system
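
The bandwidth figures in the feature list can be cross-checked with simple arithmetic. The helper names below are hypothetical, and two assumptions go beyond the article: that 1 Mbyte means 10^6 bytes, and that the quoted totals are carried as the 32-bit words mentioned in the latency figure.

```python
def channel_count(total_mbytes_per_s, increment_mbytes_per_s):
    """Number of allocation increments that make up the total bandwidth."""
    return round(total_mbytes_per_s / increment_mbytes_per_s)


def words_per_second(mbytes_per_s, word_bytes=4):
    """Word rate implied by a byte rate (assumes 1 Mbyte = 10**6 bytes)."""
    return mbytes_per_s * 1e6 / word_bytes


# Today's figures: 264 Mbytes/sec in 44 Mbytes/sec increments.
print(channel_count(264, 44))      # 6
# Progression path: 400 Mbytes/sec in 66.6 Mbytes/sec increments.
print(channel_count(400, 66.6))    # 6
# Implied word rate for the full ring at 264 Mbytes/sec.
print(words_per_second(264))       # 66000000.0
```

Both generations divide into six increments, which suggests the granularity stays the same and only the per-increment rate rises with FIFO speed.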

We are currently designing commercially available 'C4x and 'C6x systems with this communications system and fully expect to see other major players in the same business follow suit soon! Figure 21 shows a possible ring-based system solution for our original "problem".


Requirements That This Architecture Places on a DSP Chip

Any system built using the architecture described places very few prerequisites on a DSP chip, making it a general purpose communications strategy. There are, however, a few simple features that work well with the architecture and enable even better systems to be made.

It is a big advantage that the described architecture does not really require anything from the DSP chip that is specific to the communications strategy. There is no need for memory arbitration logic, no need for many FIFO connections built into the processor, no unique and new requirements.

However, it is clear that connecting communications in this way requires efficient access to the communication FIFOs. This access should ideally be DMA driven, so the processor architecture should include uncommitted DMA engines. These DMA engines should be able to access the FIFOs without blocking other peripherals that might be accessed in the same way when a communication is blocked.

The use of a single memory bus for connection to this communications system AND the system memory would cause some interaction between the communications and the system memory.

Figure 22: Preferred model for DSP building block for multiple processor systems

This leads us to prefer that the processor has some means to access this communications system, independently of the system memory. This really leads to the requirement of a second memory bus such as on the 'C4x DSP, but it could be implemented just as well using separate input and output connections to the communications system.

The good thing about this communications strategy, however, is that it does not REQUIRE two memory busses, and it could be implemented just as well using a single high performance memory bus such as the one on the 'C6201 DSP.




