Design Article
A practical programming model for traffic management
Vinoj Kumar
8/18/2003 9:59 AM EDT
Writing embedded real-time software generally follows the 80/20 rule - you spend the majority of your time writing a small piece of critical real time code. In the networking world, the embedded portion of the application is called data-plane code and the remaining portion is called control-plane code.
Data plane code is small but requires large amounts of computing power. In contrast, control plane code is huge but requires considerably less computing power. Networking application developers spend the majority of the time and effort debugging and optimizing the small piece of the time critical data plane code.
There are several critical design choices a software designer has to make in designing applications that use optimized processing elements such as network processors and "traffic managers". The designer has to carefully balance the requirements of throughput and latency versus application scalability. Also, given the high data rates in a networked environment, multiple engines working in parallel are replacing traditional sequential programming on a single engine. This trend is especially true in network processors.
Network processors and traffic managers typically implement a combination of parallel and pipeline models to achieve high data rates.
Parallel architectures boil down to two fundamental models: replicated parallelism and pipelining. In a replicated parallel model, each processing unit is replicated several times. Each processing unit takes a packet and performs all the functions before picking up another packet for processing. In a pipeline model, each processing unit executes a part of the function.
Whether the model is replicated parallel or pipelined, these types of architectures raise several important issues related to the programming model. Foremost among them - should parallelism be exposed to the application programmer or managed by hardware and the programming language?
If the parallelism is exposed to the software (sometimes referred to as explicit parallelism), the hardware architecture exposes the nuances of parallelism to the software.
One of the challenges in developing code for exposed parallel architectures is writing efficient code that can keep all the processing units busy while avoiding stalls. Even though a high-level language is used, you typically end up analyzing the assembly code and looking at ways to optimize. Subtle changes to the structure of the C code can sometimes greatly affect the generated code.
Expressing and managing concurrency is the other challenge faced by the embedded developer. In these types of systems, you typically don't run embedded operating systems. This means the embedded developer has to deal with several concurrency management issues such as assignment of programs to processing units and management of their coordination. There are three broad areas of concurrency that raises issues to the data plane software developer: thread creation, assigning threads to tasks, and the context switch model.
There are several models of thread creation, some dictated by the underlying hardware architecture. Examples of thread models include many threads mapped to a single virtual processor, one thread per virtual processor, and many threads to many processors. For example, the pthread (POSIX Thread) model supports the system-scoped many-threads to many-processors model. Exposing the thread creation model to the application has a major drawback - applications written to one model are not portable across architectures with a different concurrency model.
A related issue to thread creation is the mechanism of assigning threads to tasks and there are several models that a network processor application can implement. Examples of thread assignment models include thread pools, Pipeline and thread pool per protocol layer.
Once the developer figures out the thread creation model, the context switch model has to be chosen - a thread goes to sleep on a mutex variable, higher priority threads pre-empt lower priority threads or a thread yields to another thread of the same priority. Developers may be constrained by the underlying hardware support for the context switch.
An architectural model that hides parallelism from the application, sometimes referred to as implicit parallelism, on the other hand, offers several advantages. Architecturally, it can yield to interleaved threading as opposed to block threading. In block threading, only one thread executes in the pipeline, and the pipeline has to be flushed on a context switch. In an inter-leaved threading, each pipe stage can be running its own thread, thereby greatly increasing processor efficiency.
From a programming model standpoint, hidden parallelism can offer many advantages. One can design a functional language that requires a programmer to specify "what" to accomplish without specifying the algorithm to accomplish the task. This contrasts to languages designed for systems built for exposed parallelism.
Imperative programming models designed for explicit parallel architectures require a programmer to specify "what" to accomplish and "how" to accomplish a task. In essence, the programmer specifies a sequence of instructions to accomplish the task. And, in general, imperative languages tend to provide explicit access to low-level primitives such as semaphores and mutexes. Such languages also expose hardware architectural elements such as concurrency model to the programmer. If the architecture implements parallelism, a language that hides the parallelism is always easier to program. Otherwise, the programmer has to explicitly map tasks to functional units. By contrast, architecture with implicit parallelism allows a compiler to choose the best level of optimization.
So, if hidden parallelism offers advantages, how does one express abstraction? In other words, a language specification is needed that can express functions without exposing the hardware concurrency model. Informal mechanisms such as natural language programming (NLP) are imprecise and error prone. Formal mechanisms such as Regular Expressions, Finite Automata, Context Free Grammar, and Backus-Naur Form are more precise and can be easily implemented.
Regular Expressions are more suitable for string matching where the match either succeeds or fails. It does not work very well where actions are associated with pattern matches. Context Free Grammars have been used to design languages targeted at specific applications. ErLang is one such example targeted at communications applications.
ErLang is an example of a Context Free Grammar that has many features (concurrent processes, distribution, and scheduling) more commonly associated with an operating system than with a programming language. Implementing sophisticated features of ErLang, such as message passing and memory management, requires an OS kernel running on the processor. Although it does provide a high level of abstraction, ErLang is more suitable to the development of complex distributed systems rather than tight, data intensive data flow applications.
Backus-Naur Form (BNF) has been used successfully to express certain classes of network functions. Unix parse tools such LEX, a popular lexical analysis tool, and YACC, a parser generator, have been used successfully for several years. Both LEX and YACC are based on BNF. A language based on BNF designed for an implicit parallel architecture can offer the right balance between flexibility and hardware abstraction.
A BNF-based language can unambiguously describe classification syntax and is well suited for classification.
With a functional language implemented on an implicit parallel hardware, data plane code can be written as though it runs on a single thread in a uni-processor system. The underlying hardware creates and assigns contexts to packets, runs the program, and manages all the contexts. The program is expressive and concise, which leads to modular application design.
In a sequential model that exposes parallelism, the embedded software developer has to describe how to implement the function and also has to explicitly allocate tasks to threads. In other words, the developer has to implement a task-dispatching loop. Depending on the system architecture, the overhead involved in executing the dispatch loop may not be trivial. Similarly, if multiple threads need to access shared state information (e.g. to run packet or cell policing calculations), these accesses need to be explicitly synchronized to provide correct behavior.
Because a functional language does not specify the order of execution, functional programs are easy to program and less error prone. LEX uses a functional language to describe and parse patterns. Both LEX and YACC, based on functional languages, have been used successfully for several years; demonstrating that a functional language is easier and more effective in expressing classification functions than an imperative language such as C.
There are other benefits to the abstracted functional programming model. For applications that require new features to be added over time, modularity and code maintenance are of utmost importance. Because an abstracted language is more concise and expressive, it naturally reduces the total lines of code, which in some ways leads to faster development time and fewer bugs. Although code line count is certainly not the only measure to quantify development and maintenance costs, it is certainly a key indicator. Writing in assembler or microcode appears to give the software developer the utmost efficiency. Yet, you will run into issues with debugging, modularity, and code maintenance, given the low-level nature of the languages.
Support for modularity is also an important language issue. As is well known in the realm of software engineering, modularity refers to the ability to design and develop self-contained, independent modules with well-defined interfaces. To build modular code in the data plane, an embedded language must support features that allow the composition and communication of independently built modules. Language options that allow such a facility include globally shared registers, and hardware/software event mechanisms to share data, such as port identifier or interface identifier. The managing of explicitly parallel processors is made even more complex in the face of modular programming.
Embedded programmers face a difficult challenge in balancing modularity requirements versus code efficiency. Given the tight space and data rate constraints, a language supported by modularity constructs such as protocol layering, exit function idiom and late binding of names to registers allows these conflicting requirements to be satisfied. A functional language satisfies these requirements.
Vinoj Kumar, Software Architect, and David Sonnier,Director of Product Architecture Agere Systems, Austin, TX


