The telecommunications industry’s continuous drive for higher performance has spurred innovations in processor architectures. The general trend has been to go parallel: adding more cores to a single processor device and then dividing tasks between them. This has resulted in a more complex environment for software engineers to master. But does this mean that the programming of next-generation network processors (NPUs) has to be difficult? Not necessarily.
Some of the complexities that emerge from the parallel concepts in modern processors were recently addressed by Sebastien Maury and Dr. Peter Robertson in Embedded Technology on March 1, 2010 [1]. Among their conclusions are that processors should be kept simple and that shared memory should generally be avoided because of the complexity of memory consistency and cache coherence. We agree with this view when it comes to special-purpose processors. In this article, we provide an example of an architecture that has taken a different approach to parallelism, one that keeps the uni-processor programming model intact and utilizes resources very effectively.
When the Task is Hard
The processing demands on modern NPUs are very high. In an application designed for 100 Gbps processing, the NPU must be able to handle roughly 150 million packets per second. In such an application, thousands of packets are typically being processed concurrently by the device. The amount of parallelism is extreme compared to any other application in the IT industry. Another unique attribute of NPUs is the demand for extremely high table memory lookup rates.
In packet processing, every network service (user and control/OAM traffic) requires a unique set of operations per packet (classification, filtering, counting, metering, policing/shaping, and forwarding). A packet may require hundreds or even thousands of operations before it is eventually forwarded to an outgoing interface or to the control CPU.
With the networking industry’s unique set of performance requirements, next-generation NPUs are designed to solve very specialized problems. They don’t compete with general-purpose CPUs, but offer a programmable alternative to in-house developed fixed-function ASICs.
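The packet rate quoted above can be checked with a short calculation. The sketch below assumes the worst case of back-to-back minimum-size Ethernet frames (64 bytes, plus 8 bytes of preamble and a 12-byte interframe gap); the article does not state how its figure was derived, so the framing constants and function names here are illustrative assumptions.

```python
# Assumed framing constants (standard Ethernet; not stated in the article).
MIN_FRAME_BYTES = 64        # minimum Ethernet frame, including FCS
PREAMBLE_BYTES = 8          # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12   # mandatory idle time between frames


def packets_per_second(line_rate_bps: float, frame_bytes: int = MIN_FRAME_BYTES) -> float:
    """Packet rate when every frame on the wire has the given size."""
    bits_on_wire = (frame_bytes + PREAMBLE_BYTES + INTERFRAME_GAP_BYTES) * 8
    return line_rate_bps / bits_on_wire


def time_budget_ns(line_rate_bps: float, frame_bytes: int = MIN_FRAME_BYTES) -> float:
    """Interval between back-to-back frame arrivals, in nanoseconds."""
    return 1e9 / packets_per_second(line_rate_bps, frame_bytes)


if __name__ == "__main__":
    rate = packets_per_second(100e9)
    budget = time_budget_ns(100e9)
    print(f"{rate / 1e6:.1f} Mpps, {budget:.2f} ns per packet")
    # prints "148.8 Mpps, 6.72 ns per packet"
```

Under these assumptions a new packet arrives every 6.72 ns, which is why a device at this rate has to keep thousands of packets in flight at once rather than finish one before starting the next.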
The Shift Towards NPUs
There is currently a shift toward merchant silicon in high-end networking, mainly driven by the ability to shorten time-to-market and to focus research and development (R&D) expenses on differentiation through software rather than on riskier ASIC designs.
As ASIC designs are shifted out in favor of NPUs, some R&D managers raise concerns regarding the complexity of these devices. Are they difficult to program? Can performance and an intuitive uni-processor programming model be combined?
In 2004, Larry Huston of Intel (at the time a main player in the NPU space) ended a paper with a statement that carries as much meaning today as it did six years ago [2]:
“The ideal scenario would have a programmer write an application as a single piece of software and the tools would automatically partition and map the application to the set of parallel resources. This may be a difficult goal, but any steps in that direction will improve the life of a developer.”
The programming model of any processor should mirror well-known programming concepts in order to limit the engineering effort required. A software engineer wants to:
- Use the uni-processor programming model as it is easy to use and debug.
- Utilize the processing and memory lookup resources as effectively as possible.
- Ensure data types are identically defined between different processing parts of the system (e.g., between the controlling host CPU and the NPU).
How NPU vendors go about achieving these goals differs, but there are two main methods: by the design of the architecture, or by software tools.
Tools Don’t Address the Real Problem
NPU vendors often provide a comprehensive software development environment for their clients. There are tools and libraries to ease boot and configuration as well as to provide a simplified abstraction of hardware resources. This aims to enhance the utilization of critical resources.
In addition, software tools can hide the complexity associated with multithreaded programming from the software engineer, making the processor look like a single sequential processor. While this is generally a good thing, there is one caveat. When the programmer starts to run out of resources, he or she must look into the details of how the architecture actually carries out the tasks given by the program. The programmer must then hand-optimize the code in order to achieve the performance and functional targets; optimizing multithreaded programs is indeed difficult.
Where Programmers Spend Their Time
Most software engineers working on line card designs tend to spend most of their time in an endless iteration of performance tuning, as illustrated in Figure 1. This struggle often stems from the need to:
- Free up resources to enable additional features and network processing tasks.
- Obtain the right level of performance.
Wirespeed performance, or the ability to process all incoming traffic at the maximum speed of the interfaces, is a key design goal in datacom equipment designed for the carrier market. For NPU architectures based on multi-stage parallelism, performance depends on load and traffic type, forcing the programmer to spend most of his or her time and effort on testing and re-engineering the code to achieve the performance targets for at least the most probable and common traffic scenarios. This is a difficult task, as is simply determining what traffic will dominate. Consider the recent evolution of the Internet: peer-to-peer traffic and video streaming have disrupted the traffic patterns in broadband networks. What is next? This is why most equipment vendors strive to achieve wirespeed processing for all types of traffic.
To address this never-ending performance optimization process, a fully deterministic architecture for the NPU market has been developed. It is referred to as the dataflow architecture.
The dataflow architecture takes a unique approach to parallelism and simplifies the programmability of next-generation network processors. It features a single pipeline of processor cores. Figure 2 provides an overview of its main components, including:
- Packet Instruction Set Computer (PISC) Processor Core: Power and area-efficient processor cores specifically designed for packet processing. The programmable pipeline consists of hundreds of identical processor cores, enabling software rearrangements and a simple uni-processor programming model.
- Engine Access Point (EAP): Specialized I/O units for table lookups, metering, policing, and filtering. EAPs unify access to tables stored in embedded or external memory, and also provide access to hardware acceleration engines (e.g., for hashing).
- Packet Execution Context: Packet-specific data available to the programmer; the execution context is uniquely associated with each packet and follows the packet through the pipeline.
Packets navigate through the pipeline in a first-in-first-out (FIFO) manner, shifting one stage ahead in each clock cycle. Every instruction can execute as many as five operations in parallel in a Very Long Instruction Word (VLIW) fashion before continuing to the next processor core or EAP. Instruction memory for the data plane program is located in each processor core. This eliminates having to retrieve instructions from a shared memory during program execution, thereby avoiding any conflicts while optimizing performance and power dissipation.
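The pipeline behavior described above can be illustrated with a small software model. This is a minimal sketch, not the vendor's implementation: the `Stage` class, the `run_pipeline` function, the dictionary used as execution context, and the two example operations are all hypothetical names introduced for illustration, and EAPs are left out. Only the FIFO shift of one stage per clock cycle and the limit of five operations per instruction come from the text.

```python
from collections import deque
from typing import Callable, Optional

MAX_OPS_PER_INSTRUCTION = 5  # VLIW width quoted in the article

Context = dict  # hypothetical stand-in for the packet execution context
Operation = Callable[[Context], None]


class Stage:
    """One processor core holding its own instruction (a bundle of operations)."""

    def __init__(self, operations: list[Operation]) -> None:
        if len(operations) > MAX_OPS_PER_INSTRUCTION:
            raise ValueError("instruction exceeds the VLIW width")
        self.operations = operations

    def execute(self, ctx: Context) -> None:
        # Hardware would issue these in parallel; this model runs them in order.
        for op in self.operations:
            op(ctx)


def run_pipeline(stages: list[Stage], packets: list[Context]) -> list[tuple[int, Context]]:
    """Shift packets one stage per clock cycle; return (exit_cycle, context) pairs."""
    slots: list[Optional[Context]] = [None] * len(stages)
    pending = deque(packets)
    finished: list[tuple[int, Context]] = []
    cycle = 0
    while pending or any(ctx is not None for ctx in slots):
        cycle += 1
        incoming = pending.popleft() if pending else None
        slots = [incoming] + slots[:-1]  # every packet advances one stage
        for stage, ctx in zip(stages, slots):
            if ctx is not None:
                stage.execute(ctx)
        if slots[-1] is not None:  # last stage has executed: packet leaves
            finished.append((cycle, slots[-1]))
            slots[-1] = None
    return finished


def decrement_ttl(ctx: Context) -> None:
    ctx["ttl"] -= 1


def count_stage(ctx: Context) -> None:
    ctx["stages_visited"] = ctx.get("stages_visited", 0) + 1


if __name__ == "__main__":
    stages = [Stage([decrement_ttl, count_stage]), Stage([count_stage]), Stage([count_stage])]
    packets = [{"id": i, "ttl": 64} for i in range(4)]
    results = run_pipeline(stages, packets)
    print([cycle for cycle, _ in results])  # prints [3, 4, 5, 6]
```

In this model every packet spends exactly as many cycles in the pipeline as there are stages, and one packet leaves per cycle once the pipeline is full, regardless of what the packets contain. That fixed latency and throughput is the sense in which the behavior is deterministic.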
This architecture’s model enables programmers to write sequential modules and avoid memory consistency, coherence, synchronization, and other hassles of multi-processor programming. In addition, the dataflow architecture and its programming model enforce wirespeed operation, providing every network service and type of packet a guaranteed number of operations and classification resources.
The Programmer in the Driving Seat
The dataflow architecture allows the programmer to use the well-known uni-processor programming model. That is, a single sequence of operations can be used to implement a network service, which is the intuitive model familiar to all programmers. Previous experience with NPU architectures is not necessary.
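As a rough illustration of what "a single sequence of operations" means in practice, the sketch below writes a toy network service as an ordered list of steps applied to one packet's context. Everything in it is a hypothetical example rather than an actual NPU toolchain API: the `PacketContext` fields, the step functions, the port numbers used for classification and filtering, and the egress names are assumptions chosen for readability.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PacketContext:
    """Hypothetical per-packet state that travels with the packet."""
    dst_port: int
    length: int
    service_class: str = ""
    dropped: bool = False
    egress: str = ""


Counters = dict[str, int]
Step = Callable[[PacketContext, Counters], None]


def classify(ctx: PacketContext, counters: Counters) -> None:
    # Assumed rule: BGP (TCP port 179) is control traffic, the rest is user traffic.
    ctx.service_class = "control" if ctx.dst_port == 179 else "user"


def filter_packet(ctx: PacketContext, counters: Counters) -> None:
    # Assumed rule: drop Telnet (TCP port 23).
    ctx.dropped = ctx.dst_port == 23


def count_bytes(ctx: PacketContext, counters: Counters) -> None:
    counters[ctx.service_class] = counters.get(ctx.service_class, 0) + ctx.length


def forward(ctx: PacketContext, counters: Counters) -> None:
    ctx.egress = "cpu" if ctx.service_class == "control" else "port0"


# The whole service is one ordered sequence, written as if for a single processor.
SERVICE: list[Step] = [classify, filter_packet, count_bytes, forward]


def process(ctx: PacketContext, counters: Counters) -> PacketContext:
    for step in SERVICE:
        step(ctx, counters)
        if ctx.dropped:
            break  # a dropped packet skips the remaining steps in this model
    return ctx


if __name__ == "__main__":
    counters: Counters = {}
    for pkt in (PacketContext(179, 100), PacketContext(80, 200), PacketContext(23, 64)):
        print(process(pkt, counters))
    print(counters)  # prints {'control': 100, 'user': 200}
```

The point of the sketch is that the program contains no threads, locks, or explicit partitioning. How such a sequence would be mapped onto the pipeline's cores is a matter for the architecture and its tools and is not modeled here.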
By eliminating the need for performance optimization and ensuring high utilization of the available processing resources, the dataflow architecture provides a new paradigm for NPU programming. The industry benchmark for a Carrier Ethernet data plane development project is a team of 10+ people working for 10 to 18 months. Basing the R&D on the dataflow architecture can reduce this substantially: experience shows that a well-managed project requires two dedicated, skilled engineers for five to seven months. R&D productivity increases roughly ten-fold, and time-to-market can be shortened by several months.
Parallel concepts in the semiconductor industry have made it more difficult to master performance; however, there are architectural approaches that simplify the programming model. The dataflow architecture designed for high-capacity network processing achieves this by providing the uni-processor programming model and guaranteed wirespeed processing.
This article was written by Per Lembre, Director of Product Marketing, and Håkan Zeffer, Senior Systems Architect, Xelerated (Stockholm, Sweden). For more information, contact Mr. Lembre at
- Tenth International Symposium on High-Performance Computer Architecture