Systems engineers face a number of challenges when configuring and programming a complex, heterogeneous multiprocessor system of the type often used in digital signal processing applications. These include:
- How to size the system — what compute resources will it require?
- How to check if the system is configured as it should be?
- How to maximize the performance of the algorithms?
- How to move data around the system across widely differing interconnects?
- How to map the application onto the system?
- How to see how the application performs in real time?
- How to rescale the application to a differently sized system without a major rewrite?
- How to migrate the application to new processors and interconnects?
Modern toolchains can help to address these issues, significantly reducing time to solution and, thus, allowing more rapid deployment of the application.
Sizing the System
Typically, the systems engineer has a good idea about what the application core needs to do. The engineer may even have a C code prototype. How can this be used to calculate the compute resources that will be needed to run the application in real time?
One approach is to develop a tool that can model the compute performance of different processors at different clock speeds. Feeding the code snippets into this tool can allow a reasonable estimate to be made of how the algorithm will perform on a PowerPC or an FPGA, for example. In this way, the engineer can readily decide which processor types best match which algorithms, and what processor and/or FPGA resources will be needed to perform the required task in the timeframe allowed, a timeframe that is typically dictated by the incoming data rate.
The more sophisticated the tool, the better the analysis. It should, for instance, model the arithmetic logic unit (ALU), memory, and multiple layers of caches. The outputs can be defined as “costs” (compute resources, power, etc.) which can then be manipulated. Such a utility, which was initially developed by GE Global Research as a research vehicle, has begun to find application in the real world, helping to size systems for proposals.
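The sizing arithmetic described above can be sketched in a few lines. This is only a minimal illustration, not the GE Global Research utility: it ignores caches and memory entirely, and the Processor class, the function names, and every number in it are assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Processor:
    """Hypothetical processor description; all figures are illustrative."""
    name: str
    sustained_ops_per_sec: float  # assumed sustained (not peak) rate
    watts: float                  # assumed power draw per device


def devices_needed(ops_per_sample, samples_per_sec, proc):
    """Devices required to keep up with the incoming data rate."""
    required_ops_per_sec = ops_per_sample * samples_per_sec
    return math.ceil(required_ops_per_sec / proc.sustained_ops_per_sec)


def sizing_costs(ops_per_sample, samples_per_sec, proc):
    """Return the 'costs' of running one kernel on one processor type."""
    count = devices_needed(ops_per_sample, samples_per_sec, proc)
    return {"processor": proc.name, "devices": count,
            "watts": count * proc.watts}


# Illustrative numbers only: a kernel costing 2,000 operations per sample,
# fed at 10 million samples per second.
gpp = Processor("gpp", sustained_ops_per_sec=4e9, watts=20.0)
fpga = Processor("fpga", sustained_ops_per_sec=25e9, watts=15.0)
gpp_cost = sizing_costs(2_000, 10e6, gpp)
fpga_cost = sizing_costs(2_000, 10e6, fpga)
print(gpp_cost)
print(fpga_cost)
```

A real model would derive the sustained rate from the code itself rather than take it as an input, which is where the ALU, memory and cache modeling comes in.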
One recent response to a request for proposal indicated a requirement for 72 FPGAs and 24 dual-core PowerPCs. Sizing such a system manually from a breakdown of the core algorithms would be a gargantuan task. Use of an analysis tool, such as the one described, not only made the sizing process quicker and almost certainly more accurate, but also provided verifiable evidence that the system as proposed was appropriate for the target application.
Is the System Configured As It Should Be?
The development system has arrived, all the boards have been plugged in, and all the interconnect cables have been wired up. Now it is time for the engineer to establish whether all the jumpers are set correctly, boards are in the right slots, interconnect cables are installed correctly, etc. This can prove to be a laborious task without some form of automation. A tool which can probe all parts of the system and display the resultant configuration information graphically can be a boon; without such a tool, it can take a great deal of time to log into each node, extract the configuration and discovery data, and correlate the data across the system. Figure 1 shows how such a tool might look.
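The correlation step that such a tool automates can be sketched as a comparison between the planned configuration and whatever discovery reports. The slot names, setting names and data layout below are hypothetical; a real tool would gather the discovered data by probing each node.

```python
def config_mismatches(expected, discovered):
    """Compare the planned configuration with what discovery reported.

    Both arguments map a slot name to a dict of settings. The slot and
    setting names used in this example are hypothetical.
    """
    problems = []
    for slot, want in expected.items():
        got = discovered.get(slot)
        if got is None:
            problems.append(f"{slot}: no board detected")
            continue
        for key, value in want.items():
            if got.get(key) != value:
                problems.append(
                    f"{slot}: {key} is {got.get(key)!r}, expected {value!r}")
    for slot in discovered.keys() - expected.keys():
        problems.append(f"{slot}: unexpected board detected")
    return sorted(problems)


expected = {
    "slot1": {"board": "gpp", "link_to": "slot2"},
    "slot2": {"board": "fpga", "link_to": "slot1"},
}
discovered = {
    "slot1": {"board": "gpp", "link_to": "slot3"},   # cable follows the misplaced board
    "slot3": {"board": "fpga", "link_to": "slot1"},  # board in the wrong slot
}
report = config_mismatches(expected, discovered)
for line in report:
    print(line)
```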
Maximizing Algorithm Performance
There are multiple facets to this challenge. First, an ideal tool should automatically identify areas for improvement. This may include identifying vector operations that could be replaced by functions from a highly optimized library or IP core suite. For multiprocessor and multi-core systems, it may be able to identify code loops that are candidates for parallelization. Second, it is essential that a highly optimized library exists to leverage, whether the tool applies it automatically or the developer applies it manually. That means replacing math loops with functions that are coded to use the Single Instruction, Multiple Data (SIMD) engines that are commonplace on today’s processors, or using IP blocks to implement common signal processing functions in FPGAs. Third, the tool should provide some method to distribute algorithms across the multiple cores of a modern device.
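The third point, distributing a loop across cores, comes down to splitting the data, running the same kernel on each piece, and combining the partial results. The sketch below shows only that decomposition pattern, with hypothetical function names. In standard Python, threads will not speed up pure arithmetic like this; a production system would hand each chunk to an optimized SIMD library routine on its own core.

```python
from concurrent.futures import ThreadPoolExecutor


def scale_and_sum(chunk, gain):
    """Per-chunk kernel: the loop body that would run on one core."""
    return sum(gain * x for x in chunk)


def split(data, parts):
    """Split data into `parts` contiguous chunks of near-equal length."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def parallel_scale_and_sum(data, gain, workers=4):
    """Run the kernel on each chunk in its own worker, then combine."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(scale_and_sum, chunks, [gain] * len(chunks))
        return sum(partials)


samples = list(range(10))
total = parallel_scale_and_sum(samples, gain=2, workers=4)
print(total)
```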
Moving Data Around the System
A typical digital signal processing system will include multiple, multi-core processors on several boards. The processor mix is likely to include General Purpose Processors (GPPs) and Field Programmable Gate Arrays (FPGAs). Each data path can be different. For example:
- Thread to Thread: shared memory
- Node to Node: PCI, PCI-Express or Serial RapidIO
- Board to Board: StarFabric, Serial RapidIO
Each method has a different mechanism and programming interface, meaning the programmer must understand the low-level hardware details. In addition, rescaling and remapping of the application to the hardware may involve extensive recoding. A robust InterProcessor Communications (IPC) library should make this detail transparent to the developer, although not at the expense of performance.
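One way to picture such a library is a channel object that chooses the interconnect from the placement of the two tasks, so that application code never names a transport. The sketch below is an assumption-laden illustration, not any vendor's API: the Endpoint and Channel classes are hypothetical, one transport is picked arbitrarily for each kind of path, and a Python list stands in for the real hardware buffer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """Location of a task; the field names are hypothetical."""
    board: int
    node: int  # processor index on the board


def select_transport(src, dst):
    """Pick an interconnect from the relative placement of two tasks."""
    if src.board != dst.board:
        return "serial_rapidio"   # board to board
    if src.node != dst.node:
        return "pci_express"      # node to node on one board
    return "shared_memory"        # thread to thread on one node


class Channel:
    """Position-independent channel: callers never name the transport."""

    def __init__(self, src, dst):
        self.transport = select_transport(src, dst)
        self._queue = []  # stand-in for the real transport buffer

    def send(self, payload):
        self._queue.append(payload)

    def receive(self):
        return self._queue.pop(0)


ch = Channel(Endpoint(board=0, node=0), Endpoint(board=1, node=2))
ch.send(b"frame")
print(ch.transport, ch.receive())
```

If a task later moves to another board, only the Endpoint values change; the send and receive calls in the application stay as they are.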
Mapping the Application Onto the System
The high-level system design will have partitioned the application into discrete tasks. Now comes the chore of allocating the tasks to physical compute resources. Traditionally, with a real-time operating system (RTOS), this can require much manipulation of console windows, scripts and typing. For example, to load and run applications on 32 processors can take 224 mouse clicks and 128 keystrokes using typical RTOS tools, whereas an appropriately configured high-level tool can achieve the same result in as few as three clicks, with no typing.
Worse, the placement of these tasks can change the interconnects between nodes, and thus require code changes unless the tool suite allows for positionless communication strategies as previously described. Far better would be a suite that allows for task placement based on a user choice of allocation by processor type, by board type or automatically. Even better than that would be a scheme that allows for task replication by a formula based on, say, the number of boards. Thus, if a configuration is generated with 2xN instantiations of a certain task, where N is the board count, the system could be made to rescale based on a revised board count. If the tools include the self-discovery feature described earlier, this rescaling can be fully automatic. This, tied to a positionless communications scheme, can lead to a highly flexible, scalable configuration tool.
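Replication by formula can be sketched as follows. The task names, the formulas other than the 2xN rule, and the round-robin placement policy are all assumptions chosen for illustration; a real suite would take the board count from self-discovery and offer allocation by processor or board type.

```python
def instance_counts(formulas, boards):
    """Evaluate each task's replication formula for a given board count."""
    return {task: formula(boards) for task, formula in formulas.items()}


def place_round_robin(counts, boards):
    """Assign every task instance to a board, cycling through the boards."""
    placement = {}
    slot = 0
    for task, count in counts.items():
        for i in range(count):
            placement[f"{task}[{i}]"] = slot % boards
            slot += 1
    return placement


# Hypothetical task names; "filter" uses a 2xN rule, N being the board count.
formulas = {
    "ingest": lambda n: 1,
    "filter": lambda n: 2 * n,
    "detect": lambda n: n,
}

small = instance_counts(formulas, boards=2)
large = instance_counts(formulas, boards=4)
placement = place_round_robin(small, boards=2)
print(small)
print(large)
print(placement)
```

Rescaling from two boards to four changes only the board count passed in; the formulas and the application code are untouched.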
Figure 2 shows how a tool might represent tasks as circular objects. The windows above the graphic are used to set the number of instantiations of each task and their allocation to compute resources. The interconnecting lines show data flow paths between the tasks. These paths include simple point-to-point transfers and some more complex data manipulations such as scatter, gather and all-to-all. Once a configuration has been set, a single click generates the source code modules to configure the communications library automatically, potentially saving hours of coding.
Ensuring Real-Time Performance
With a typical RTOS toolchain, there are several utilities available to profile code performance one task at a time. Where the system involves many processors, all interacting in a data-dependent manner, viewing how the system responds in real time can be problematic at best and virtually impossible at worst. A useful toolchain should allow for event profiling across the system with closely controlled time skew between the multiple free-running clocks. Look for a global clock passed between boards across the backplane; this will allow time skew to be controlled to the order of 100 ns. If this is not available and an alternative, coarse-grained mechanism such as NTP is used, the skew can be orders of magnitude larger, reducing the usefulness of the instrumentation. The toolchain should also allow the engineer to evaluate, in real time, processor cycle and memory loading as well as interconnect performance.
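The effect of clock skew on system-wide profiling can be illustrated with a small sketch. It merges per-board event logs onto one timeline and flags neighboring events from different boards that fall within the skew, since their order cannot be trusted. The board names, event times and the 1 ms figure used for a coarse mechanism are illustrative assumptions; only the 100 ns figure comes from the text.

```python
def merge_events(logs, offsets_ns):
    """Put per-board event logs on one timeline.

    logs maps a board name to (local_time_ns, label) pairs; offsets_ns
    maps a board name to the correction added to its local clock.
    """
    timeline = []
    for board, events in logs.items():
        for local_ns, label in events:
            timeline.append((local_ns + offsets_ns[board], board, label))
    return sorted(timeline)


def ambiguous_pairs(timeline, skew_ns):
    """Adjacent events on different boards closer together than the skew.

    Their relative order cannot be trusted at this clock accuracy.
    """
    pairs = []
    for (t0, b0, l0), (t1, b1, l1) in zip(timeline, timeline[1:]):
        if b0 != b1 and t1 - t0 < skew_ns:
            pairs.append((l0, l1))
    return pairs


logs = {
    "board_a": [(1_000, "send"), (51_000, "send")],
    "board_b": [(900, "recv"), (50_700, "recv")],
}
offsets = {"board_a": 0, "board_b": 5_000}
timeline = merge_events(logs, offsets)
print(ambiguous_pairs(timeline, skew_ns=100))        # backplane global clock
print(ambiguous_pairs(timeline, skew_ns=1_000_000))  # assumed coarse skew
```

With these made-up events, which are a few microseconds apart, every cross-board ordering is trustworthy at 100 ns of skew and none is at 1 ms.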
Figure 3 shows a tool that gathers runtime data from the target system and displays the data in an intuitive, easy-to-understand manner. For instance, when the cursor is placed over an interconnect path, a pop-up window lists data about that link: interconnect mechanism, source and destination tasks, and current, average, maximum and minimum throughputs, among others. This view allows an engineer to easily visualize system hotspots for processing and data movement. Armed with this data and the configuration utility, it is quick and easy to change the allocation, and then rebuild and retest.
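The per-link figures in such a pop-up can be maintained with a small running-statistics record. The LinkStats class, its field names, the task names and the sample values below are hypothetical, intended only to show how the current, average, maximum and minimum throughputs might be tracked as samples arrive.

```python
class LinkStats:
    """Running throughput figures for one interconnect path."""

    def __init__(self, mechanism, source, destination):
        self.mechanism = mechanism
        self.source = source
        self.destination = destination
        self.current = None
        self.minimum = None
        self.maximum = None
        self._total = 0.0
        self._count = 0

    def record(self, mbytes_per_sec):
        """Fold one throughput sample into the running figures."""
        self.current = mbytes_per_sec
        self._total += mbytes_per_sec
        self._count += 1
        if self.minimum is None or mbytes_per_sec < self.minimum:
            self.minimum = mbytes_per_sec
        if self.maximum is None or mbytes_per_sec > self.maximum:
            self.maximum = mbytes_per_sec

    @property
    def average(self):
        """Mean of all samples so far, or None before the first sample."""
        return self._total / self._count if self._count else None


link = LinkStats("serial_rapidio", "filter[0]", "detect[0]")
for sample in (400.0, 250.0, 550.0):
    link.record(sample)
print(link.current, link.average, link.minimum, link.maximum)
```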
Rescaling and Migrating
All long-term deployed programs face the same issue: providing ongoing support throughout a lifetime that might extend to ten, fifteen or twenty years. For some programs, it is realistic to make a lifetime buy of components and store those components in an inert environment, assuming the vendor is equipped to deliver this level of support. For others, this is not practical; planned technology insertion must be considered. With the rapid developments in processor silicon and the adoption of new interconnects, this can result in major rewrites of the application unless the software suite can effectively isolate the code from the platform. Standard libraries go some way toward addressing these issues, but generally some higher level of abstraction is needed to fully shield the application from the hardware. If done correctly, both rescaling on the same architecture and re-hosting to a new architecture should become substantially easier.
Closing the Technology Stack Gap
A common goal these days is to enable engineers to focus their attention on the area in which they are domain experts, namely the application. Effort expended on understanding and adapting to the low-level details of the platform can be regarded as wasteful, increasing time and cost to solution. Traditional RTOS toolchains contain some productivity elements, such as Integrated Development Environments. There still remains a gap, however, between the functionality of these offerings and the application domain, a gap now being filled by sophisticated yet easy-to-use software designed to support maximum developer productivity. These suites integrate multiple functions into one environment to shorten the cycle of mapping the application to the system, running it, testing performance, and then remapping and repeating until the desired result is achieved.
Digital signal processing applications are becoming both more commonplace and more complex. At the same time, pressure to minimize development time and maximize in-service life is increasing. Without some form of intervention, these two trends, increasing complexity and shrinking lead times, are mutually incompatible. However, appropriate tools are now becoming available that allow the two to co-exist.