Multi-core processing is a "disruptive technology", transforming the way embedded systems are architected, developed, and debugged. With greatly improved performance and lower power consumption, multi-core processors have caught the attention of designers who don't think twice about putting two, four, or even eight processor cores into a system. But many software developers are playing catch-up, working hard to quickly parallelize code. They are finding that traditional debug methods aren't sufficient to profile the complex interactions between concurrently running tasks. "In a system with interacting applications running simultaneously on multiple processor cores, breakpoints have reduced applicability as a tool for understanding system behavior," says Rob McCammon, Director of Advanced Technology Planning at Wind River Systems.
While the embedded world has been implementing multi-processing for the past decade, this has mostly been asymmetric multiprocessing (AMP): heterogeneous cores running separate workloads. The challenge there is to make the cores work together, even though they typically run different application software and may each have their own operating system, compiler, and so on.
Symmetric multiprocessing (SMP) is coming of age in embedded with the recent availability of processors with multiple, homogeneous cores on the same die. CPUs like the Intel Core 2 Duo and Quad processors allow any core to work on any task, since the cores share system memory. The challenge is to split up a workload so all the cores are busy without stepping on each other's toes. "In many cases, embedded and real-time code doesn't need much modification for SMP; it's generally multithreaded. If good programming practices have been followed, porting your code can be fairly straightforward, particularly if you're using a mature SMP OS like QNX Neutrino," says Bill Graham, Development Tools Product Line Manager, QNX Software Systems.
Wind River and QNX Software Systems are members of the Intel Communications Alliance (ICA), a community of communications and embedded developers and solutions providers committed to the development of modular, standards-based solutions on Intel technologies. Wind River and QNX provide operating systems and tools for multi-core software developers.
Visualize Multi-Core Behavior
Hardware breakpoints are great for debugging serial code. Stop the system, look at the stack, read the registers and verify the computation is correct. But understanding multi-core behavior is beyond the reach of traditional debug and profiling tools, which primarily detect problems within single programs. With conventional tools, multi-core system developers have to gather information separately from each core and then somehow combine the information for analysis.
Multi-core tools give developers greater insight into the dynamic interactions between cores, typically using visualization. This provides a powerful picture of behavior because it's easier to see workload balance, contention, and race conditions.
To take advantage of the true hardware parallelism offered by multi-core processors, software developers have to split up their computations into separate threads that run concurrently. Task-parallel threads are independent code sequences that can run simultaneously. Data-parallel threads are code sequences that perform the same computation over and over on a set of data, such as dimming all of the pixels on an LCD display. When an operation is divided into threads and then distributed over multiple cores, it normally executes much faster than when it runs sequentially on one core.
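The pixel-dimming case can be sketched in a few lines. This is a minimal illustration in Python, not code from any of the tools discussed here: the function names, the chunking scheme, and the worker count of four are all assumptions. Note also that standard CPython threads do not execute Python bytecode on several cores at once, so the sketch shows how a data-parallel job is partitioned rather than the speedup a native multi-core implementation would see.

```python
from concurrent.futures import ThreadPoolExecutor

def dim_chunk(pixels, factor):
    """Scale every brightness value in one chunk; chunks share no data."""
    return [int(p * factor) for p in pixels]

def dim_parallel(pixels, factor, workers=4):
    """Split the frame into contiguous chunks and dim them concurrently."""
    size = max(1, -(-len(pixels) // workers))  # ceiling division, at least 1
    chunks = [pixels[i:i + size] for i in range(0, len(pixels), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in chunk order, so the frame reassembles correctly
        dimmed = pool.map(dim_chunk, chunks, [factor] * len(chunks))
        return [p for chunk in dimmed for p in chunk]

frame = [100, 200, 50, 255, 0, 10, 20, 30]   # hypothetical brightness values
half_bright = dim_parallel(frame, 0.5)
```

Because each chunk is independent, no locking is needed; that independence is what makes a computation a good candidate for data-parallel threads.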
Workload Balance
The most fundamental benefit from using visual multi-core tools is understanding how the workload is distributed among the cores. Normally, developers want to balance the workload to maximize the compute performance of the system – all cores pulling their own weight. Unique situations may warrant keeping a core lightly loaded so it responds faster to time-critical code, like interrupt servicing. For either case, it's important to have a tool that clearly displays core usage and shows which threads are running on each core.
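The arithmetic behind a core-usage display is simple to sketch. The example below assumes a hypothetical trace format, a list of (core, thread, start_ms, end_ms) run intervals that do not overlap on any one core; real profilers use their own formats, and every thread name here other than netlogger is invented for illustration.

```python
from collections import defaultdict

def core_utilization(intervals, window_ms):
    """Return {core: busy fraction} from (core, thread, start_ms, end_ms) records."""
    busy = defaultdict(float)
    for core, _thread, start, end in intervals:
        busy[core] += end - start          # time this core spent running a thread
    return {core: busy[core] / window_ms for core in sorted(busy)}

# Hypothetical 22 ms capture on a four-core system
trace = [
    (0, "netlogger", 0, 6), (0, "ui", 6, 20),
    (1, "codec", 0, 11),
    (2, "netlogger", 8, 10),
    (3, "codec", 11, 22),
]
load = core_utilization(trace, window_ms=22)
least_loaded = min(load, key=load.get)     # candidate core for time-critical work
```

The same table answers both questions raised above: whether the load is balanced, and which core has headroom for time-critical code.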
In Figure 1, the right screenshot shows the QNX Momentics System Profiler displaying the CPU usage of four cores executing an application over a 22-millisecond time interval. This macro-level perspective helps identify opportunities to balance the workload across the system. The left screenshot displays the names of different threads and how much time they spend running on each core. Four threads are highlighted in the bottom half of the screenshot, and the CPU cycles they consume are shown in the top half.
Thread migrations are shown in Figure 2, along with a migration histogram. In conjunction with Figure 1, one can detect whether a thread is efficiently taking advantage of the CPU cycles of all the cores or is perhaps incessantly thrashing among them. Again, this type of visual information helps developers control utilization of key system resources.
One thread, netlogger, runs on all four cores. This could degrade performance if the thread migrates excessively, because each migration adds context-switching overhead and reduces the effectiveness of the CPU cache. On the other hand, thread migration can be desirable because it allows dynamic load balancing. Developers may decide to control the execution of this thread by anchoring it to a single core, eliminating its ability to migrate. Furthermore, the CPU loading information shows which core has the lowest usage, indicating an opportunity to force a time-critical thread to run on a less-loaded core.
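A migration count of the kind a histogram displays can be derived from scheduling events. The sketch below assumes a hypothetical, time-ordered list of (time_ms, thread, core) records; the data and the "worker" thread name are made up for illustration and are not taken from the figures.

```python
from collections import Counter

def count_migrations(schedule_events):
    """Count core changes per thread from time-ordered (time_ms, thread, core) events."""
    last_core = {}
    migrations = Counter()
    for _time, thread, core in schedule_events:
        if thread in last_core and last_core[thread] != core:
            migrations[thread] += 1        # thread resumed on a different core
        last_core[thread] = core
    return migrations

events = [
    (0, "netlogger", 0), (2, "worker", 1), (5, "netlogger", 2),
    (9, "netlogger", 3), (12, "worker", 1), (15, "netlogger", 1),
]
histogram = count_migrations(events)
```

A thread with a high count relative to its run time is a candidate for anchoring to one core; a thread with a count of zero is already staying put.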
Trap the Right Data
To provide deep insight into the system, tools can gather a massive amount of system information, including interrupts, kernel calls, scheduling events, thread states, and much more. "Developers want a rich picture of the behavior of the system ... critical information may result anytime the operating system or application does something interesting. Visibility into when a task starts, stops, or switches, or when a user defined event, such as a loop counter reaching a specified value, occurs is invaluable," says Rob McCammon.
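To make the idea of task and user-defined events concrete, here is a minimal in-memory event log in Python. It is only a sketch: the EventTrace class, the event names, and the watched loop value of 42 are assumptions for illustration, and commercial tools capture this kind of data through kernel instrumentation rather than application code.

```python
import threading
import time

class EventTrace:
    """Thread-safe, in-memory log of (timestamp, thread name, event) records."""

    def __init__(self):
        self._lock = threading.Lock()
        self.records = []

    def log(self, event):
        stamp = time.perf_counter()
        with self._lock:                   # keep appends from different threads ordered
            self.records.append((stamp, threading.current_thread().name, event))

def worker(trace, iterations, watch_value):
    trace.log("task_start")
    for i in range(iterations):
        if i == watch_value:               # user-defined event: loop counter hit a value
            trace.log(f"loop_counter=={watch_value}")
    trace.log("task_stop")

trace = EventTrace()
threads = [threading.Thread(target=worker, args=(trace, 100, 42), name=f"task{n}")
           for n in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Sorting or filtering trace.records by thread name gives the per-task timeline that a visual tool would draw.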
In Figure 3, you can see examples of the richness and the amount of data that tools can provide. Multiple windows allow the developer to zoom in on the behavior of multiple processes, threads, and functions. The visualization of each thread or function includes graphical symbols representing the timing of a wide range of events, from semaphores to user-defined event flags. The top panel in each view represents the total runtime over which data was collected, while the lower panel shows a zoomed-in view centered on a specific point in time. The horizontal, dashed red line in each view indicates the same point in time within each of the two views. The ability to move through time, zoom in, and zoom out easily is a key usability aid for the developer.
Some tools, like the Wind River ProfileScope, provide hierarchical profiling, allowing the developer to drill into nested functions and find the ones responsible for using the most CPU cycles. This is another way to zoom in on the right data when addressing multi-core software performance issues.
Visual tools can help the developer identify which code optimizations will yield the greatest performance improvement on the multi-core system. But often, code redesign or optimization isn't an option, particularly if a large amount of poorly synchronized legacy code has to be ported in short order. To ensure proper behavior on an SMP system, many SMP operating systems support task affinity, which lets the system designer "lock" any process (and all of its associated threads) to a specific core. Other threads can continue to float dynamically between cores. This restricted scheduling behavior can ease the migration of software to an SMP environment by allowing existing applications to run on a multi-core system without modification.
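How affinity is set differs from one operating system to the next. As one illustration, Python exposes the Linux affinity calls as os.sched_getaffinity and os.sched_setaffinity; they are not available on every platform, so the sketch below checks for them first. The helper name pin_process_to_core is an assumption, and the sketch relies on the Linux behavior that threads started after the call inherit the caller's mask.

```python
import os

def pin_process_to_core(core=None):
    """Restrict the caller to one core; return that core, or None if not possible."""
    if not hasattr(os, "sched_setaffinity"):   # API is missing on some platforms
        return None
    allowed = os.sched_getaffinity(0)          # cores the caller may currently use
    if core is None:
        core = min(allowed)                    # default: lowest-numbered allowed core
    if core not in allowed:
        raise ValueError(f"core {core} is not in the allowed set {sorted(allowed)}")
    try:
        os.sched_setaffinity(0, {core})        # pid 0 means "the caller"
    except OSError:                            # e.g. the sandbox forbids the call
        return None
    return core

# Typical use: call once at startup, before any worker threads are created.
# pinned = pin_process_to_core()
```

Pinning at startup is the least invasive option for legacy code, since nothing inside the application has to change.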
Locate Bugs
In addition to the wide variety of bugs found in serial code, parallel code is susceptible to data corruption when two or more threads access data at the same time. Issues arise when one thread stores a temporary value in memory prior to completing its task and another thread reads the data and uses it, as if it were valid.
Multi-processing systems need mechanisms for protecting global data structures while allowing multiple processors to execute code concurrently in the system. Mutexes and semaphores are two common locking mechanisms that can prevent multiple threads from operating on the same data at the same time. They provide mutual exclusion or synchronize accesses between multiple threads.
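A mutex guarding a shared counter is the smallest useful example of mutual exclusion. The sketch below uses Python's threading.Lock; the SharedCounter class and the thread and iteration counts are assumptions chosen for illustration.

```python
import threading

class SharedCounter:
    """A global counter whose read-modify-write update is guarded by a mutex."""

    def __init__(self):
        self.value = 0
        self._mutex = threading.Lock()

    def increment(self):
        with self._mutex:          # only one thread at a time may update the value
            self.value += 1

def hammer(counter, increments):
    for _ in range(increments):
        counter.increment()

counter = SharedCounter()
threads = [threading.Thread(target=hammer, args=(counter, 10_000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock in place, the final value is always the number of threads times the number of increments. Without it, updates from different threads can overwrite one another and some increments may be lost.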
Race conditions can also plague heavily threaded code. These occur when the output of one thread is unexpectedly and critically dependent on the sequence or timing of other threads. Unless threads are properly synchronized, they may incorrectly influence one another's behavior.
These bugs deserve careful attention and require visual tools to help determine their cause. Tools should clearly show multiple thread contexts at the same time, as well as their data and timing interdependencies.
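The temporary-value problem described above is an ordering dependency, and one common fix is to make the reader wait for an explicit "ready" signal. The following sketch uses Python's threading.Event; the variable names and the computation are illustrative assumptions.

```python
import threading

result = {"value": None}
ready = threading.Event()
seen = []

def producer():
    result["value"] = -1                   # temporary placeholder, not yet valid
    result["value"] = sum(range(1000))     # final value
    ready.set()                            # publish only after the work is complete

def consumer():
    ready.wait()                           # block until the producer signals validity
    seen.append(result["value"])

# The consumer is started first on purpose: the outcome must not depend on start order.
c = threading.Thread(target=consumer)
p = threading.Thread(target=producer)
c.start()
p.start()
c.join()
p.join()
```

Because the consumer cannot proceed until the event is set, it never observes the placeholder, whichever thread the scheduler happens to run first.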
Optimize Performance
Once software is free of bugs, the developer can focus on performance optimization. Two key avenues -- balancing workload and managing thread migration -- were already discussed. These methods can help speed up the handling of time-critical interrupts in embedded applications.
Sometimes correct code is not good code, as when semaphores and synchronization are used casually to ensure proper code execution and end up creating unnecessary thread stalls and deadlocks. The resulting waiting periods can seriously impact system performance, especially for time-critical code. Developers should investigate how well their tools help identify delays in thread execution.
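Where a tool does not report lock contention directly, a rough measurement can be made in the application itself. The sketch below wraps a Python lock so that it accumulates the time callers spend waiting to acquire it; the TimedLock class and the 50 ms hold time are assumptions for illustration, not a feature of any tool named in this article.

```python
import threading
import time

class TimedLock:
    """Lock wrapper that accumulates how long callers waited to acquire it."""

    def __init__(self):
        self._lock = threading.Lock()
        self.total_wait = 0.0
        self.acquisitions = 0

    def __enter__(self):
        start = time.perf_counter()
        self._lock.acquire()
        # The lock is held here, so the statistics update is itself serialized.
        self.total_wait += time.perf_counter() - start
        self.acquisitions += 1
        return self

    def __exit__(self, *exc):
        self._lock.release()
        return False

lock = TimedLock()
holding = threading.Event()

def slow_holder():
    with lock:
        holding.set()
        time.sleep(0.05)           # simulates work done inside the critical section

t = threading.Thread(target=slow_holder)
t.start()
holding.wait()                     # make sure the other thread owns the lock first
with lock:                         # this acquisition stalls until slow_holder releases
    pass
t.join()
```

A large total_wait relative to the number of acquisitions suggests the critical section is too long or the lock is shared too widely.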
Parallelization is a powerful tool to increase performance. Whenever developers observe key tasks taking too long to finish executing, they should consider breaking up the code into data-parallel threads. Other possibilities include assigning tasks to dedicated or lightly loaded cores.
Implement Virtualization
The power of multi-core processors is leading system developers to consider virtualization in a range of uses including supporting heterogeneous OS environments, isolating workloads, and implementing failover. These usages are possible because virtualization allows a single hardware system to simultaneously run multiple OSes (heterogeneous or homogeneous) with their associated applications. Many multi-core system architectures are well suited to supporting virtualization. Hardware enhancements, such as Intel Virtualization Technology, can enable reductions in software size and complexity over traditional software-based virtualization solutions.
Virtualization and multi-processing go hand in hand, both promoting heavily parallel, multi-threaded code environments. Visual tools play a key role in debugging and optimizing systems, especially with the added complexity of running multiple OSes simultaneously.
This article was written by Troy Smith, Program Director for the Intel Communications Alliance (ICA), and Jeff Liborio, Business Alliance Manager for the ICA Program with responsibility for multi-core software programs at Intel (Santa Clara, CA). For more information, contact Mr. Smith at
For more information on QNX, contact Bill Graham, Product Line Manager, Development Tools, at
For more information on Wind River, contact Rob McCammon, Director of Advanced Technology Planning, at
Copyright (c) 2007, ABP International and Intel Corporation. All rights reserved
Intel, Intel logo, Intel Core are trademarks of Intel Corporation in the U.S. and other countries. Other names and brands may be claimed as the property of others.
Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor (VMM) and, for some uses, certain platform software enabled for it. Functionality, performance or other benefits will vary depending on hardware and software configurations and may require a BIOS update. Software applications may not be compatible with all operating systems. Please check with your application vendor.