The computing press is full of discussions about multicore systems, defined here as single-chip computers containing two or more processing cores, each connected to a shared memory (Figure 1).

These devices are being presented as the solution to the performance problems faced by embedded systems, but in fact, multicore may be more of a problem than a solution.

Why Multicore?

Figure 1. Typical multicore system layout

Advances in silicon technology have been dramatic, but manufacturers have passed the long-anticipated point where the costs of squeezing more performance from traditional sequential processors outweigh the benefits. Everybody has known for years that performance increases must eventually be achieved by going parallel, but the issue has been how.

The first steps in this direction involved increasing the complexity of processors so they could execute two or more instructions at once, either by the hardware detecting opportunities for parallelism dynamically, or by making compilers group instructions explicitly for parallel execution. This has been largely successful and is a delight to marketing departments because the process is invisible and requires few changes to existing programs, if any. It was progress, but it, too, ran into the physical limits of silicon, so another change was necessary.

Hardware designers observed that personal computers had a very fortunate property: they ran multiple independent programs at the same time (spreadsheets, email, social networking applications, music downloads, and so on). It didn’t take long for them to realize they could easily duplicate a processing core on a chip. Give both cores access to a common memory and now you can execute two of the unchanged programs at the same time. Advertising was quick to imply that this dual-core system runs twice as fast as the single-core version. Of course it doesn’t; your two-hour build of a large FPGA bitstream still takes two hours because it can use only one of the cores.

Ignoring the hype, there can be real benefits, such as a reduction in power consumption and, most importantly, an increase in total throughput, because two programs can now actually run in parallel; two of your enormous builds complete in the time it used to take for just one. Extend the idea to multiple threads within a program and the opportunities for improvement seem to be multiplied without limit. Once again there is the siren lure of customers getting something for nothing.

It doesn’t take much imagination to reach the idea that you simply add more and more cores on your memory to deliver essentially unlimited performance with no development effort needed by users. The reality is different.

The Real Problems

Figure 2. Debugging cache line errors in multicore systems can be problematic.

The first problem is the shared memory all the cores need to use. As soon as you have a shared resource you have a potential bottleneck. With many cores trying to access memory, some will have to queue until others finish their accesses, and a core on a queue is no help in making applications run faster. Designers attempt to solve this by making the memory system more intricate, adding local memory and local caches, and generally increasing the hardware complexity.

You also suffer from diminishing returns, because the difficulty in finding something for a core to do increases with the number of cores. It’s bad enough for a PC, but if an embedded system is running one, dedicated, sequential application, it is unlikely there will be enough independent execution paths to benefit from many cores.
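The diminishing returns can be put into rough numbers. The sketch below uses Amdahl's law, a standard model that this article does not itself name, so treat it as an assumed illustration rather than the authors' own analysis. The function name amdahl_speedup and its parameters are hypothetical.

```c
#include <assert.h>

/* Estimated speedup on `cores` cores when a fraction `parallel_fraction`
 * (0.0 to 1.0) of the work can be spread evenly across the cores and the
 * rest must run sequentially. This is Amdahl's law, used as an assumed
 * model; it ignores memory contention, which only makes matters worse. */
double amdahl_speedup(double parallel_fraction, unsigned cores)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(cores >= 1);

    double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / (double)cores);
}
```

Under this model a purely sequential application (parallel_fraction of 0.0) gets a speedup of exactly 1.0 however many cores are available, and an application that is half sequential can never quite reach a speedup of 2, even with a thousand cores.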

What about bugs? Your program crashes because memory has been corrupted. Where’s the error? Debugging when there is only one processor involved can be tricky, but it is almost impossible if you have many possible culprits all sitting there poised to scribble anywhere they like. You could implement memory protection, but that’s yet another increase in complexity, for both the hardware and the software.

As the complexity of the hardware rises, the opportunity for subtle errors grows too. An insidious one involves common cache mechanisms with two cores using distinct but abutting memory areas that share a cache line, the unit of memory managed by a cache. Everything looks to be under control, but when one core writes to its memory the whole of a cache line can get written back. This makes the required update to memory, but also corrupts the other core’s area with values cached earlier (Figure 2). Debugging that is a nightmare; the problem is totally dependent on the execution order of instructions in different processors. Stopping one with a debugger changes the timing of accesses and can hide the problem.
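The write-back scenario described above can be replayed deterministically in a toy model. The sketch below assumes private caches with no hardware coherence between them and an artificially small line size; every name in it (cached_line, line_fill, replay_shared_line, and so on) is hypothetical and exists only for illustration. Hardware that keeps caches coherent avoids this particular corruption, at the cost of the extra logic discussed next.

```c
#include <assert.h>
#include <string.h>

#define LINE_SIZE 8  /* bytes per cache line; tiny, for illustration only */

/* One core's private, non-coherent copy of a single cache line. */
typedef struct {
    unsigned char data[LINE_SIZE];
} cached_line;

/* Load the whole line from memory into a core's cache. */
void line_fill(cached_line *c, const unsigned char *memory)
{
    memcpy(c->data, memory, LINE_SIZE);
}

/* A core writes one byte of its own area; only its cached copy changes. */
void line_write(cached_line *c, unsigned offset, unsigned char value)
{
    assert(offset < LINE_SIZE);
    c->data[offset] = value;
}

/* Write-back: the WHOLE line goes to memory, stale bytes included. */
void line_write_back(const cached_line *c, unsigned char *memory)
{
    memcpy(memory, c->data, LINE_SIZE);
}

/* Replays the scenario: core A owns bytes 0..3, core B owns bytes 4..7,
 * and both areas share one line. Returns the final value of A's byte 0. */
unsigned char replay_shared_line(void)
{
    unsigned char memory[LINE_SIZE] = {0};
    cached_line core_a, core_b;

    line_fill(&core_a, memory);
    line_fill(&core_b, memory);        /* B caches A's area as zeros */

    line_write(&core_a, 0, 0xAA);
    line_write_back(&core_a, memory);  /* memory[0] is now 0xAA */

    line_write(&core_b, 4, 0xBB);
    line_write_back(&core_b, memory);  /* B's stale zeros overwrite A's byte */

    return memory[0];
}
```

In this model replay_shared_line() returns 0x00 rather than 0xAA: core A's update has been silently undone by a core that never touched A's area. Swap the order of the two write-backs and it is B's byte that is lost instead, which is why the fault depends so heavily on timing.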

Problems like these persuade hardware designers to add more features, snooping caches or whatever, requiring more and more logic and ever more complexity. The industry doesn’t see a problem because of the bizarre fact that computing is, perhaps, the only engineering discipline where “more visibly complicated” is applauded as “better”.

It is evident that simply increasing the number of cores on a memory is not the way to go, but the only escape route many see is to build systems from clusters: several separate multicore devices connected not by shared memory but by communication links. Now you can keep the number of cores on a memory below the threshold where contention slows things down badly, but you have created another problem. Everything had been predicated on system transparency, where users didn’t need to know what’s happening and could continue to run code written in an ancient style. You lose that transparency immediately if you are forced to overflow onto a core that isn’t connected to the memory the others are using: core A on chip X can communicate with core B on X through their common memory, but A has to use I/O for an essentially identical communication with core C on chip Y (Figure 3). Welcome to multiple communication mechanisms and even more complexity.

A Solution

Figure 3. Designing cluster-based systems where separate multicore devices are connected by communication links instead of shared memory presents a different set of problems.

Whenever you find you are solving a problem by adding complexity and that solution in turn generates further problems, it’s usually because you’ve made a fundamental mistake early on in the process. Against developers’ natural inclination, we should be prepared to backtrack and revise earlier decisions. What is that fatal flaw in the multicore argument? It’s the idea that we can continue to write large, monolithic systems using techniques designed for single-processor systems and expect them to be appropriate for multiprocessor systems.

Rather than throwing the baby out with the bath water by creating an overblown API, inventing yet another language, or, worse, adding ad hoc gargoyles and curlicues to existing languages, we should recognize there is a great deal of skill and experience in writing and debugging sequential code with languages such as C. People have a natural feel for sequential processes and it would be foolish to ignore the fact.

We are going to be forced ultimately to have multiple, separate processors communicating by ultra-high-speed links, so we should bite the bullet now and adopt a programming style acknowledging both the sequential skills of developers and the distributed nature of processing engines. A scheme addressing these two points finds a natural home on multicore devices but will not be a dead end requiring another shift in programming style when shared-memory systems inevitably hit the buffers.
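As a minimal sketch of what such a style might look like in plain C, the code below has two ordinary sequential components that share nothing and interact only through a point-to-point channel. The channel type and the chan_init, chan_send, chan_recv, producer, and consumer functions are hypothetical names invented for this illustration, not an API from any product. The channel here is backed by a small ring buffer in memory; the assumption is that a serial link driver could sit behind the same calls without the components changing.

```c
#include <assert.h>
#include <stddef.h>

#define CHAN_CAPACITY 4

/* A point-to-point channel, backed here by a small ring buffer. */
typedef struct {
    int    slots[CHAN_CAPACITY];
    size_t head;   /* index of the next message to receive */
    size_t count;  /* number of messages currently queued */
} channel;

void chan_init(channel *ch)
{
    assert(ch != NULL);
    ch->head = 0;
    ch->count = 0;
}

/* Returns 1 if the message was queued, 0 if the channel is full. */
int chan_send(channel *ch, int message)
{
    assert(ch != NULL);
    if (ch->count == CHAN_CAPACITY)
        return 0;
    ch->slots[(ch->head + ch->count) % CHAN_CAPACITY] = message;
    ch->count++;
    return 1;
}

/* Returns 1 and stores a message in *out, or 0 if the channel is empty. */
int chan_recv(channel *ch, int *out)
{
    assert(ch != NULL && out != NULL);
    if (ch->count == 0)
        return 0;
    *out = ch->slots[ch->head];
    ch->head = (ch->head + 1) % CHAN_CAPACITY;
    ch->count--;
    return 1;
}

/* Sequential component 1: sends 1..n, stopping early if the channel
 * fills. Returns how many messages were sent. */
int producer(channel *out, int n)
{
    int sent = 0;
    for (int i = 1; i <= n; i++) {
        if (!chan_send(out, i))
            break;
        sent++;
    }
    return sent;
}

/* Sequential component 2: drains the channel and returns the sum. */
int consumer(channel *in)
{
    int total = 0;
    int value;
    while (chan_recv(in, &value))
        total += value;
    return total;
}
```

Each component is straightforward sequential C and can be tested on its own by feeding it a channel directly. Neither one knows, or needs to know, whether its partner runs on the same chip or a different one.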

Multiprocessor hardware design should address the following:

  • Stop the explosion of complexity so painfully evident in current designs.
  • Use simpler processors that can be programmed efficiently using unaltered popular languages, such as C, to implement sequential components that communicate with other such components.
  • Take advantage of the independence of these components to allow them to be debugged separately.
  • Drop the almost religious faith in shared resources such as common memory.
  • Use their simplicity to make processors small and as fast as possible, leaving room for many on a chip with all the resulting energy and other savings.
  • Do not waste silicon on an array of specialized peripherals. Most processors will do no real-world I/O, so limit the replicated peripherals to a uniform set of extremely fast, trivially simple, point-to-point serial links with none of the unnecessary baggage of current devices.
  • At the periphery, where the system interacts with the real world, provide


Multicore processors exist today and are pushing developers into a mode of working that is inherently a dead end. We should embrace parallel working by designing simple processing components that communicate efficiently and program them using straightforward, sequential techniques based on communicating programs.

This article was written by Sebastien Maury, America Regional Director, Sundance Multiprocessor Technology (Buckinghamshire, UK) and Dr. Peter Robertson, Managing Director, 3L (Edinburgh, Scotland).

This article first appeared in the March, 2010 issue of Embedded Technology Magazine.
