The computing press is full of discussions about multicore systems, defined here as single-chip computers containing two or more processing cores, each connected to a common shared memory (Figure 1).

These devices are being presented as the solution to the performance problems faced by embedded systems, but in fact, multicore may be more of a problem than a solution.

Why Multicore?

Figure 1. Typical multicore system layout

Advances in silicon technology have been dramatic, but manufacturers have passed the long-anticipated point where the costs of squeezing more performance from traditional sequential processors outweigh the benefits. Everybody has known for years that performance increases must eventually be achieved by going parallel, but the issue has been how.

The first steps in this direction involved increasing the complexity of processors so they could execute two or more instructions at once, either by having the hardware detect opportunities for parallelism dynamically or by having the compiler group instructions explicitly for parallel execution. This has been largely successful and is a delight to marketing departments because the process is invisible and requires few changes to existing programs, if any. It was progress, but it, too, ran into the physical limits of silicon, so another change was necessary.

Hardware designers observed that personal computers had a very fortunate property: they ran multiple independent programs at the same time (spreadsheets, email, social networking applications, music downloads, and so on). It didn’t take long for them to realize they could easily duplicate a processing core on a chip. Give both cores access to a common memory and you can now execute two of those unchanged programs at the same time. Advertising was quick to imply that this dual-core system runs twice as fast as the single-core version. Of course it doesn’t; your two-hour build of a large FPGA bitstream still takes two hours because it can use only one of the cores. Ignoring the hype, there can be real benefits: power consumption can fall and, most importantly, total throughput rises because two programs really do run in parallel; two of your enormous builds complete in the time one used to take. Extend the idea to multiple threads within a program, as in the sketch below, and the opportunities for improvement seem to multiply without limit. Once again there is the siren lure of customers getting something for nothing.
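
To make that last point concrete, here is a minimal sketch of thread-level parallelism in C (my own illustration, not from the original article): one job is split into two independent halves and each half is handed to a POSIX thread. Assuming the halves really are independent, a dual-core system can run them at the same time.

#include <pthread.h>
#include <stdio.h>

/* Illustrative only: sum two independent halves of an array in parallel.
 * On a dual-core part the operating system can schedule one thread per
 * core, so the two halves genuinely run at the same time. */

#define N 1000000
static long data[N];

struct slice { long *start; long count; long sum; };

static void *sum_slice(void *arg)
{
    struct slice *s = (struct slice *)arg;
    s->sum = 0;
    for (long i = 0; i < s->count; i++)
        s->sum += s->start[i];
    return NULL;
}

int main(void)
{
    for (long i = 0; i < N; i++)
        data[i] = 1;

    struct slice a = { data,         N / 2,     0 };
    struct slice b = { data + N / 2, N - N / 2, 0 };
    pthread_t ta, tb;

    pthread_create(&ta, NULL, sum_slice, &a);
    pthread_create(&tb, NULL, sum_slice, &b);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);

    printf("total = %ld\n", a.sum + b.sum);
    return 0;
}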

It doesn’t take much imagination to reach the idea that you can simply keep adding cores to that shared memory and deliver essentially unlimited performance with no development effort from users. The reality is different.

The Real Problems

Figure 2. Debugging cache line errors in multicore systems can be problematic.

The first problem is the shared memory all the cores need to use. As soon as you have a shared resource you have a potential bottleneck. With many cores trying to access memory, some will have to queue until others finish their accesses, and a core stuck in a queue is no help in making applications run faster. Designers attempt to solve this by making the memory system more intricate, adding local memories and local caches, and generally increasing the hardware complexity.
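
The caches themselves bring a subtler form of contention. The short C sketch below (an illustration of my own, not taken from the article) shows "false sharing": each thread updates only its own counter, but because the two counters sit in the same cache line, that line bounces between the cores' caches and the supposedly independent threads slow each other down. Behaviour like this is the kind of cache line trouble that, as Figure 2 suggests, is hard to debug.

#include <pthread.h>
#include <stdio.h>

/* Illustrative only: two threads update two separate counters, but the
 * counters are adjacent in memory and so usually occupy the same cache
 * line. Every write forces the line to move between the cores' caches,
 * so the "independent" threads contend for the memory system anyway.
 * Padding the structure so the counters sit in different cache lines
 * (for example, a char pad[64] between them) removes the contention. */

struct counters {
    volatile long a;   /* updated only by thread 1 */
    volatile long b;   /* updated only by thread 2 */
};

static struct counters c;

static void *bump_a(void *arg)
{
    (void)arg;
    for (long i = 0; i < 100000000L; i++) c.a++;
    return NULL;
}

static void *bump_b(void *arg)
{
    (void)arg;
    for (long i = 0; i < 100000000L; i++) c.b++;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a = %ld, b = %ld\n", c.a, c.b);
    return 0;
}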

You also suffer from diminishing returns, because the difficulty of finding useful work for each additional core grows as cores are added. It is bad enough for a PC, but if an embedded system is running one dedicated, sequential application, it is unlikely there will be enough independent execution paths to benefit from many cores.
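
Amdahl's law puts a number on this. If only a fraction p of an application can run in parallel, the best possible speedup on n cores is 1 / ((1 - p) + p / n), so the serial part quickly dominates. The short C program below works through an illustrative case (p = 0.8 is my own assumption, not a figure from the article): even with 64 cores the speedup stays below 5x.

#include <stdio.h>

/* Amdahl's law: with a fraction p of the work parallelizable, the best
 * possible speedup on n cores is 1 / ((1 - p) + p / n).  The value
 * p = 0.8 below is purely for illustration. */

int main(void)
{
    double p = 0.8;
    for (int n = 1; n <= 64; n *= 2)
        printf("%2d cores: speedup %.2fx\n", n, 1.0 / ((1.0 - p) + p / n));
    return 0;
}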