Building a multicore system means dealing with non-determinism. Interactions between tasks running on different cores can occur in a different order and at a different rate from one run to the next. This makes bugs harder to reproduce, find, and fix. It also lowers the probability that validation and QA have caught every problem.

The following is an example of how moving to a multicore processor can introduce bugs that are difficult to track down because of non-determinism.

Example of a Multicore Bug

Figure 1. Timeline of a multicore bug.
A network router has been set up to handle ARP responses in an interrupt handler. The handler checks whether a global lock is held. If it is not, the handler updates the ARP cache immediately and returns to the idle loop. If the lock is held, the handler adds the ARP response to a list and the update happens later. This design was initially implemented for a single-core system, where no other code can execute while the interrupt is being serviced, so it worked every time. When ported to multicore, however, the second core may grab the lock shortly after the ISR on the first core has checked whether the lock is taken. This leads to crashes, as shown in Figure 1.

Debugging this crash is difficult because the first core triggered the problem, yet when the system is inspected in a debugger the first core appears to be unrelated: it is sitting in the idle loop, doing nothing wrong.

However, this is the easy-to-debug case. Instead of crashing on a NULL pointer dereference, core 2 could get back the wrong ARP table entry. The packet would then be sent to the wrong client, which would produce bugs that are even harder to track down.
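
The failure in this example is a check-then-act race: the handler tests the lock in one step and touches the ARP cache in another, and a second core can slip in between the two. One common remedy, sketched here in C11, is to make the test and the acquisition a single atomic operation. All names in the sketch (`arp_lock_held`, `arp_isr_racy`, `arp_isr_fixed`, and the counters that stand in for the cache and the deferred list) are hypothetical; this illustrates the pattern and is not the router's actual code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the router's state. */
static atomic_bool arp_lock_held;    /* false = free (zero-initialized) */
static int cache_updates;            /* stands in for the ARP cache */
static atomic_int deferred_updates;  /* stands in for the deferred list */

/* Test and acquire in one atomic step. atomic_exchange returns the
 * previous value, so "false" means the lock was free and is now ours. */
bool arp_try_lock(void)
{
    return !atomic_exchange(&arp_lock_held, true);
}

void arp_unlock(void)
{
    assert(atomic_load(&arp_lock_held)); /* releasing a free lock is a bug */
    atomic_store(&arp_lock_held, false);
}

/* Single-core pattern, shown for contrast: the check and the update are
 * separate steps, so another core can take the lock in between. */
void arp_isr_racy(void)
{
    if (!atomic_load(&arp_lock_held)) {
        /* window: a second core may acquire the lock right here */
        cache_updates++;
    } else {
        atomic_fetch_add(&deferred_updates, 1);
    }
}

/* Multicore-safe variant: the cache is touched only while the lock
 * is actually owned by this handler. */
void arp_isr_fixed(void)
{
    if (arp_try_lock()) {
        cache_updates++;
        arp_unlock();
    } else {
        atomic_fetch_add(&deferred_updates, 1);
    }
}
```

For this to hold, every other code path that touches the cache would have to take the same lock, either through `arp_try_lock` or through a blocking equivalent.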

How to Move to Multicore

Almost everyone who switches to multicore finds that an application they believed handled concurrency correctly actually has many bugs. Fortunately, there are tools available to make the transition from single core to multicore less painful.

We are going to focus on tools and features that are provided by the operating system. Most embedded operating systems, such as Green Hills Software’s INTEGRITY® RTOS, provide these features. The following tips on how to use these tools are focused on increasing the determinism of a system. This will make it possible to release the product faster and with fewer painfully long debug and test cycles.

Increase Determinism with Address Spaces

The simplest OS feature is splitting threads into their own address spaces. Yes, this is technology from the 70s, but much of the embedded world still runs everything in the kernel or a small handful of giant address spaces with no MMU protection. A typical example is over 100 threads with several million lines of web server, secure shell, configuration, and other application features all running in kernel space. The entire system is at the mercy of the worst programmer, on his worst day, making risky changes just before he leaves for a three-week vacation.

Pulling threads into their own address space makes it easier to tell where the critical sections are that require locking. When threads share the same address space, every line of code is a potential critical section that needs to be checked. When running in different address spaces, the communication and interaction between them is, by necessity, far more constrained, which means it is easier to analyze the interactions and ensure that they are correct. It is also easier to log those interactions.
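
As a rough illustration of what a constrained interaction can look like, the sketch below defines a fixed-format message and a single validation routine that a receiving address space could apply to every incoming request. The message types, the `struct msg` layout, and `msg_is_valid` are all assumptions made for this example; the OS transport that would carry the message between address spaces is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message types; a real system would define its own. */
enum msg_type { MSG_ARP_UPDATE = 1, MSG_CONFIG_SET = 2 };

#define MSG_PAYLOAD_MAX 32

/* The only data that crosses the address-space boundary. */
struct msg {
    uint32_t type;                     /* one of enum msg_type */
    uint32_t length;                   /* bytes used in payload */
    uint8_t  payload[MSG_PAYLOAD_MAX];
};

/* Single choke point: every request from another address space is
 * checked here before the receiver acts on it. */
bool msg_is_valid(const struct msg *m)
{
    assert(m != NULL);
    if (m->type != MSG_ARP_UPDATE && m->type != MSG_CONFIG_SET)
        return false;                  /* unknown request */
    return m->length <= MSG_PAYLOAD_MAX;
}
```

Because all traffic passes through one routine, that routine is also a natural place to add a log point.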

Always Log So You Can Always Debug

Robust, low-overhead logging provided by the OS is needed to help you manage the inevitable non-determinism that will creep into your system. No matter what you do, if you interact with the outside world in any way, your program will be non-deterministic to some degree. It is essential to have logging in the OS, so you can see what is happening at the lowest levels, interleaved with any application-specific logging you add to the log stream.

To make sense of the log data, you also need a graphical tool that displays it on a linear timeline. The tool should let you easily zoom into areas of interest, correlate specific events with the threads that generated them, and tell you where they came from in the code. You need this when developing a multicore system, because there will inevitably be problems that show up very infrequently, perhaps only once in weeks of testing. With logging, there is a good chance that it will be possible to figure out what happened and fix it. The simplest way to do this is to compare a log of the failure with a log of a successful run. Differences between the logs are good indications of where to narrow the search.
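
The comparison of a failing log against a successful one can be partly automated. The following is a minimal sketch, assuming a hypothetical `struct log_event` record with a timestamp, a thread ID, and an event ID; real OS log formats will differ. It ignores timestamps, which vary from run to run, and reports the first position where the two event sequences disagree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical log record; actual formats depend on the OS. */
struct log_event {
    uint64_t timestamp;  /* not compared: timing differs between runs */
    uint16_t thread_id;
    uint16_t event_id;
};

/* Returns the index of the first event whose thread or event ID differs
 * between the two logs. If one log is a prefix of the other (or they
 * match completely), returns the length of the shorter log. */
size_t log_first_divergence(const struct log_event *good, size_t good_len,
                            const struct log_event *bad, size_t bad_len)
{
    assert(good != NULL && bad != NULL);
    size_t n = good_len < bad_len ? good_len : bad_len;
    for (size_t i = 0; i < n; i++) {
        if (good[i].thread_id != bad[i].thread_id ||
            good[i].event_id != bad[i].event_id)
            return i;
    }
    return n;
}
```

On a multicore system the ordering of events from unrelated threads can differ even between two successful runs, so in practice it may be more useful to filter each log down to one thread, or one group of interacting threads, before comparing.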

The logging mechanism needs to have low overhead so that it can always be on. This is for three reasons:

  1. Some bugs are so timing dependent that they disappear when logging is enabled. If logging is always on, however, even in the final production system, then you will never have this problem.
  2. If a hard-to-reproduce bug does show up, you will have a log of it and you can investigate the problem right away. If, instead, you only turn the log on when you are looking for a problem, you are bound to waste time trying to get a log of the failure after the fact.
  3. If a problem shows up in the field, there will be a log to inspect. Asking customers to turn on logging and spend their time reproducing problems leads to angry customers, especially when they cannot get the problem to reproduce with logging enabled.

If the log is always on, it will eventually overflow the log buffer. To address this problem, the log should be stored in a circular buffer, so that the newest entries overwrite the oldest. When a problem is detected, the code should save the log buffer to more permanent storage for later inspection.
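
A circular log buffer with a snapshot routine can be sketched in a few lines of C. The names (`struct log_ring`, `log_write`, `log_snapshot`) and the tiny capacity are assumptions chosen for illustration. The sketch is single-writer and not thread-safe; a multicore logger would need something more, such as one buffer per core or an atomically updated write index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_CAPACITY 4  /* deliberately small for illustration */

struct log_entry {
    uint32_t event_id;
    uint32_t arg;
};

struct log_ring {
    struct log_entry entries[LOG_CAPACITY];
    size_t next;   /* slot that the next write will use */
    size_t count;  /* number of valid entries, at most LOG_CAPACITY */
};

/* Append an entry, overwriting the oldest one once the buffer is full. */
void log_write(struct log_ring *r, uint32_t event_id, uint32_t arg)
{
    assert(r != NULL);
    r->entries[r->next] = (struct log_entry){ event_id, arg };
    r->next = (r->next + 1) % LOG_CAPACITY;
    if (r->count < LOG_CAPACITY)
        r->count++;
}

/* Copy the entries, oldest first, into out (room for LOG_CAPACITY
 * entries required). Returns the number of entries copied. The caller
 * would then write out to permanent storage. */
size_t log_snapshot(const struct log_ring *r, struct log_entry *out)
{
    assert(r != NULL && out != NULL);
    /* Once the buffer has wrapped, the oldest entry sits at next. */
    size_t start = (r->count == LOG_CAPACITY) ? r->next : 0;
    for (size_t i = 0; i < r->count; i++)
        out[i] = r->entries[(start + i) % LOG_CAPACITY];
    return r->count;
}
```

A failure handler would call `log_snapshot` as soon as the problem is detected, before further logging overwrites the events that led up to it.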

Of course, there is a cost to always having logging enabled: it uses some CPU time and some memory. However, if you place the log points carefully, you will find that it saves considerable development and release time. Elusive bugs will be fixed instead of being discovered by customers.