All major microprocessor vendors are introducing multicore architectures to deliver the latest installment of performance improvements and cost savings as required to remain competitive in the silicon marketplace. Indeed, multicore processors offer increases in processing throughput, savings in power consumption, and reductions in heat dissipation. However, some effort is required to fully exploit multicore benefits. Unfortunately, the transition to a multicore architecture is often best described by the old adage, “no pain, no gain.”

[Figure: The PERC Ultra real-time virtual machine uses CPU affinity APIs to dispatch the N highest-priority Java threads from a global ready queue to N available cores. The underlying operating system manages contention between Java threads and non-Java threads, and the system configuration may partition the available cores so that some are dedicated entirely to Java threads and others to non-Java threads. Although the illustration suggests that the operating system maintains an independent ready queue for each core, some operating systems use different representations.]
As with many large undertakings, a key to making efficient use of multicore is to divide the workload into many smaller components, and to conquer each component independently. If the original workload can be divided into components of nearly identical size, then it's reasonable to create as many components as there are available cores. If, however, the effort associated with each component task is difficult to predict, or varies from one component to the next, then you'll need to divide the original workload into many more tasks than there are cores, allowing the cores to balance the workload among themselves automatically. Cores assigned large tasks will work almost exclusively on those tasks, while any core assigned a small task will quickly complete that assignment and begin working on another.
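As a minimal sketch of this rule of thumb (the oversubscription factor and simulated task sizes below are illustrative assumptions, not figures from this article), a fixed-size thread pool with one worker per core can be fed many more tasks than cores; each worker pulls the next task from the shared queue as soon as it finishes its current one, so uneven tasks balance out automatically:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class WorkDivision {
    public static void main(String[] args) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);

        // Submit many more tasks than cores. Large tasks occupy a core
        // almost exclusively; a core that draws a small task finishes
        // quickly and takes the next one from the queue.
        int taskCount = cores * 8;  // illustrative oversubscription factor
        List<Future<Long>> results = new ArrayList<>();
        for (int i = 0; i < taskCount; i++) {
            final int chunk = i;
            results.add(pool.submit(() -> {
                long sum = 0;
                // Simulate a component task of unpredictable size.
                for (long j = 0; j < 1_000_000L * (chunk % 4 + 1); j++) {
                    sum += j;
                }
                return sum;
            }));
        }

        long total = 0;
        for (Future<Long> f : results) {
            total += f.get();  // combine the independently computed results
        }
        System.out.println("total = " + total);
        pool.shutdown();
    }
}
```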

These rules of thumb apply if the primary goal is to improve the quantity of data processed per unit of time. Other goals might motivate different tradeoffs. For example, to maximize battery life in a handheld device, cores may be turned off or their clock rates slowed. In this configuration, multiple cores enable very rapid response to urgent user demands, assuming it is possible for all cores to quickly power up and efficiently coordinate their efforts in response to the end user’s request. Optimizing responsiveness to urgent user requests on multicore platforms depends on being able to effectively utilize each core in providing the response. If the response computations cannot be easily balanced between multiple processor cores, a uniprocessor solution with a higher clock rate might actually deliver a higher-quality end user experience.

Speaking the Multicore “Language”

Engineers responsible for developing software for, or transitioning to, a multicore platform must consider general principles of organizational efficiency. For example, as managers delegate responsibilities, they must clearly identify each responsibility to the subordinate, and must grant the subordinate all the authority required to fulfill it. The same principle applies directly to the division of labor among software components. To effectively utilize the full capacity of multicore processors, each core must be able to complete its assigned tasks without depending on frequent coordination with other cores. On modern multicore processors, coordination with another core may rob the currently executing application of hundreds of machine instructions' worth of useful computation. Thus, an essential key to multicore optimization is the selection or development of algorithms that allow each core to complete its work without requiring frequent coordination with other cores.
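To make the cost of frequent coordination concrete, here is a hedged sketch (the class and constants are hypothetical, chosen only for illustration) contrasting a design in which every worker updates a single shared counter on every iteration with a design in which each worker accumulates privately and coordinates exactly once, when it finishes:

```java
import java.util.concurrent.atomic.AtomicLong;

public class CoordinationCost {
    static final int THREADS = 4;
    static final long ITERATIONS = 10_000_000L;

    public static void main(String[] args) throws InterruptedException {
        // Heavily coordinated: every increment contends for the same
        // location, forcing cross-core cache traffic on each operation.
        AtomicLong shared = new AtomicLong();
        runWorkers(() -> {
            for (long i = 0; i < ITERATIONS; i++) shared.incrementAndGet();
        });
        System.out.println("shared counter: " + shared.get());

        // Minimally coordinated: each thread works on a private sum and
        // publishes its result exactly once, when it finishes.
        long[] partial = new long[THREADS];
        Thread[] workers = new Thread[THREADS];
        for (int t = 0; t < THREADS; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                long local = 0;
                for (long i = 0; i < ITERATIONS; i++) local++;
                partial[id] = local;  // single publication per thread
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();  // join makes partial[] safely visible
        long total = 0;
        for (long p : partial) total += p;
        System.out.println("partial sums:   " + total);
    }

    static void runWorkers(Runnable body) throws InterruptedException {
        Thread[] workers = new Thread[THREADS];
        for (int t = 0; t < THREADS; t++) workers[t] = new Thread(body);
        for (Thread w : workers) w.start();
        for (Thread w : workers) w.join();
    }
}
```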

Once the division of responsibilities is completed, the choice of programming language can influence how the intended semantics are implemented. The Java language, for example, was designed to support multiple threads of execution running on independent cores. As such, the language syntax has certain built-in features that map directly to the special needs of multiprocessor computation. The C and C++ languages were designed before multiprocessing was common, and they were originally designed to support a single thread of execution.

Regardless of the programming language, software engineers must use the language carefully to correctly implement the multiprocessor coordination activities required by the chosen division of responsibilities and selected algorithms. If the language is used incorrectly, essential coordination may be omitted or excessive coordination introduced, resulting in unreliable or inefficient operation.

With Java, the built-in synchronized keyword marks a sequence of instructions that accesses variables potentially shared between multiple processors. The Java virtual machine acquires a mutual exclusion lock upon entry to the synchronized block of code, guaranteeing that only one thread at a time is allowed to execute that block. The Java language imposes automatic memory barriers as part of the implementation of every synchronized statement. Upon entry into the synchronized statement, all cached copies of non-local variables are refreshed from shared memory. Upon exit from the synchronized statement, any cached values of non-local variables that might have been modified by the current thread are written to shared memory. Further, the Java compiler is prohibited from reordering instructions across these memory barriers.
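As a brief illustration of these guarantees (the class below is a hypothetical example, not taken from this article), all access to a shared balance is routed through synchronized blocks, so the entry and exit memory barriers make each update visible to the next thread that acquires the lock:

```java
public class SharedAccount {
    private final Object lock = new Object();
    private long balance;  // shared between threads; guarded by 'lock'

    public void deposit(long amount) {
        synchronized (lock) {
            // Entry acquires the mutual exclusion lock and refreshes
            // cached copies of non-local variables from shared memory.
            balance += amount;
        }   // Exit writes modified values back to shared memory.
    }

    public long getBalance() {
        synchronized (lock) {
            return balance;
        }
    }
}
```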

[Figure: Two multicore benchmarks demonstrate that dividing the total work into multiple threads yields higher performance, but interdependencies between threads limit the speedup to less than the full capacity of each additional core, as predicted by Amdahl's law. Measurements show the performance of the PERC Ultra SMP virtual machine running Java threads on a Dell PowerEdge 1900 with dual quad-core Xeon E5310 processors at 1.6 GHz.]
Declaring a Java variable volatile has a similar effect. No instructions may be reordered around a fetch or store of a volatile variable. Further, all cached copies of non-local variables are refreshed from shared memory at the moment of a volatile read, and all cached values are committed to shared memory at the moment of each volatile write. Note that these built-in language features make it straightforward to develop portable and maintainable code that will run reliably on a variety of multiprocessor configurations. Because the memory barriers behind these semantic guarantees carry a cost, it is important to use these built-in constructs judiciously.
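A common use of this property (again a hedged sketch; the class and field names are illustrative) is a shutdown flag. Because the flag is volatile, a worker polling it is guaranteed to observe a store performed by another thread, with no reordering around the flag accesses:

```java
public class Worker implements Runnable {
    // volatile guarantees that a write by one thread is visible to
    // subsequent reads by other threads, and that no instructions are
    // reordered around the fetch or store of the flag.
    private volatile boolean shutdownRequested;

    public void requestShutdown() {
        shutdownRequested = true;  // volatile write: committed to shared memory
    }

    @Override
    public void run() {
        while (!shutdownRequested) {  // volatile read: refreshed from shared memory
            // ... perform one unit of work ...
        }
    }
}
```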

By comparison, support for multiple threads in C and C++ was added after the languages were originally defined. The POSIX libraries enable a program to start new threads and to enforce mutual exclusion with semaphores and other locking mechanisms. For variables declared with the volatile keyword, the compiler promises not to reorder assignments and fetches with respect to other volatile variables. However, it does not guarantee the absence of reordering with respect to variables that are not declared volatile. Developers who want to guarantee that the compiler does not reorder accesses to multiple related variables must declare all of those variables volatile, meaning that every access to any of them will be more expensive than a normal access.

Another difficulty with C and C++ is that the languages do not provide control over the reordering of instructions and memory barriers with respect to invocations of semaphore operations and other POSIX services. Certain C compilers (e.g., GNU GCC) provide special directives to enforce memory barriers, but these are non-standard and non-portable. The C++0x revision of the C++ language (since standardized as C++11) provides new mechanisms, including atomic types with explicit memory-ordering semantics, to improve memory barrier abstractions in C++ code. Of course, existing C++ code was not written to use this new standard, and it may be a while before C++ compilers fully support it.

In summary, the greatest challenge of moving to multicore is structuring the workload to be efficiently divided between multiple independent threads. Once the workload is so partitioned, implementing the design is a matter of mapping the desired communication and coordination activities to the chosen programming language. Languages like Java provide multiprocessor programming notations that are portable and efficient. Legacy languages like C and C++ can support multiprocessor applications, but currently require the use of non-standard and non-portable compiler features.

This article was written by Kelvin Nilsen, CTO for Java at Atego Systems, San Diego, CA.