All major microprocessor vendors are introducing multicore architectures to deliver the latest installment of performance improvements and cost savings as required to remain competitive in the silicon marketplace. Indeed, multicore processors offer increases in processing throughput, savings in power consumption, and reductions in heat dissipation. However, some effort is required to fully exploit multicore benefits. Unfortunately, the transition to a multicore architecture is often best described by the old adage, “no pain, no gain.”
As with many large undertakings, a key to making efficient use of multicore is to divide the workload into many smaller components, and to conquer each component independently. If the original workload can be divided into components of nearly identical size, then it’s reasonable to divide the original workload into as many components as there are available cores. If, however, the effort associated with each component task is difficult to predict, or varies from one component to the next, then you’ll need to divide the original workload into many more tasks than the number of available cores, allowing processors to automatically balance the workload. Cores assigned large tasks will work almost exclusively on those tasks, while any core assigned a small task will quickly complete that assignment and begin working on another.
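As a minimal sketch of this rule of thumb, the Java fragment below splits a workload into several times more tasks than there are cores and lets a fixed pool of worker threads pull tasks as they become free. The class name ChunkedSum, the sum-of-squares workload, and the tasksPerCore parameter are illustrative assumptions, not part of any particular product or library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkedSum {
    // Hypothetical workload: sum the squares of the integers in [from, to).
    static long sumOfSquares(int from, int to) {
        long total = 0;
        for (int i = from; i < to; i++) {
            total += (long) i * i;
        }
        return total;
    }

    // Splits [0, n) into roughly (cores * tasksPerCore) chunks so that a core
    // finishing a cheap chunk can immediately pick up another one.
    static long parallelSumOfSquares(int n, int tasksPerCore) {
        int cores = Runtime.getRuntime().availableProcessors();
        int taskCount = Math.max(1, cores * tasksPerCore);
        int chunk = Math.max(1, (n + taskCount - 1) / taskCount);

        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (int start = 0; start < n; start += chunk) {
                final int from = start;
                final int to = Math.min(n, start + chunk);
                Callable<Long> task = () -> sumOfSquares(from, to);
                results.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get(); // blocks until that chunk is done
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a chunk failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(parallelSumOfSquares(1_000_000, 8));
    }
}
```

When every chunk costs about the same, a tasksPerCore value of 1 is enough. Larger values matter only when chunk costs vary or are hard to predict.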
These rules of thumb apply if the primary goal is to improve the quantity of data processed per unit of time. Other goals might motivate different tradeoffs. For example, to maximize battery life in a handheld device, cores may be turned off or their clock rates slowed. In this configuration, multiple cores enable very rapid response to urgent user demands, assuming it is possible for all cores to quickly power up and efficiently coordinate their efforts in response to the end user’s request. Optimizing responsiveness to urgent user requests on multicore platforms depends on being able to effectively utilize each core in providing the response. If the response computations cannot be easily balanced between multiple processor cores, a uniprocessor solution with a higher clock rate might actually deliver a higher-quality end user experience.
Speaking the Multicore “Language”
Engineers responsible for developing software for, or transitioning to, a multicore platform must consider general principles of organizational efficiency. For example, as managers delegate responsibilities, they must clearly identify the responsibility to the subordinate, and must grant to the subordinate all the authority required to fulfill the responsibility. The same principle applies directly to the division of labor among software components. In order to effectively utilize the full capacity of multicore processors, each core must be able to complete its assigned tasks without depending on frequent coordination with other cores. On modern multicore processors, coordination with another core may rob the currently executing application of hundreds of machine instructions’ worth of useful computation. Thus, a key to multicore optimization is selecting or developing algorithms that allow cores to complete execution without requiring frequent coordination with other cores.
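One common way to limit coordination, sketched here under assumed names, is to give each worker a private result buffer and combine the buffers only once, after all workers finish. The LocalThenMerge class and its histogram workload are hypothetical; the point is that the inner loop touches no shared state, so the only cross-thread coordination is the final join.

```java
public class LocalThenMerge {
    // Counts how many values fall into each bucket (value mod buckets),
    // using `workers` threads that each fill a private array.
    // Assumes workers >= 1 and buckets >= 1.
    static long[] histogram(int[] data, int buckets, int workers) {
        long[][] partial = new long[workers][];
        Thread[] threads = new Thread[workers];
        int chunk = (data.length + workers - 1) / workers;

        for (int w = 0; w < workers; w++) {
            final int id = w;
            final int from = Math.min(data.length, w * chunk);
            final int to = Math.min(data.length, from + chunk);
            threads[w] = new Thread(() -> {
                long[] local = new long[buckets]; // private to this worker
                for (int i = from; i < to; i++) {
                    local[Math.floorMod(data[i], buckets)]++;
                }
                partial[id] = local; // published once, read only after join()
            });
            threads[w].start();
        }

        long[] merged = new long[buckets];
        for (int w = 0; w < workers; w++) {
            try {
                threads[w].join(); // the single coordination point per worker
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
            for (int b = 0; b < buckets; b++) {
                merged[b] += partial[w][b];
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        int[] data = new int[100];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        System.out.println(java.util.Arrays.toString(histogram(data, 4, 3)));
    }
}
```

The alternative of incrementing one shared histogram from every thread would need a lock or atomic update on each element, which is exactly the frequent coordination the paragraph above warns against.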
Once the division of responsibilities is completed, the choice of programming language can influence how the intended semantics are implemented. The Java language, for example, was designed to support multiple threads of execution running on independent cores. As such, the language syntax has certain built-in features that map directly to the special needs of multiprocessor computation. The C and C++ languages were designed before multiprocessing was common, and they were originally designed to support a single thread of execution.
Regardless of the programming language, software engineers must carefully use the language to correctly implement the multiprocessor coordination activities that are required for the chosen division of responsibilities and selected algorithms. If used incorrectly, essential coordination may be omitted or excessive coordination may be implemented, resulting in unreliable or inefficient operation.
With Java, the built-in synchronized keyword marks a sequence of instructions that accesses variables potentially shared between threads running on multiple processors. The Java virtual machine acquires a mutual exclusion lock upon entry to the synchronized block of code, guaranteeing that only one thread at a time holds that lock and executes the code it protects. The Java language imposes automatic memory barriers as part of the implementation of every synchronized statement. Upon entry into the synchronized statement, all cached copies of non-local variables are refreshed from shared memory. Upon exit from the synchronized statement, any cached values of non-local variables that might have been modified by the current thread are written to shared memory. Further, the Java compiler is prohibited from reordering instructions across these memory barriers.
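The behavior described above can be illustrated with a small sketch. The SharedCounter class, its private lock object, and the runDemo helper are assumptions made for this example. Every access to the count field happens inside a synchronized block on the same lock, so increments from different threads cannot interleave, and the final read sees all of them.

```java
public class SharedCounter {
    private final Object lock = new Object(); // guards `count`
    private long count = 0;

    void increment() {
        synchronized (lock) { // acquire the lock; see the latest `count`
            count++;
        }                     // release the lock; publish the new `count`
    }

    long get() {
        synchronized (lock) {
            return count;
        }
    }

    // Starts `threads` workers that each call increment() `perThread` times,
    // waits for all of them, and returns the final count.
    static long runDemo(int threads, int perThread) {
        SharedCounter counter = new SharedCounter();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.increment();
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(runDemo(4, 10_000));
    }
}
```

Without the synchronized blocks, count++ would be an unprotected read-modify-write, and concurrent increments could be lost. Note also that this counter coordinates on every single increment, so it demonstrates correctness rather than an efficient division of labor.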