Historically, processors from the PowerPC® family, now known as Power Architecture® processors, have been the dominant choice for implementing Digital Signal Processing (DSP) in high-performance embedded military applications that take advantage of open-system commercial off-the-shelf (COTS) products. These applications include radar, signal intelligence, sonar, and image processing. Today, however, beginning with the dual-core Intel Core™ i7 processors, the low-power, high-performance advantages of the Intel architecture processor technology can be used for the first time to design DSP engines for the rugged deployed COTS signal processing space.
In the early 1990s, systems were implemented largely with specialized processors such as the Intel® i860 processor, the Texas Instruments 320C40, and the Analog Devices SHARC®. These processors were popular because of their floating-point performance.
In the late 1990s, Analog Devices and Texas Instruments introduced follow-on processors, the TigerSHARC® and 320C6701, respectively. Both had limited success, partly due to lack of software compatibility with their predecessors. The PowerPC processor from the Apple®/IBM®/Motorola® alliance, intended for personal computer use to compete with Intel x86-based processors, was also introduced at this time. Its reduced instruction set computer (RISC) architecture was touted as the future of high-performance microprocessors. But it was the introduction of the AltiVec™ instruction unit in the Motorola PowerPC 7400 (G4) that truly changed the signal processing landscape.
Signal processing experts soon realized that the floating-point-capable AltiVec unit could greatly accelerate the inner-loop processing found in common functions such as fast Fourier transforms (FFTs). The ability to perform up to four simultaneous floating-point multiplies and additions was revolutionary. Consequently, the PowerPC with AltiVec has had a long run in the military market with a continuous succession of faster processors, ending with the MPC8640/8641.
Curtiss-Wright Controls’ most recent multiprocessor DSP engine, the CHAMP-AV6, is based on the Freescale™ MPC8641 processor. Freescale has decided not to include the AltiVec unit in its next generation of high-performance processors. The QorIQ™ P4080 processor, announced last year, is an excellent choice for a single board computer (SBC), with its eight cores, integrated memory controllers, and Serial RapidIO® (SRIO) interface. Unfortunately, the lack of AltiVec severely impairs its floating-point performance. The processor core still features a regular floating-point capability, but it is not a vector processor, which is required to attain the level of performance needed for signal processing applications.
In contrast, Intel has continued to develop the floating-point capability of its processors. Intel processors feature a vector-processing unit generically known as Streaming Single Instruction, Multiple Data (SIMD) Extensions (SSEs), first introduced in the Intel Pentium® III processor. Since then, Intel has continually added features and new instructions, culminating in the current implementation, Intel Streaming SIMD Extensions (Intel SSE 4.2). Like AltiVec, SSE is a 128-bit wide processing unit, capable of simultaneously operating on four 32-bit floating-point values. SSE also features support for double-precision floating point, a feature that was never included in AltiVec. In multi-core Intel processors, each core has its own SSE unit, so the raw floating-point performance scales with the number of cores.
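As a rough illustration of how raw floating-point performance scales with core count, the sketch below multiplies cores, SIMD lanes, operations per cycle, and clock rate. The function name, the assumption that each core issues one packed multiply and one packed add per clock cycle, and the 2.0 GHz clock are illustrative assumptions rather than figures from this article.

```python
def peak_gflops(cores, clock_ghz, lanes=4, ops_per_cycle=2):
    """Theoretical peak single-precision GFLOPS.

    lanes: 32-bit floats per 128-bit SSE register (four, per the text).
    ops_per_cycle: ASSUMPTION - one packed multiply plus one packed add
    issued per core per clock cycle.
    """
    return cores * lanes * ops_per_cycle * clock_ghz


# Hypothetical 2.0 GHz parts: doubling the cores doubles the peak.
dual = peak_gflops(cores=2, clock_ghz=2.0)
quad = peak_gflops(cores=4, clock_ghz=2.0)
```

Peak figures of this kind are upper bounds; sustained throughput depends on memory bandwidth and on how well the inner loops keep the SSE units busy.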
Intel x86 processors are classic CISC (complex instruction set computing) processors. Because they complete many more instructions per clock cycle and offer higher code density, Intel processors can perform more than twice as much useful work per clock cycle as a RISC processor. As a result, beginning with the dual-core Intel Core i7 processors, the low-power, high-performance advantages of the Intel architecture processor technology can be used for the first time to design products such as DSP engines for the rugged deployed COTS signal processing space.
Signal Processing Performance
The latest generations of Intel architecture processors are produced on 32nm process technology and are based on the Intel microarchitecture formerly codenamed Westmere, which includes many features suited to high-performance, power-efficient execution of signal processing workloads.
To support high instruction throughput, the Intel microarchitecture contains a sophisticated memory subsystem. In a quad-core processor, each core contains a first-level instruction cache (32KB, 4-way), a first-level data cache (32KB, 8-way), and a second-level unified cache (256KB, 8-way); a third-level cache of up to 8MB (16-way) is shared among all the processor cores. With two or three DDR3 memory controllers, the processor can provide a peak memory bandwidth of 17.1 or 25.6 GB/s, respectively. This high-throughput capability is required to support the multi-gigabit rates for the processing of the sample streams in military signal processing applications such as radar.
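The two peak bandwidth figures can be reproduced with simple arithmetic. The sketch below assumes DDR3-1066 memory (about 1066.67 million transfers per second) and treats each controller as one 64-bit (8-byte) channel; the article does not state the memory speed, so both values, along with the function name, are assumptions.

```python
def peak_bandwidth_gbs(channels, transfers_per_sec, bus_bytes=8):
    """Peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return channels * transfers_per_sec * bus_bytes / 1e9


# ASSUMPTION: DDR3-1066, i.e. roughly 1066.67 million transfers per second.
DDR3_1066 = 1066.67e6

two_channel = peak_bandwidth_gbs(2, DDR3_1066)
three_channel = peak_bandwidth_gbs(3, DDR3_1066)
```

Rounded to one decimal place, these come to 17.1 and 25.6 GB/s, matching the figures quoted above under that assumption.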
Support for the efficient implementation of high-throughput signal processing is based on SSE instructions, which are extensions to the standard Intel Instruction Set Architecture (ISA). Including the latest generation, Intel SSE 4.2, there are more than 300 SSE instructions. SSE operations work from a set of sixteen 128-bit wide XMMx registers, capable of simultaneous operation on four packed floating-point values, as well as other formats (see accompanying table).
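To make the packed-register idea concrete, the sketch below models one 128-bit register as 16 bytes and performs a lane-wise add on four single-precision values. It only emulates the effect of a packed add such as the SSE ADDPS instruction in plain Python; the helper names are hypothetical, and real code would use compiler intrinsics or a tuned library instead.

```python
import struct

REGISTER_BYTES = 16  # one 128-bit XMM register


def pack_ps(a, b, c, d):
    """Pack four single-precision floats into one 16-byte 'register'."""
    return struct.pack("<4f", a, b, c, d)


def add_ps(x, y):
    """Lane-wise add of two packed-single registers (models ADDPS)."""
    xs = struct.unpack("<4f", x)
    ys = struct.unpack("<4f", y)
    return struct.pack("<4f", *(p + q for p, q in zip(xs, ys)))


x = pack_ps(1.0, 2.0, 3.0, 4.0)
y = pack_ps(0.5, 0.5, 0.5, 0.5)
result = struct.unpack("<4f", add_ps(x, y))

# The same 16 bytes can instead hold two double-precision values.
as_doubles = struct.pack("<2d", 1.0, 2.0)
```

One instruction therefore produces four single-precision results (or two double-precision results) at once, which is where the inner-loop speed-up comes from.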
One of the most common signal processing algorithms is the FFT. The FFT implementation shown in the accompanying figure is a version that is included in the Intel Integrated Performance Primitives (Intel IPP) library.
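For readers unfamiliar with the algorithm itself, the following is a minimal textbook radix-2 decimation-in-time FFT. It is not the Intel IPP implementation, which is heavily optimized for SSE; the function names are hypothetical, and the direct DFT is included only as a correctness check.

```python
import cmath


def fft_radix2(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])   # FFT of even-indexed samples
    odd = fft_radix2(x[1::2])    # FFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Butterfly: combine the two half-size results with a twiddle factor.
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dft(x):
    """Direct O(n^2) DFT, used only to check the FFT."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]
```

The butterfly loop is the inner loop that vector units accelerate: each iteration is a complex multiply followed by an add and a subtract, all of which map onto packed floating-point operations.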
Effective implementation of signal processing algorithms requires efficient use of all resources on the processor platform, so the ability to parallelize algorithms across multiple cores in a linear manner is essential. Parallelized scaling across the multiple cores of an Intel microarchitecture-based platform can be done for common operations used in signal processing, such as complex multiplication, or for more computationally intensive algorithms. A threading model can be used to implement the complex multiplication algorithm with parallel execution. The input data is divided into blocks, and each block, or number of blocks, depending on data size, is executed in full in separate parallel threads. This method assumes no interdependence between blocks. Some processor architecture parameters must be taken into account to optimize performance, including cache size, cache line alignment, and thread affinity. Inter-thread dependencies must also be minimized or avoided.
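The block-partitioning scheme described above can be sketched as follows. This is a structural illustration only: the function names, block size, and worker count are hypothetical, and pure-Python threads will not show the speed-up of native threads because of the interpreter's global lock. Cache line alignment and thread affinity are not expressible at this level and are omitted.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_block(a, b, start, stop):
    """Element-wise complex multiply of one independent block."""
    return [a[i] * b[i] for i in range(start, stop)]


def parallel_complex_multiply(a, b, block_size=1024, workers=4):
    """Split the inputs into blocks and multiply each block in its own task."""
    if len(a) != len(b):
        raise ValueError("inputs must have equal length")
    # Block boundaries; the last block may be shorter than block_size.
    bounds = [(s, min(s + block_size, len(a)))
              for s in range(0, len(a), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(multiply_block, a, b, s, e) for s, e in bounds]
        out = []
        for future in futures:  # collect in submission order
            out.extend(future.result())
    return out
```

Because no block reads another block's output, the tasks need no locks or synchronization beyond the final collection step. In a native implementation, the block size would typically be chosen so that each thread's working set fits in its core's cache.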
Intel Core™ i7 Processor Power Management
The discussion of processor performance in a rugged embedded application is entirely academic without also considering power consumption and cooling. Readers may understandably have a “>100W” impression of Intel processors from the perspective of gaming computers with massive CPU coolers. Perhaps less well known are the advances Intel has made in low-power operation. The scope of technologies devoted to power reduction is large, but the combination of internal architecture, low-leakage 32nm and 45nm high-K/metal gate transistors, and low-voltage capabilities means that system developers have a catalog of embedded-oriented devices to choose from, with maximum thermal design power ranging from 18W or less up to 45W. With fewer companion chips required, the Intel Core™ i7 processors have become viable for multiprocessor board-level products that will offer large gains in performance/watt relative to the prior generation of Power Architecture/AltiVec designs.
In addition to excellent FLOPS/watt metrics, Intel Core i7 processors have very useful capabilities for tailoring power consumption and monitoring silicon temperature. Intel SpeedStep® technology allows fine-grained control over the processor’s operating frequency. Because processor power consumption has a square-law relationship with voltage, the ability of the Intel Core i7 processor to direct the core power supply to reduce voltage at lower frequencies provides further power savings. Intel Core i7 processors also feature a multipoint temperature sensor that allows users to more confidently “dial up” performance, since the system can provide a precise indication that thermal limits are not being exceeded. This ability gives developers the opportunity to fine-tune system performance to minimize power consumption, especially in hot environments such as next to a jet engine. It will also be useful for systems that run in a reduced operational mode, such as a targeting radar that does not need to be fully active until a pilot engages that function.
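The benefit of lowering voltage along with frequency can be estimated with the standard dynamic-power approximation, in which power is proportional to capacitance times voltage squared times frequency. The sketch below is a simplified model that ignores leakage power; the function name and the 80 percent and 90 percent scaling factors are hypothetical, not measured Intel Core i7 values.

```python
def relative_dynamic_power(freq_scale, volt_scale):
    """Dynamic power relative to nominal, using P ~ C * V^2 * f.

    freq_scale and volt_scale are fractions of the nominal operating point.
    ASSUMPTION: switched capacitance is constant and leakage is ignored.
    """
    return volt_scale ** 2 * freq_scale


# Hypothetical operating points.
freq_only = relative_dynamic_power(0.8, 1.0)      # frequency reduced alone
freq_and_volt = relative_dynamic_power(0.8, 0.9)  # voltage reduced as well
```

In this model, cutting frequency alone to 80 percent saves 20 percent of dynamic power, while also dropping voltage to 90 percent brings dynamic power down to about 65 percent of nominal.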