Intel’s new Sandy Bridge microarchitecture is changing how software applications run and perform on server platforms. For applications to tap the full power of these new devices, developers will need to update not only their application software, but also the hardware platforms on which those applications run. Changes to Intel’s Xeon® E3 and E5 series of microprocessors include new instructions that accelerate common encryption tasks and floating-point calculations, as well as increased core counts and cache per CPU. Successfully transitioning to the Sandy Bridge microarchitecture will require careful planning on the part of developers.

Figure 1. Intel’s Tick-Tock Microarchitecture Roadmap
General-purpose microprocessors have traditionally served within the control plane of communications and networking equipment, leaving ASICs (Application-Specific Integrated Circuits), FPGAs (Field-Programmable Gate Arrays) and various accelerator cards to handle packet processing in the data plane. But that is all beginning to change as Intel’s faster and more efficient processors aim to replace many of the network processors commonly used in today’s enterprise- and carrier-class servers. Intel’s processor enhancements are also changing how pre-integrated server application software interoperates with onboard memory, disk drives, RAID controllers, and the Operating System (OS).

Enter Sandy Bridge

Sandy Bridge (Figure 1) is the codename for Intel’s next-generation Xeon-based microprocessor architecture, on which the E3 and E5 series of Xeon CPUs are based. As the successor to the Nehalem microarchitecture, Sandy Bridge CPUs are manufactured on Intel’s 32nm process. Sandy Bridge is designed to enhance a range of applications that run on notebooks, desktop computers, and enterprise-class servers. In trials, this new architecture has demonstrated up to 17% more CPU performance (clock-for-clock) compared to Lynnfield 45nm quad-core Xeon X34xx processors.

Sandy Bridge processors will increase CPU processing, memory, and I/O performance while reducing bottlenecks for applications that demand real-time data rates. These processors are far better equipped to handle applications that demand greater throughput and compute power, including deep-packet inspection security algorithms that support network port expansion. Video, multimedia, and telecom application developers can also capitalize on its unmatched performance and deploy more powerful and efficient appliance platforms with highly scalable port densities.

Among the more obvious server-based applications that benefit are packet processing, image processing, security (e.g., cryptography), and a host of high-speed (40 Gb/sec) networking platforms. Developers working on these and other high-throughput applications need to move quickly to take full advantage of the Sandy Bridge microarchitecture for improved performance.

What’s Under the Hood

Figure 2. Sandy Bridge Platform Feature Comparison
Sandy Bridge is optimized to deliver up to 60% more performance and 30% greater energy efficiency compared to its predecessor. With more available cores, each core running faster, built-in PCI Express (PCIe) 3.0 capability, more memory channels, and faster QPI (QuickPath Interconnect) links, Sandy Bridge has the potential to create entirely new application categories. Figure 2 depicts key attributes for three primary device types.

Embedded PCIe

For the first time, PCIe 3.0 I/O is embedded directly into each multicore CPU. By integrating PCIe 3.0 into the Sandy Bridge CPU architecture, platforms based on Intel’s Xeon E5 series CPUs can offer double the payload throughput of PCIe 2.0 when utilizing the same number of PCIe lanes per device. Additionally, dual-processor platforms utilizing Xeon E5 series CPUs provide 80 total PCIe 3.0 lanes for device connectivity, compared to the 36 PCIe 2.0 lanes commonly available with its predecessor. Combined, these changes give Sandy Bridge-based servers roughly 4.4 times the available I/O bandwidth of the previous generation.

This is an important milestone both for applications using RAID controllers and for those that must move from 10 Gb/s Ethernet to 40 Gb/s. It also reduces latencies to accelerate communications between Fibre Channel interconnects and InfiniBand switched fabrics. And as future I/O devices continue to increase in performance, Sandy Bridge offers the bandwidth needed for next-generation technologies, including emerging standards such as 12 Gb/s SAS controllers, direct-connect PCIe solid state drives, 100 Gb/s Ethernet, and high-performance GPUs.

Memory Upgrade

Sandy Bridge also offers more memory bandwidth. Memory channels running at up to 1,333 MHz on Xeon 5600 series-based platforms can now achieve 1,600 MHz on Xeon E5-based platforms. And for applications requiring peak memory throughput, Xeon E5-2600 series processors now include four memory channels per CPU versus just three on previous-generation Xeon 5600 series products. This allows dual-processor servers utilizing E5-2600 CPUs to offer up to eight independent memory channels, each running at up to 1,600 MHz.

Because of this, DDR3 memory modules should be installed in balanced groups of eight for optimal performance. An ideal configuration would comprise eight, 16, or 24 individual memory modules. Left unbalanced, memory performance may degrade and applications will not be able to take full advantage of the available memory bandwidth.

Turbo Mode

Figure 3. CPU Models Available in 2012
Turbo Mode Version 2 is extremely sophisticated and self-adjusts processor “gears” depending on the load. Processors with Turbo Mode 2 are allowed to factory overclock themselves when certain cores are underutilized or when the system has significant thermal headroom. This means that if developers are using a 2.4 GHz, 8-core CPU and all eight cores are not utilized – either because the application does not thread well or certain cores are dormant – the remaining CPU can automatically run faster.

By taking power away from inactive cores and applying it to active cores, a 2.4 GHz core can run at a higher multiplier, increasing its speed beyond 2.4 GHz and completing tasks and threads faster. The result is that when new programs or tasks are called on, they launch and run faster.

The same is true when certain cores are in heavy demand. Sandy Bridge will underclock or turn off unused cores and quickly apply that power budget to the cores under load. Turbo Mode is controlled by the operating system and system BIOS and requires no special coding by an application programmer.

New Instructions

Sandy Bridge’s Advanced Encryption Standard - New Instructions (AES-NI) provide new extensions to the x86 instruction set architecture for microprocessors and promise to boost the handling of AES-based encryption and decryption. AES-NI can be used to accelerate the performance of AES by 3 times to 10 times over software-only implementations.

Intel AVX is a new 256-bit instruction set extension to SSE, designed for floating-point-intensive applications. Intel AVX improves performance with wider vectors, new extensible syntax, and enriched functionality, all of which enable better management of data for general-purpose applications like image and audio/video processing, scientific simulations, financial analytics, and 3D modeling and analysis.

Model Hierarchy

Several CPU models were scheduled to become available in 2012 (Figure 3). They include the E3-1200, E5-2400, E5-2600, and E5-4600 (the first digit following E3/E5 indicates the number of CPUs that can be installed). Accordingly, the E3-1200 CPU is designed for single-socket systems, the E5-2400 CPU for dual-socket systems, and the E5-4600 for quad-socket systems.

The sweet spot for most developers will be the E5-2400 and E5-2600 series. The E5-2600 includes an extra memory channel and two QuickPath Interconnect (QPI) links, which allow the two CPUs to communicate twice as fast. The E5-2600 also delivers up to 80 PCIe 3.0 lanes (40 per CPU) – a tremendous improvement over earlier CPUs. So if an application thread running on the first CPU needs to access a PCIe card attached to the second CPU, it can use QPI to jump over to the other CPU and complete the request. For applications requiring several PCIe cards, the E5-2600 is likely the best choice.

Code Optimization

To get the best performance out of the Sandy Bridge architecture, it is essential to use the best available compiler, performance primitives, math kernel libraries, DSP libraries, and profiler tools. Tools like C++ Composer XE, Parallel Inspector, Trace Analyzer and Collector, and VTune™ Amplifier XE are available for Windows® and Linux® users. Integrated Performance Primitives (IPP), the Math Kernel Library (MKL), and Thread Building Blocks (TBB) help to extract the most out of the platform and are worth the time to download and test.

Developers can take advantage of multi-threaded application development to improve performance, and the proper code will be critical to making this adjustment successfully. Utilizing the multiple cores for greater processing power requires significant attention to code migration and optimization to ensure maximum exploitation of Sandy Bridge’s processing power.

What It All Means

What is clear to most developers is that Sandy Bridge can truly be a game-changer, particularly for security, enterprise communications, telecommunications and storage applications. Any application involving transcoding or decryption, including most video-related applications, VoLTE and secure endpoint communications, should consider transitioning to the new platform sooner rather than later to take full advantage of processing power and ensure best performance of new and evolving applications.

This article was written by Austin Hipes, Vice President of Technology, NEI (Canton, MA).