Many scientific, technical and engineering applications in finance, medical imaging, modeling, simulation, and image processing can benefit greatly from the floating point acceleration offered by modern general purpose graphics processing units (GPGPU). Today’s graphics processors have evolved into sophisticated, massively-parallel, highly-programmable compute engines ideally suited for algorithms with a high degree of data parallelism. Combined with modern parallel programming languages and application interfaces such as industry standard OpenCL™, GPGPU offers a new paradigm for high-performance computing (HPC).
In this article we plan to examine the use of GPGPU to accelerate iterative, grid-based, finite-difference methods used extensively in HPC applications. Finite-Difference Time-Domain (FDTD) solvers are used to model electromagnetic wave propagation. Velocity-stress finite-difference time-domain (VS-FDTD) solvers are used to model acoustic wave propagation in seismic applications. Both algorithms are ported to the ATI Radeon™ HD 5870 and ATI Radeon™ HD 5970 GPU co-processors and benchmarked against a CPU implementation. Let’s start with an overview of the ATI Radeon HD 5870 GPU.
ATI Radeon HD 5870 Graphics Processor
Overview: The ATI Radeon HD 5870 GPU is a high-performance graphics processor delivering 2.72 teraFLOPS (peak) single precision floating point performance, at 14.47 gigaFLOPS/ Watt. The GPU supports the OpenCL 1.0 API and complies with the IEEE 754-2008 floating point arithmetic standard. Configured with 1 GByte of GDDR5 memory, the GPU supports 153 GBytes/sec of memory bandwidth to local memory. Communication with a host CPU is via the x16 PCI Express® 2.0 bus.
Architecture: For GPGPU applications, key functional blocks include the 20 single-instruction, multiple-data (SIMD) engines, thread dispatch controller, memory controller, instruction / constant / memory caches and 64 Kbytes of global memory (Figure 1). Each of the SIMD engines contain an array of 80 stream processor elements (PEs), a 32 KByte local data store, a fetch unit with control logic and an 8 KByte L1 cache. Each SIMD engine is able to run its own instruction threads. Communication with other SIMDs is through the 64 KByte global memory. To maximize computation throughput and hide memory latency, the ultra-threaded dispatch processor launches multiple, simultaneous thread contexts. Memory bandwidth is optimized by four memory controllers connected to four 128 KByte L2 caches that coalesce memory reads and writes to drive the external 256-bit wide memory bus.
Each SIMD engine contains 80 stream PEs organized into 16 thread processors. Each thread processor consists of 5 PEs, a branch unit and general purpose registers. The 5 PEs include 4 32-bit general purpose floating point multiply and add units and 1 special purpose unit to support floating point multiply, add, and transcendental functions.
Performance: Using the GPGPU Computational Performance benchmark from SiSoftware Sandra 2010, the ATI Radeon HD 5870 GPU scores 1820 Mpixels/s.[a] SiSoftware Sandra calculates the Mandelbrot set using OpenCL. Benchmarking with ComputeMark, the ATI Radeon HD 5870 GPU scores 2212.[b] ComputeMark is a DirectX® 11 Shader Compute benchmark and is well suited to measure GPGPU performance, as it is capable of utilizing 99% of GPGPU resources.
OpenCL: The Open Computing Language, or OpenCL, is the open standard for parallel programming of GPGPUs and multi-core CPUs. OpenCL provides for a portable and a high-performance framework for the development of cross platform and vendor independent applications. An OpenCL application is partitioned into two primary components comprised of one or more computational kernels that are executed on the compute devices, and a host program that coordinates the asynchronous execution of the kernels and manages the movement of data to distributed memory resources.
GPGPU Implementation of Finite-Difference Time-Domain Solvers
Finite-Difference Time-Domain Methods: Finite-Difference Time-Domain (FDTD) methods are used to iteratively solve time-dependent partial differential equations by discretizing the equations and representing the solution on a spatial grid. The solution is evolved in time using an iterative time-stepping algorithm. An important feature of FDTD algorithms is the use of a local stencil so that the solution at any point in the grid is dependent upon the values of neighboring grid points. FDTD methods are used in many fields of scientific modeling and simulation and form the basic computational kernel upon which many software packages are built.
Electromagnetics: The FDTD method is used in computational electromagnetics where the time-dependent relationship between electric and magnetic field components is governed by Maxwell’s equations,
Using the FDTD method , the electric (E) and magnetic (H) field vectors are represented on staggered-grids and the solution is generated by updating E and H using a leap-frog algorithm that approximates the time-dependent solution through small discrete time-steps. The result is a time-varying solution that describes the propagation of electromagnetic waves. The FDTD method for electromagnetics is used for modeling and simulation in military and commercial applications ranging from radar to radio communications.
Seismic: The FDTD method is also used to model seismic wave propagation governed by the elastic wave equation,
The time-dependent solution describes the propagation of acoustic waves in an elastic medium. The solution for the particle velocity vector (ν) and stress tensor (σ) is represented on staggered-grids similar to those used in computational electro-magnetics. The Velocity-Stress FDTD (VS-FDTD) method  is an important algorithm used in seismic forward modeling with applications to oil and gas exploration and also military applications for detecting buried structures.
GPU Implementation: The basic FDTD algorithm is implemented using OpenCL targeting the ATI Radeon HD 5870 and ATI Radeon HD 5970 GPU co-processors. The E and H vector fields are represented on three-dimensional spatial grids and iteratively updated using local stencil operations that approximate the derivatives in the governing equations. The FDTD algorithm exhibits a low ratio of computation to memory access making data flow the primary consideration. The E and H vector fields are stored in separate arrays, with the vector components interlaced in stripes along the fastest index. The grids are re-ordered along the fastest index to optimize the use of SIMD operations on the GPU by aligning values used in non-local stencil operations. The updates for E and H are performed using separate OpenCL kernels executed in succession to approximate the time-evolution of the quantities over discrete time-steps.
The GPU implementation of the VS-FDTD algorithm for seismic wave propagation has a similar structure, but differs in two ways. First, the amount of data represented per grid point is greater since the model is comprised of a vector and symmetric tensor field (3+6 components) as compared with the two vector fields (3+3 components) used for electromagnetics. Furthermore, the seismic application uses more spatially dependent physical parameters. Second, the complexity of the computation of the updates is slightly more complex for VS-FDTD.
In order to utilize the dual GPUs on the ATI Radeon HD 5970, steps are introduced to exchange the edges of the sub-grids over which the simulation is distributed. The algorithm performs two iterative updates of each quantity and then exchanges edges sufficient to join the sub-grids. This technique can be extended to larger simulations utilizing additional GPUs per node or multiple nodes, where the latter case would require the use of MPI for inter-node communication.
Figure 2 illustrates the resulting electromagnetic (top) and seismic (bottom) simulations obtained by executing the FDTD and VS-FDTD algorithms on the ATI Radeon HD 5870 and ATI Radeon HD 5970 GPUs respectively.
Performance: Benchmarks were performed for both FDTD and VS-FDTD algorithms using an ATI Radeon HD 5870 GPU and ATI Radeon HD 5970 GPU, along with a standard C implementation targeting a CPU as a reference (Table).[c] The multi-core performance was extrapolated from results for a single core implementation on a hexacore processor, which places an upper bound on performance that is unlikely to be achieved with six (6) cores competing for resources. Performance is reported in terms of the time per iterative update normalized to a million-point grid.
The results demonstrate that these algorithms can be accelerated using an OpenCL implementation targeting GPUs. Relative performance between the GPU and CPU is very similar for both algorithms, with the seismic VS-FDTD showing slightly better acceleration due to the increased computational complexity. For the electromagnetic FDTD and seismic VS-FDTD algorithms, the acceleration using a single-GPU ATI Radeon HD 5870 is 2.9x and 3.2x, respectively. The acceleration factors are increased to 3.7x and 4.2x, respectively, for the dual-GPU ATI Radeon HD 5970, where the additional overhead of swapping edges between sub-grids and the reduced clocks prevents perfect scaling.
[a] System configuration: AMD Phenom™ II X4 940 processor-based system, 3 GHz, ASUSTek M3A79-T DELUXE, 4BG DDR2-166, Windows® 7 64-bit Enterprise OS, ATI Radeonô HD 5870 with 1GB memory, GPU clock 850 MHz, memory clock 1.2 GHz, Driver 8.680.0.0, OpenCL base build, SDK 2.0 Beta 4.
[b] System configuration: AMD Phenom II X4 940 3 GHz (Deneb), ASUSTeK M3A79-T DELUXE, 4GB Kingston DDR2-1066 MHz, Windows 7 64-bit ATI Radeon™ HD 5870 with 1GB memory, GPU clock 850 MHz, memory clock 1.2 GHz, ATI Driver: 8.73 base kit.
[c] System configuration (i) ATI Radeon HD 5870 single GPU, Memory: 1GB GDDR5, 850 MHz core clock, 1200 MHz memory clock, Catalyst 10.2, ATI Stream SDK V2.1, BDT coprthr_sdk: v1.0-RC1 (ii) Visiontek Radeon HD 5970 dual-GPU, Memory: 2GB GDDR5, 725 MHz core clock, 1000 MHz memory clock, Catalyst 10.2, ATI Stream SDK V2.1, BDT coprthr_sdk: v1.0-RC1 (iii) ASUS M4A79T motherboard, CPU: AMD Phenom II X6 1090T BE @ 3.2 GHz, Memory: 4×4GB DDR3-1600 (9-9-9-24 timing), OS: Linux CentOS 5.4 x86_64.
 A. Taflove and M.E. Brodwin, IEEE Trans. Microwave Theory Techn., Vol. 23, pp. 623-630, (1975).
 S.E. Minkoff, J. Sci. Comput. Vol. 24, No. 1, pp 1-19 (2002).