General-Purpose Computation On Graphics Hardware

GPGPU Overview

General-purpose computing on graphics processing units (GPGPU) is a widely used technology across many areas of High Performance Computing (HPC). GPGPU is deployed in SETI exploration, molecular modeling, computer node clusters, supercomputers, cloud computing, virtualization, bioinformatics algorithms, cryptography, finance, and many other scientific and engineering applications that must handle computationally intensive parallel problems, which have traditionally been performed on CPUs.

As a result of their floating-point computational capabilities and data-parallel architecture, GPUs are now widely deployed as application accelerators in the HPC arena.  Contemporary high performance computers equipped with GPU coprocessors typically execute parallel applications under the Single Program Multiple Data (SPMD) model, which requires a balance between the CPU computing resources and the GPU coprocessors to achieve optimal system utilization. (Li et al., 2011)

GPUs are fast, particularly when compared with CPUs. As an example, a circa 2005 GeForce 6800 Ultra has an observed 53 GFLOPS and 35.2 GB/sec peak memory bandwidth, which contrasts markedly with the theoretical 12 GFLOPS and 5.96 GB/sec peak memory bandwidth of a 3 GHz Pentium 4. (Harris, 2005) Since then, supercomputer vendors have incorporated GPUs into the compute blades of their parallel machines, such as the SGI Altix UV and the Cray XK6. Of note, the Tianhe-1A, ranked second on the TOP500 list of supercomputers, uses NVIDIA Tesla M2050 GPUs and has a sustained Linpack performance of 2.57 PFlop/s. (Li et al., 2011)
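To put the quoted figures side by side, the short calculation below divides the GPU numbers by the CPU numbers. It uses only the values cited above; the variable names are illustrative. Note that the GPU compute figure is an observed value while the CPU figure is theoretical, so the compute ratio is, if anything, conservative.

```python
# Figures quoted in the text (Harris, 2005).
gpu_gflops, gpu_bandwidth_gbs = 53.0, 35.2    # GeForce 6800 Ultra (observed GFLOPS)
cpu_gflops, cpu_bandwidth_gbs = 12.0, 5.96    # 3 GHz Pentium 4 (theoretical GFLOPS)

compute_ratio = gpu_gflops / cpu_gflops
bandwidth_ratio = gpu_bandwidth_gbs / cpu_bandwidth_gbs

print(f"compute advantage:   {compute_ratio:.1f}x")
print(f"bandwidth advantage: {bandwidth_ratio:.1f}x")
```

On these numbers the GPU comes out roughly 4.4 times faster in arithmetic throughput and roughly 5.9 times faster in memory bandwidth.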

Compute Unified Device Architecture

Why have GPUs become so fast and powerful so rapidly? The simplistic answer is that the multi-billion dollar video game industry has driven successive waves of innovation. The more accurate answer, however, is that the specialized nature of GPUs makes it easier to devote additional transistors to computation rather than to cache. (Harris, 2005) This, combined with the programmability of GPUs and the introduction of CUDA, has led to the explosive growth and adoption of GPGPU.

GPGPU Programming

CUDA kernels are written in the C programming language, with which most programmers are quite familiar. This makes the marked performance gains of the GPU accessible through the CUDA parallel computing platform. While the open-standard OpenCL is the dominant open general-purpose GPU computing language, typically found in personal computers, servers and handheld/embedded devices, NVIDIA’s “Compute Unified Device Architecture” (CUDA) is the predominant proprietary framework.
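The kernel idea can be illustrated without GPU hardware. The sketch below is a pure-Python emulation, not CUDA code: the function names `vector_add_kernel` and `launch` are hypothetical, and the threads run one after another rather than in parallel. What it does preserve is the core pattern of a CUDA kernel, in which every thread runs the same function and derives the element it owns from its block and thread indices.

```python
def vector_add_kernel(block_idx, block_dim, thread_idx, a, b, out):
    """Body executed by one logical thread: handles a single element."""
    i = block_idx * block_dim + thread_idx   # global index, as in CUDA
    if i < len(out):                         # guard threads past the end
        out[i] = a[i] + b[i]


def launch(kernel, num_blocks, block_dim, *args):
    """Hypothetical stand-in for a kernel launch; runs threads sequentially."""
    for block_idx in range(num_blocks):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


n = 10
a = list(range(n))
b = [10 * x for x in range(n)]
out = [0] * n

block_dim = 4
num_blocks = (n + block_dim - 1) // block_dim   # round up to cover all n
launch(vector_add_kernel, num_blocks, block_dim, a, b, out)
print(out)
```

Here three blocks of four threads are launched for ten elements, so the last two threads fall outside the array and are stopped by the bounds check. On a real GPU the same guard is needed whenever the data size is not a multiple of the block size.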

However, one of the difficulties facing current GPGPU programmers is writing code that utilizes multiple GPUs. One limiting factor is that only a few GPUs can be attached to a single PC, so the Message Passing Interface (MPI) is the typical means of deploying multiple GPUs across machines. Implementing parallel MPI code, however, is difficult in comparison with serial code.
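Part of that difficulty is that the programmer must divide the work among processes explicitly. The sketch below shows one common way to do so, a contiguous block partition. It does not use any MPI library; `partition` is a hypothetical helper, and the terms rank and number of ranks simply mirror MPI vocabulary.

```python
def partition(n_items, n_ranks, rank):
    """Return the half-open range [start, stop) owned by `rank`."""
    base, extra = divmod(n_items, n_ranks)
    # The first `extra` ranks take one additional item each.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


n_items, n_ranks = 10, 4
ranges = [partition(n_items, n_ranks, r) for r in range(n_ranks)]
print(ranges)
```

With ten items and four ranks, the ranges are (0, 3), (3, 6), (6, 8) and (8, 10). Serial code needs none of this bookkeeping, which is one reason the message-passing version is harder to write and to debug.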

As such, researchers have proposed the Distributed-Shared Compute Unified Device Architecture (DS-CUDA), a middleware that simplifies code for multiple GPUs distributed across a network. Because it is implemented at the source-code level, DS-CUDA offers a global view of the GPUs and allows a cluster of GPU-equipped systems to appear as a single unit with multiple GPUs. As a proof of concept, the researchers deployed it on 22 nodes (64 GPUs) of the TSUBAME 2.0 supercomputer. (Kawai et al., 2012)

GPGPU Mechanics

In contrast with CPUs, which are quite capable of performing one or two tasks at a time, GPUs are proficient at performing a huge number of tasks simultaneously. To accomplish this, GPUs utilize hundreds of arithmetic logic units (ALUs) to perform their integer arithmetic and logical operations. Because the ALUs of NVIDIA GPUs are programmable, it is easy to see how they can compute the colors of the 2,304,000 pixels of a 20 inch monitor.
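The pixel count can be checked with a line of arithmetic. The text does not state the resolution; the sketch below assumes 1920 x 1200, which multiplies out to exactly the quoted figure. The 60 Hz refresh rate is likewise an assumption, added only to indicate the scale of the per-second workload.

```python
width, height = 1920, 1200      # assumed resolution; not given in the text
pixels = width * height
print(f"{pixels:,} pixels per frame")

refresh_hz = 60                 # assumed refresh rate
print(f"{pixels * refresh_hz:,} pixel updates per second")
```

Under these assumptions the GPU must produce 2,304,000 pixel colors per frame and about 138 million per second, each of which can be computed independently of the others.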

GPGPU Applications and Usage

GPGPU Implementation

Optimally, CUDA-enabled GPUs excel at highly parallel algorithms that run many threads concurrently. Their use with serial algorithms is ill-advised, as is their use with fewer than a thousand threads. A CUDA-enabled GPU is also a good choice for number crunching: it easily performs 32-bit integer and floating-point operations, and is also capable of 64-bit integer and floating-point operations. Furthermore, the memory interface of the GPU is superior to that of the CPU, being upwards of ten times as fast. As such, the GPU is well suited for use with large datasets.

Statistical Physics

Statistical physics applies probability theory and statistics to large sample groups in order to solve physical problems.  Typically it employs a stochastic approach to obtain values from a corresponding sequence of jointly distributed random variables. (Rouse, 2005)  The methodology of statistical physics is used in a wide array of computer science applications, in computational physics, and in other interdisciplinary fields such as chemistry, biology, neurology, the social sciences, and quantitative finance. (Vogel, 2010)

To this end, GPGPU computing is very useful, particularly for applying Monte Carlo simulations in the sciences and in finance. Monte Carlo simulation is used for the computer simulation of many-body systems, as well as for traffic flow, stock market fluctuation analysis, and other related areas of computation. Using computer-generated random variables, multiple simulations are run to calculate probability distributions, which in turn allow the properties of various systems to be estimated probabilistically. (Binder & Heermann, 2010)
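As a minimal sketch of the Monte Carlo idea, the example below estimates pi by drawing random points in the unit square and counting how many fall inside the quarter circle. This toy problem is chosen for brevity and is not taken from the cited sources; the function name `estimate_pi`, the sample counts, and the seeds are all illustrative. It runs on the CPU, but each sample and each run is independent of the others, which is the property that makes such simulations a natural fit for the many threads of a GPU.

```python
import random


def estimate_pi(n_samples, seed):
    """One independent simulation run with its own random stream."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:      # point lies inside the quarter circle
            inside += 1
    return 4.0 * inside / n_samples


# Run several independent simulations and average their results.
runs = [estimate_pi(50_000, seed) for seed in range(4)]
estimate = sum(runs) / len(runs)
print(f"pi is approximately {estimate:.3f}")
```

The statistical error shrinks in proportion to the inverse square root of the total number of samples, so higher accuracy calls for many more samples. That cost is what motivates moving the sampling loop onto parallel hardware.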

Econophysics. Large banks and other financial entities also use GPGPUs to make automated high-frequency trading (HFT) decisions. Algorithms running on GPGPUs analyze the flow of incoming information received from the exchange system and determine whether a buy or sell order is warranted.  In addition, GPGPUs can aid in reducing round-trip time (RTT), as they are used to determine the locations closest to the computer infrastructure of the central server of the exchange. (Preis, 2011)

Cloud computing. Cloud computing utilizes multi-GPGPU configurations to allow cloud servers to host multiple virtualized operating systems and to offer high performance computing at a low price. On the server side of the cloud, the GPGPUs may run at full capacity while still enabling greater energy efficiency, with a resultant lower environmental impact. (Fremal, Bagein & Manneback, 2012)

Cryptography. By using CUDA on GPGPUs, the cryptographic hash processing of password cracking applications may be parallelized.  In one study, researchers parallelized John the Ripper with a GPGPU and reduced its hash processing time to 0.03% of that required on a dual-core CPU. (Murakami, Kasahara, & Saito, 2010)
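Hash processing parallelizes well because each candidate can be hashed independently of every other candidate. The sketch below illustrates that structure only: it uses SHA-256 and CPU threads from the Python standard library as stand-ins, not the hash format or CUDA implementation of the cited study, and the helper names and toy word list are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def sha256_hex(word):
    """Hash a single candidate string."""
    return hashlib.sha256(word.encode("utf-8")).hexdigest()


def find_match(target_hash, candidates, workers=4):
    """Hash candidates in parallel; return the first match or None."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each candidate is hashed independently of the others.
        for word, digest in zip(candidates, pool.map(sha256_hex, candidates)):
            if digest == target_hash:
                return word
    return None


candidates = ["alpha", "bravo", "charlie", "delta"]   # toy word list
target = sha256_hex("charlie")
print(find_match(target, candidates))
```

On a GPU, the role of the thread pool would be played by thousands of hardware threads, each hashing its own candidate.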

In another study, researchers deployed an off-the-shelf NVIDIA 8800 GTS GPU as an accelerator for the RSA and DSA asymmetric cryptosystems, as well as for Elliptic Curve Cryptography (ECC). They were able to compute 813 modular exponentiations per second for the RSA and DSA-based systems with 1024-bit integers, and achieved a throughput of 1412 point multiplications per second on the ECC system. (Szerwinski & Güneysu, n.d.)
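For readers unfamiliar with the operation being counted, modular exponentiation computes base^exponent mod modulus. The sketch below is the textbook square-and-multiply method, shown only to clarify what one such operation involves; the text does not describe how the cited study implemented it on the GPU, and the function name `mod_exp` is illustrative. Python's built-in three-argument `pow` performs the same computation.

```python
def mod_exp(base, exponent, modulus):
    """Right-to-left square-and-multiply: base**exponent mod modulus."""
    result = 1 % modulus
    base %= modulus
    while exponent > 0:
        if exponent & 1:                      # current exponent bit is set
            result = (result * base) % modulus
        base = (base * base) % modulus        # square for the next bit
        exponent >>= 1
    return result


print(mod_exp(4, 13, 497))   # prints 445
```

The loop runs once per bit of the exponent, so a 1024-bit exponent requires on the order of a thousand multiplications of 1024-bit numbers for a single exponentiation.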

Conclusion

We have seen that GPGPUs may be deployed in statistical physics, computer science applications, chemistry, biology, neurology, the social sciences, and quantitative finance. They have also seen wide use in virtualized operating systems on cloud servers, as well as in cryptanalysis. Furthermore, as a result of ongoing developments and enhancements to general-purpose computing on graphics processing units, we are certain to see its utilization continue to proliferate.