Parallelizing Genomic Selection software using GPUs
Abstract. NVidia’s CUDA framework has brought supercomputing to the masses allowing programmers to take advantage of the highly parallel capabilities of their Graphics Processing Units. We analyzed a popular Genomic Selection software’s codebase and identified key areas where it could benefit from parallelization. Using the CUDA C++ language extensions, we identified areas in GenSel that could be parallelized and also discovered some issues associated with our attempts to parallelize the critical sections. We also utilized a CUDA Linear Algebra Library that targets speeding up linear algebra computations by moving the work to the GPUs and examined how that impacted the run time.
Affordable Graphics Processing Units (GPUs) have revolutionized the personal computing industry. GPUs offer massively parallel, many-core processing capabilities at an affordable cost. NVidia’s CUDA (Compute Unified Device Architecture) is a framework and an extension to the C language that gives programmers the ability to utilize the parallel architecture of the GPUs for general purpose programming. The general purpose programming language effectively gives the programmer a commodity supercomputer.(GPU-Accelerated Diffe...)
The high-performance of general purpose graphics processing units (GPGPUs) has made it an attractive target for numerous numerical applications in science and engineering. GenSel is a piece of software written in C++ mainly by Rohan Fernando of Iowa State Universiy which performs analyses related to Genomic Selection that uses genetic information from animals to make inferences on the effects of each marker loci on the phenotypes recorded. It uses Bayesian analyses with Monte Carlo Markov Chain (MCMC) methods to compute the posterior probabilities (citation not found: rohan). Bayesian analysis is a statistical method of analysis that infers future probabilities based on past events recorded. MCMC are a class of methods intended to sample how likely something is to occur by computing prior probabilities and sampling based on those.
Programmers have had success parallelizing algorithms using Monte Carlo Markov Chain (MCMC) methods in the past. This paper presents a description of where the GenSel software can be parallelized as well as some preliminary results of parallelizing the BayesC method (Dumont, Parallel Metropolis c...).
CUDA is a proprietary toolchain developed and maintained by Nvidia that allows developers to compile, debug, and profile C/C++ programs and CUDA kernels for use on GPGPUs.
CUDA kernels are sets of instructions executed in parallel by a batch of threads; this follows the Single Instruction Multiple Data (SIMD) espoused by CUDA where each thread runs a program reading and writing from a shared global memory space. The threads are logically divided into blocks with a unique identifier existing for each thread. Blocks are also part of a larger unit called the grid; this grid can be launched with blocks in two dimensions while each block can be launched with three dimensions of threads. The distinction between dimensions serves purely as an abstraction for the programmer within the kernel so they can identify which thread executing should operate on which data.
Similar to a CPU’s architecture, the GPU also has multiple types of memory available to it. At the highest level available to all threads is global and constant memory. When moving non-primitive values from the host to the GPU the developer must either copy into the GPU’s global or constant memory. Constant memory is significantly faster than global memory however it is immutable when executing CUDA kernels. Once the memory is on the GPU the developer can either access it directly from all threads and all blocks or subdivide the memory into local or shared memory. Local memory is accessible on a per thread basis and is the fastest type of memory available whereas shared memory is accessible to all threads within a block.
Genomic prediction is used to predict effects of chromosome fragments in some training population of data. The predicted effects are then crosschecked against a validation population. Most commercial livestock is multibreed (MB) or crossbred (CB), while the seed stock are typically purebred animals (PB). Evaluating each marker for its performance involves estimating the effect of Single Nucleotide Polymorphism (SNP). Training the data on MB animals allows selection of PB for performance, while training on PB helps to predict and quantify the potential for performance when crossbreeding with MB animals.
Marker effects were estimated using the BayesC algorithm presented by Kizilkaya et al (Genomic prediction of...). The inputs for the algorithm are a list of marker loci and potential phenotypes that they represent. A matrix of daughter yield deviation vectors for each animal i is formed where each row is a vector defined by the following algorithm: the vector representing the effect corrections (the adjusted effects caused by each loci for each animal i) for animal i plus the background gene effects plus the environmental effects, plus the size of the QTL effects for each parent. The goal is to calculate the joint posterior probability density of the unknown parameters mentioned in the algorithm previously (every vector except the effects caused by each loci for each animal).
The posterior probability is the probability that an event occurs given recent evid