Parallelizing Genomic Selection Software Using GPUs


Abstract. NVidia’s CUDA framework has brought supercomputing to the masses, allowing programmers to take advantage of the highly parallel capabilities of their Graphics Processing Units. We analyzed the codebase of GenSel, a popular Genomic Selection software package, and identified key areas that could benefit from parallelization. Using the CUDA C++ language extensions, we parallelized several of these critical sections and documented the issues that arose in doing so. We also employed a CUDA linear algebra library, which speeds up linear algebra computations by moving the work to the GPU, and examined how it affected the run time.


Affordable Graphics Processing Units (GPUs) have revolutionized the personal computing industry. GPUs offer massively parallel, many-core processing capabilities at low cost. NVidia’s CUDA (Compute Unified Device Architecture) is a framework and an extension to the C language that gives programmers the ability to use the parallel architecture of GPUs for general-purpose programming, effectively placing a commodity supercomputer in the programmer’s hands (GPU-Accelerated Diffe...).

The high performance of general-purpose graphics processing units (GPGPUs) has made them an attractive target for numerous numerical applications in science and engineering. GenSel is a piece of software written in C++, mainly by Rohan Fernando of Iowa State University, that performs analyses related to Genomic Selection: it uses genetic information from animals to make inferences about the effect of each marker locus on the recorded phenotypes. It employs Bayesian analyses with Markov chain Monte Carlo (MCMC) methods to compute the posterior probabilities (citation not found: rohan). Bayesian analysis is a statistical method that infers the probability of future events from previously recorded ones by combining prior beliefs with observed data. MCMC methods are a class of algorithms that sample from a probability distribution by constructing a Markov chain whose stationary distribution is the distribution of interest.
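
For reference, the relationship underlying these analyses is Bayes’ rule, written here in generic notation (the symbols are ours, not GenSel’s):

p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta),

where \theta collects the unknown parameters (such as marker effects) and y denotes the recorded phenotypes: the posterior is proportional to the likelihood of the data times the prior.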

Programmers have had success parallelizing algorithms that use MCMC methods in the past. This paper describes where the GenSel software can be parallelized and presents preliminary results from parallelizing the BayesC method (Dumont, Parallel Metropolis c...).

CUDA Overview

CUDA is a proprietary toolchain developed and maintained by NVidia that allows developers to compile, debug, and profile C/C++ programs and CUDA kernels for execution on GPGPUs.
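
As a minimal illustration (the file name and architecture flag below are our own assumptions, not taken from the GenSel build), a source file mixing host code and kernels is compiled with the nvcc driver:

nvcc -O2 -arch=sm_20 -o gensel_gpu main.cu

nvcc separates the host C/C++ code, which it hands to the system compiler, from the device kernels, which it compiles for the GPU architecture named by -arch.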

CUDA Thread Model

CUDA kernels are sets of instructions executed in parallel by a batch of threads; this follows the Single Instruction Multiple Data (SIMD) model espoused by CUDA, in which each thread runs the same program while reading from and writing to a shared global memory space. Threads are logically divided into blocks, and each thread has a unique identifier. Blocks in turn belong to a larger unit called the grid; a grid can be launched with blocks arranged in two dimensions, while each block can be launched with threads arranged in three dimensions. These dimensions serve purely as an abstraction for the programmer: within a kernel, they let each executing thread determine which data it should operate on.
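
The following sketch (our own illustration, not code from GenSel) shows the usual idiom: the built-in block and thread identifiers are combined into one global index so that each thread operates on its own element of an array.

#include <cuda_runtime.h>

// Scale a vector by a constant; each thread handles one element.
__global__ void scale(float *data, float alpha, int n)
{
    // Combine the block and thread identifiers into a unique global index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // threads past the end of the array do nothing
        data[i] *= alpha;
}

int main(void)
{
    const int n = 1024;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));
    // Launch enough 256-thread blocks to cover all n elements.
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}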

CUDA Memory Model

Similar to a CPU’s architecture, the GPU has multiple types of memory available to it. At the highest level, visible to all threads, are global and constant memory. When moving non-primitive values from the host to the GPU, the developer must copy them into either global or constant memory. Constant memory is cached and therefore significantly faster to read than global memory, but it is read-only while CUDA kernels execute. Once the data is on the GPU, the developer can either access it directly from all threads in all blocks or stage it in faster on-chip storage: registers, which are private to each thread, are the fastest memory available, whereas shared memory is accessible to all threads within a block.
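
The fragment below is again our own sketch with hypothetical names: it touches each of these memory spaces by declaring a __constant__ array written once from the host, global arrays allocated with cudaMalloc, and a __shared__ staging buffer inside the kernel.

#include <cuda_runtime.h>

#define N 256

// Constant memory: cached and fast to read, but read-only in kernels.
__constant__ float c_weights[N];

__global__ void weighted_copy(const float *g_in, float *g_out)
{
    // Shared memory: visible to every thread in this block.
    __shared__ float s_buf[N];

    int i = threadIdx.x;      // a single block of N threads
    s_buf[i] = g_in[i];       // stage the global-memory tile in shared memory
    __syncthreads();          // wait until the whole tile is loaded

    g_out[i] = s_buf[i] * c_weights[i];
}

int main(void)
{
    float h_weights[N];
    for (int i = 0; i < N; ++i) h_weights[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemset(d_in, 0, N * sizeof(float));

    // Host-to-device copies: one into constant memory, one into global memory.
    cudaMemcpyToSymbol(c_weights, h_weights, sizeof(h_weights));

    weighted_copy<<<1, N>>>(d_in, d_out);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}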

BayesC Algorithm

Genomic prediction is used to predict the effects of chromosome fragments in some training population. The predicted effects are then cross-checked against a validation population. Most commercial livestock are multibreed (MB) or crossbred (CB), while seed stock are typically purebred (PB) animals. Evaluating each marker for its performance involves estimating the effect of each Single Nucleotide Polymorphism (SNP). Training on data from MB animals allows selection of PB animals for performance, while training on PB data helps to predict and quantify the potential performance of crossbreeding with MB animals.

Bayesian Estimation of SNP Effects

Marker effects were estimated using the BayesC algorithm presented by Kizilkaya et al. (Genomic prediction of...). The inputs to the algorithm are a list of marker loci and the potential phenotypes they represent. A matrix of daughter yield deviation vectors is formed, with one row per animal i; each row is the sum of the effect corrections (the adjusted effects caused by each locus for animal i), the background gene effects, the environmental effects, and the sizes of the QTL effects contributed by each parent. The goal is to calculate the joint posterior probability density of the unknown parameters just mentioned (every vector except the effects caused by each locus for each animal).
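
In symbols, the description above corresponds to a model of roughly the following form (the notation is ours, assumed for illustration rather than taken from Kizilkaya et al. or the GenSel source):

\mathbf{y}_i = \mathbf{X}_i \boldsymbol{\alpha} + \mathbf{u}_i + \mathbf{e}_i + \mathbf{q}_i,

where \mathbf{y}_i is the daughter yield deviation vector for animal i, \mathbf{X}_i \boldsymbol{\alpha} collects the corrected effects of each locus, \mathbf{u}_i the background gene effects, \mathbf{e}_i the environmental effects, and \mathbf{q}_i the QTL effects contributed by each parent.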

The posterior probability is the probability that an event occurs given the observed evidence. BayesC estimates the probability that the parameters affect the given phenotypes by combining prior information about how the phenotypes would be affected with the likelihood of the parameters. The difficulty in computing these probabilities is that there is often no closed form for the posterior, or evaluating it would be infeasible because it requires integrating over many dimensions. Instead, samples are drawn from the probability distribution. The sampling starts from a valid starting point and draws each parameter (or locus effect) in turn from its distribution conditional on the current values of all the others, so that each draw accounts for the parameters sampled before it. This procedure is called Gibbs sampling; iterated a large number of times, it forms a Markov chain whose stationary distribution estimates the posterior distribution of the parameters’ effects.
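
The skeleton below illustrates this sampling loop generically (the names are hypothetical, BayesC’s locus inclusion indicator is omitted for brevity, and this is not GenSel’s implementation). Each locus effect is redrawn from a normal full conditional given the residual left by all the other loci; it is this per-locus, per-iteration structure that the parallelization effort targets.

#include <cstdlib>
#include <cmath>
#include <vector>

// Draw one sample from N(mean, sd^2) via the Box-Muller transform.
static double draw_normal(double mean, double sd)
{
    double u1 = (rand() + 1.0) / (RAND_MAX + 2.0);
    double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
    return mean + sd * sqrt(-2.0 * log(u1)) * cos(2.0 * 3.141592653589793 * u2);
}

// One Gibbs iteration: redraw every locus effect alpha[j] conditional on
// the residual computed with all other effects held at their current values.
void gibbs_iteration(const std::vector<std::vector<double> > &Z, // genotype codes
                     std::vector<double> &resid,                 // y - Z * alpha
                     std::vector<double> &alpha,                 // locus effects
                     double sigma_e2,                            // residual variance
                     double sigma_a2)                            // effect variance
{
    const size_t n = resid.size(), k = alpha.size();
    for (size_t j = 0; j < k; ++j) {
        double zz = 0.0, zr = 0.0;
        for (size_t i = 0; i < n; ++i) {
            resid[i] += Z[i][j] * alpha[j];   // add locus j's effect back in
            zz += Z[i][j] * Z[i][j];
            zr += Z[i][j] * resid[i];
        }
        // Normal full conditional for alpha[j] under a N(0, sigma_a2) prior.
        const double denom = zz + sigma_e2 / sigma_a2;
        alpha[j] = draw_normal(zr / denom, sqrt(sigma_e2 / denom));
        for (size_t i = 0; i < n; ++i)
            resid[i] -= Z[i][j] * alpha[j];   // subtract the fresh draw
    }
}

int main(void)
{
    // Toy data: 4 animals, 3 loci, all genotype codes set to 1.0.
    std::vector<std::vector<double> > Z(4, std::vector<double>(3, 1.0));
    std::vector<double> resid(4, 0.5), alpha(3, 0.0);
    for (int iter = 0; iter < 1000; ++iter)
        gibbs_iteration(Z, resid, alpha, 1.0, 0.1);
    return 0;
}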