Authorea

James Shirley generating latex version of article about 11 years ago

Commit id: 9577e441a0e9ffa325e7ac849305850b656d9a0c


\documentclass[]{article}
\usepackage{amssymb,amsmath}
\usepackage{ifxetex,ifluatex}
\ifxetex
  \usepackage{fontspec,xltxtra,xunicode}
  \defaultfontfeatures{Mapping=tex-text,Scale=MatchLowercase}
\else
  \ifluatex
    \usepackage{fontspec}
    \defaultfontfeatures{Mapping=tex-text,Scale=MatchLowercase}
  \else
    \usepackage[utf8]{inputenc}
  \fi
\fi
\ifxetex
  \usepackage[setpagesize=false, % page size defined by xetex
              unicode=false, % unicode breaks when used with xetex
              xetex,
              colorlinks=true,
              linkcolor=blue]{hyperref}
\else
  \usepackage[unicode=true,
              colorlinks=true,
              linkcolor=blue]{hyperref}
\fi
\hypersetup{breaklinks=true, pdfborder={0 0 0}}
\setlength{\parindent}{0pt}
\setlength{\parskip}{6pt plus 2pt minus 1pt}
\setlength{\emergencystretch}{3em} % prevent overfull lines
\setcounter{secnumdepth}{0}

\author{James Shirley}

\begin{document}

\newcommand{\truncateit}[1]{\truncate{0.8\textwidth}{#1}}
\newcommand{\scititle} {[}1{]}

\textbf{Abstract}. NVIDIA's CUDA framework has brought supercomputing to the masses, allowing programmers to take advantage of the highly parallel capabilities of their Graphics Processing Units. We analyzed the codebase of a popular Genomic Selection program and identified key areas where it could benefit from parallelization. Using the CUDA C++ language extensions, we parallelized those areas and found X speedup.

\section{Introduction}

Affordable Graphics Processing Units (GPUs) have revolutionized the personal computing industry. GPUs offer massively parallel, many-core processing capabilities at an affordable cost. NVIDIA's CUDA (Compute Unified Device Architecture) is a framework and an extension to the C language that gives programmers the ability to use the parallel architecture of GPUs for general-purpose programming.
This general-purpose programming model effectively gives the programmer a commodity supercomputer \cite{1}. The high performance of general-purpose graphics processing units (GPGPUs) has made them an attractive target for numerous numerical applications in science and engineering.

GenSel is a program, written mainly by Rohan Fernando in C++, that performs analyses related to Genomic Selection. It uses information about animals' genotypes and phenotypes to make inferences about the effect of each marker locus on the phenotypic output (?). It uses Bayesian analyses with Markov chain Monte Carlo (MCMC) methods to compute the posterior probabilities. Programmers have had success parallelizing algorithms that use MCMC methods in the past. This paper describes where the GenSel software can be parallelized and presents some preliminary results of parallelizing the BayesC method.

\section{BayesC Algorithm}

Genomic selection involves using purebred (PB) animals to improve performance when crossbreeding or breeding with other PB animals. Evaluating each animal for crossbreeding performance involves estimating the effects of Single Nucleotide Polymorphisms (SNPs) on crossbred performance, using the phenotypes and genotypes from crossbreeds, and correlating them to purebred performance.

\subsection{Bayesian Estimation of SNP Effects}

Marker effects were estimated using the BayesC algorithm presented by Kizilkaya et al.\ \cite{2}.
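
For orientation, the model behind BayesC is commonly written in the following form. The notation ($y_i$, $z_{ij}$, $a_j$, $\delta_j$, $K$, $\pi$, $\sigma^2_a$, $\sigma^2_e$) is assumed here for illustration and is not taken from GenSel or from the cited paper, so this should be read as a sketch of the usual formulation rather than the exact parameterization used.

\begin{align*}
y_i &= \mu + \sum_{j=1}^{K} z_{ij}\, a_j\, \delta_j + e_i, \qquad e_i \sim N(0, \sigma^2_e), \\
a_j &\sim N(0, \sigma^2_a), \qquad \delta_j \sim \text{Bernoulli}(1 - \pi).
\end{align*}

Here $y_i$ is the phenotype of animal $i$, $\mu$ the intercept, $z_{ij}$ the genotype of animal $i$ at locus $j$, $a_j$ the effect of locus $j$, and $\delta_j$ an indicator of whether locus $j$ is included in the model. Under these assumptions all $K$ loci share a single effect variance $\sigma^2_a$, and a locus has no effect with probability $\pi$.
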
The algorithm uses MCMC methods.

\subsection{Algorithm Overview}

\begin{enumerate}
\item For i in [1..chainLength]:
  \begin{enumerate}
  \item Sample the residual variance
  \item Sample the intercept
  \item For each j in [0..numberOfLoci]:
    \begin{enumerate}
    \item Adjust the phenotypes for the current locus j
    \item Calculate the variance for the current locus
    \item Sample from a uniform distribution
    \item If the probability is less than the random variable: Something
    \item Else: Something else
    \end{enumerate}
  \item Sample the locus effect variance
  \item Accumulate the posterior mean of the probability distribution
  \end{enumerate}
\end{enumerate}

\section{BayesC In CUDA}

CUDA gives programmers flexibility in how they parallelize their code. It provides powerful thread-level parallelism that allows loops to be effectively ``unrolled'' so that, conceptually, every iteration runs simultaneously. We targeted for parallelization the inner loop, which samples the effect of each locus.

\subsection{Loci Parallelization Overview}

\begin{enumerate}
\item Move the genotype and phenotype data to the GPU
\item Launch a kernel with one thread per locus
\item Re-form the matrices and vectors on the GPU
\item Compute the effect of each locus
\item Move the data back to the CPU
\end{enumerate}

\subsection{CUDA Library Usage}

A shortcoming of the current CUDA architecture is that kernels cannot launch other kernels and so.

\section{Results}

\section{Conclusion}

\end{document}