\subsection{Bayesian Estimation of SNP Effects}

Marker effects were estimated using the BayesC algorithm presented by
Kizilkaya et al.\ \cite{kizilkaya}. The inputs for the algorithm are a list of
marker loci and the potential phenotypes that they represent. A matrix of
daughter yield deviation vectors is formed in which each row is the vector
for animal $i$: the effect corrections (the adjusted effects caused by each
locus for animal $i$) plus the background gene effects, the environmental
effects, and the QTL effect sizes for each parent. The goal is to calculate
the joint posterior probability density of the unknown parameters mentioned
previously (every vector except the per-locus effects for each animal). The
posterior probability is the probability that an event occurs given the
evidence observed so far; BayesC estimates how probable it is that the
parameters affect the given phenotypes by combining prior information about
how the phenotypes would be affected with the likelihood of the parameters
occurring. The difficulty with computing these probabilities is that there is
often no closed form for them, or evaluating one would be infeasible because
it requires integrating over many dimensions. Instead, samples are drawn from
the probability distribution. The sampling starts from a valid starting point
and draws samples from the distribution one parameter (or locus) at a time,
each draw conditioned on the current values of the others. This is called
Gibbs sampling, and by iterating a large number of times it forms a Markov
Chain whose stationary distribution estimates the probability distribution of
the parameters' effects.
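In more formal terms, if $\theta$ collects the unknown parameters (marker
effects, variances, and so on) and $y$ denotes the phenotypes, Bayes' rule gives
\[
p(\theta \mid y) \;=\; \frac{p(y \mid \theta)\,p(\theta)}{p(y)}
\;\propto\; p(y \mid \theta)\,p(\theta),
\]
and the normalizing integral $p(y)$ is exactly the quantity that has no
convenient closed form here. Gibbs sampling avoids it by repeatedly drawing
each parameter from its full conditional $p(\theta_j \mid \theta_{-j}, y)$;
the sequence of draws is the Markov Chain referred to above.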

\section{CUDA Implementation}

We chose two different tactics for our CUDA implementation.

\subsection{Parallelizing across Loci}

The first tactic was to write our own kernel. This consists of identifying
sections of code (often loops) where multiple sets of data are used with the
same code. To effectively utilize the multi-thread model, each thread should
execute without dependencies on the others. We believed the area with the
greatest potential for speedup on the GPU was the loop sampling the effects
for each marker locus, which might run with 50 thousand to 700 thousand
markers. The code for this loop follows:

\begin{verbatim}
void sampleEffects() {
  Eigen::Map<Eigen::VectorXf> currentDataLocus(genotypeDataPointer, numObs);
  Eigen::Map<Eigen::VectorXf> currentTestLocus(genotypeTestPointer, testObs);
  for (unsigned i = 0; i < numberMarkers; i++)
  {
    new (&currentDataLocus) Eigen::Map<Eigen::VectorXf>(&dataPtr->X(0, i), numObs);
    new (&currentTestLocus) Eigen::Map<Eigen::VectorXf>(&testPtr->X(0, i), testObs);
    float oldSamplei = solSample[i];
    // ISSUE A: rhsModeli depends on the current RInverseY
    float rhsModeli = (currentDataLocus.dot(RInverseY)) + diagLhs(i) * oldSamplei;
    rhsModeli *= invVarResidual;
    float Lhs = diagLhs[i] * invVarResidual + invVarEffects;
    float probDeltaOne = 1.0f / (1.0f + exp(logDeltaZero + 0.5f * (log(Lhs)
                         + logVarEffects - rhsModeli*rhsModeli/Lhs) - logPiComp));
    if (matvec::ranf() < probDeltaOne) {
      sigma[i] = varEffects; // sigma[i] is used in sampleCommonSigma
      solSample[i] = matvec::snorm() * sqrt(1.0f/Lhs) + rhsModeli/Lhs;
      meanEffects[i]  += (solSample[i] - meanEffects[i]) * invThisIter;
      sdDelta1[i]     += (solSample[i] * solSample[i] - sdDelta1[i]) * invThisIter;
      mlFreq[i]       += (1.0f - mlFreq[i]) * invThisIter;
      estLocusVars[i] += (varEffects - estLocusVars[i]) * invThisIter;
      // ISSUE B: RInverseY is adjusted when the locus gets a sample
      RInverseY += ((currentDataLocus.array() * dataPtr->vectorRInverse.array())
                   * (oldSamplei - solSample[i])).matrix();
      genotValueVec += currentDataLocus * solSample[i];
      if (iteration % outputFreq == 0) {
        testGenotValueVec += currentTestLocus * solSample[i];
        if (wantWindowBV) {
          windowBV.col(dataPtr->WindowcM[i]) += currentDataLocus * solSample[i];
          cumSample[numberEffects].icolumn = i; // store samples
          cumSample[numberEffects].fSample = solSample[i];
        }
      }
      int istart = max(0, int(i - windowWidth));
      unsigned iend = min(i + windowWidth + 1, numberMarkers);
      for (unsigned j = istart; j < iend; j++)
        windowStatus[j] = 1.f; // record this window had a SNP fitted in the model
      numberEffects++;
    }
    else {
      // ISSUE C: RInverseY is still adjusted if the locus was sampled last iteration
      if (oldSamplei)
        RInverseY += ((currentDataLocus.array() * dataPtr->vectorRInverse.array())
                     * oldSamplei).matrix();
      sigma[i] = 0.0;
      solSample[i] = 0.0f;
      meanEffects[i]  -= meanEffects[i] * invThisIter;
      sdDelta1[i]     -= sdDelta1[i] * invThisIter;
      mlFreq[i]       -= mlFreq[i] * invThisIter;
      estLocusVars[i] -= estLocusVars[i] * invThisIter;
    }
  } // end of loop over markers (i is the loop variable)
  /*float*/    genVar = calc_variance(genotValueVec);
  /*float*/    meanGV += (genVar - meanGV) * invThisIter;
  /*VectorXf*/ mlFreqWindow += (windowStatus - mlFreqWindow) * invThisIter;
  /*VectorXf*/ windowStatus.setZero();
}
\end{verbatim}
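A note on the accumulation lines: assuming invThisIter is the reciprocal of
the current chain iteration count, updates of the form
\verb|meanEffects[i] += (solSample[i] - meanEffects[i]) * invThisIter| are the
usual incremental form of a running average,
\[
\bar{x}_n = \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n},
\]
so the posterior means are accumulated without storing the whole chain.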
The code is fairly straightforward; it is mainly a series of floating point
calculations, and it samples from a uniform distribution to determine whether
the locus effect gets a sample or not. The main barrier against
parallelization occurs where the comments in the code are marked ``ISSUE A'',
B, and C. At the beginning of each iteration the variable rhsModeli is
calculated from a dot product between the current locus data and RInverseY
(ISSUE A). But at ISSUE B, RInverseY is adjusted if the locus gets a sample.
Even if it does not get a sample, ISSUE C shows that RInverseY may still be
adjusted if the locus got a sample on the last iteration of the Markov Chain.

We spoke with Dorian Garrick, a contributor to the BayesC algorithm and
codebase, who told us that marker loci will only be sampled in about 1-5\% of
the iterations of the sample effects loop we are targeting with 50 thousand
loci, and that percentage drops to 0.01-0.05\% for 700 thousand. Our revised
algorithm to skirt around this issue follows:

\begin{verbatim}
sampleEffects() {
  // copies data from Eigen objects to the GPU
  copyDataToGPU();
  int sampleIndex = -1;
  bool adjusted = true;
  while (adjusted) {
    if (notFirstTime)
      copyRInverseYToGPU();
    // sampleIndex is the last locus that got a sample;
    // the kernel only processes loci after it
    launchSampleEffectsKernel(sampleIndex);
    copyDataFromGPU();
    adjusted = false;
    for (loci in [sampleIndex + 1 .. numberMarkers]) {
      if (hasSample(loci)) {
        adjustRInverseY();
        numberOfEffects++;
        sampleIndex = loci;
        adjusted = true;
        break;
      }
      else if (hasPreviousSample(loci)) {
        adjustRInverseY();
        sampleIndex = loci;
        adjusted = true;
        break;
      }
    }
  }
  copyDataFromGPU();
}
\end{verbatim}

The idea of the algorithm is still to launch a kernel to compute the effects
of each marker locus in parallel, but upon completion of each kernel it checks
whether RInverseY should have been adjusted for any of the loci. If so, it
adjusts RInverseY and then re-launches the kernel with the previously sampled
marker index so that earlier marker loci do not run again. This presents
obvious overhead: to calculate the effects for each marker, the algorithm
potentially does as many kernel launches as there are marker loci.
Additionally, time is wasted copying RInverseY to the GPU whenever it needs to
be adjusted. While the algorithm may run in parallel for some marker loci, the
cost is high for those iterations of the Markov Chain with more than a few
samples in the set, since each time the kernel returns all remaining marker
loci have to be checked for samples.
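As a rough sketch of the kind of per-locus kernel this host loop launches
(this is an illustration, not the actual GenSel kernel: the names
sampleEffectsKernel, fitted, and uniforms are hypothetical, the uniform random
numbers are assumed to be generated before the launch, and the statistics
updates from sampleEffects() are omitted), each thread handles one locus and
only records whether that locus would receive a sample:

\begin{verbatim}
// firstLocus corresponds to sampleIndex + 1 in the host loop above.
// X holds the genotype matrix column-major, numObs values per locus.
__global__ void sampleEffectsKernel(const float* X, const float* RInverseY,
                                    const float* diagLhs, const float* oldSample,
                                    const float* uniforms, int* fitted,
                                    int numObs, int numberMarkers, int firstLocus,
                                    float invVarResidual, float invVarEffects,
                                    float logVarEffects, float logDeltaZero,
                                    float logPiComp)
{
    int i = firstLocus + blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numberMarkers) return;

    // ISSUE A: every thread sees the RInverseY that was current at launch time
    const float* xi = X + (size_t)i * numObs;
    float rhs = 0.0f;
    for (int k = 0; k < numObs; ++k)
        rhs += xi[k] * RInverseY[k];
    rhs = (rhs + diagLhs[i] * oldSample[i]) * invVarResidual;

    float lhs = diagLhs[i] * invVarResidual + invVarEffects;
    float probDeltaOne = 1.0f / (1.0f + expf(logDeltaZero
                         + 0.5f * (logf(lhs) + logVarEffects - rhs * rhs / lhs)
                         - logPiComp));

    // flag the locus so the host can adjust RInverseY and re-launch from here
    fitted[i] = (uniforms[i] < probDeltaOne) ? 1 : 0;
}
\end{verbatim}

Because every thread reads the launch-time copy of RInverseY, the host loop
above is what detects the first locus whose sample invalidates that copy and
re-launches from there.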

\subsection{Parallelizing Linear Algebra using cuBLAS}

Our second tactic was that, rather than using a CPU-bound library for linear
algebra called Eigen, we instead used cuBLAS to do all of the data
manipulation. Using cuBLAS is fairly simple. For example, to parallelize a dot
product on two single-precision floating point arrays you can use the
cublasSdot() function. This function takes the two floating point arrays, the
size of the vectors, and a pointer for the result, as well as some helper
parameters. The obvious benefit of using functions such as this is that it
takes out the tedious work of defining a kernel for what would be a library
call on the CPU. For our working set we used cuBLAS to parallelize the dot
product in the sampleEffectsBayesC function, where the currentDataLocus needs
to be combined with the current RInverseY. The original code base did this
like so:

\begin{verbatim}
float rhsModeli = (currentDataLocus.dot(RInverseY)) + diagLhs(i) * oldSamplei;
\end{verbatim}

To do this in cuBLAS you can do the following:

\begin{verbatim}
cudaMalloc((void**)&devCurrentDataLocus, numObs*sizeof(float));
cudaMalloc((void**)&devRInverseY, numObs*sizeof(float));
cudaMemcpy(devCurrentDataLocus, currentDataLocus.data(), numObs*sizeof(float),
           cudaMemcpyHostToDevice);
cudaMemcpy(devRInverseY, RInverseY.data(), numObs*sizeof(float),
           cudaMemcpyHostToDevice);
float rhsModeli;
cublasSdot(cublasHandle, numObs, devCurrentDataLocus, 1, devRInverseY, 1, &rhsModeli);
rhsModeli += (diagLhs(i) * oldSamplei);
\end{verbatim}

You must also set up the handle that cuBLAS uses to access the GPU, which can
be done like so:

\begin{verbatim}
cublasHandle_t cublasHandle = 0;
cublasStatus_t cublasStatus;
cublasStatus = cublasCreate(&cublasHandle);
\end{verbatim}

Although in this example using cuBLAS looks more complicated, normally you
could not simply call dot() on a float array; that is only allowed above
because currentDataLocus and RInverseY are matvec objects. Similarly, in the
cuBLAS code we use the .data() function of the matvec vector to extract the
underlying float array.
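For completeness, the snippets above can be combined into one self-contained
helper. The function below is only a sketch of how such a wrapper might look
(the name cublasDotProduct is ours and error checking is omitted); unlike the
snippets above it also frees the device buffers and destroys the handle:

\begin{verbatim}
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper: dot product of two host float arrays via cuBLAS.
float cublasDotProduct(const float* hostX, const float* hostY, int n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    float *devX, *devY;
    cudaMalloc((void**)&devX, n * sizeof(float));
    cudaMalloc((void**)&devY, n * sizeof(float));
    cudaMemcpy(devX, hostX, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(devY, hostY, n * sizeof(float), cudaMemcpyHostToDevice);

    float result = 0.0f;
    // stride of 1 walks both vectors element by element
    cublasSdot(handle, n, devX, 1, devY, 1, &result);

    cudaFree(devX);
    cudaFree(devY);
    cublasDestroy(handle);
    return result;
}
\end{verbatim}

With such a helper the original line could be written as
\verb|rhsModeli = cublasDotProduct(currentDataLocus.data(), RInverseY.data(), numObs) + diagLhs(i)*oldSamplei;|
although, as the Results section shows, doing one small dot product this way
pays a heavy memory-transfer cost.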
\section{Results}

For testing purposes we used a system with an EVGA 560 Ti running CentOS 6.3.
The average run time for the linear BayesC algorithm was 20.689s.

\subsection{Sequential Algorithm}

The five most used methods of the sequential GenSel program follow:

\begin{verbatim}
Each sample counts as 0.01 seconds.
  %   cumulative    self               self     total
 time    seconds   seconds    calls  Ks/call  Ks/call  name
97.88    1396.70   1396.70    41000     0.00     0.00  Bayes::sampleEffectsBayesC()
 1.00    1411.00     14.30   137046     0.00     0.00  Eigen::internal::solve_retval<...>::evalTo(...) const
 0.42    1417.04      6.04        1     0.01     1.42  Bayes::singleSiteBayesC()
 0.32    1421.59      4.55        1     0.00     1.43  Bayes::setup(Options&, Data&, Data&)
 0.22    1424.78      3.19    41000     0.00     0.00  Bayes::sampleOther()
\end{verbatim}

As expected, the majority of the computation time is spent in
sampleEffectsBayesC, which takes over 23 minutes. The next most expensive
items are a matrix solve inside Eigen, the loop that comprises the Markov
Chain, setup, and a sampling loop similar to the main effect-sampling loop.

\subsection{Parallelizing across Loci}

Unfortunately our kernel parallelizing across loci does not yet work all the
way through. Currently it can complete seven iterations of the Markov Chain in
roughly 3 minutes on the CSC lab machines, which is orders of magnitude slower
than the sequential version, which can finish over 100 iterations of the
Markov Chain in under a second. We believe the main barrier to performance is
the numerous kernel launches, which slow the program down between checking for
samples, readjusting the RInverseY vector, and copying data across to the GPU
for another kernel launch.

\subsection{Parallelizing Linear Algebra}

When we attempted to make a dot product parallel using cublasSdot(), the
result was actually a significant slowdown. When changing the aforementioned
dot product to use the cuBLAS API, our execution time increased to 164.972s.
We analyzed our changes using nvprof to determine where the extra time was
coming from and discovered that 100\% of the measured GPU time was spent in
the host-to-device memory copy:

\begin{verbatim}
======== Profiling result:
Time(%)     Time  Calls    Avg    Min    Max  Name
 100.00    864ns      1  864ns  864ns  864ns  [CUDA memcpy HtoD]
   0.00      0ns      1    0ns    0ns    0ns  void dot_kernel(cublasDotParams)
   0.00      0ns      1    0ns    0ns    0ns  void reduce_1Block_kernel(float*, int, float*)
\end{verbatim}

This is an unusual result and must be investigated further. One possibility is
that, since the dot product is not the most computationally expensive
operation and the data set is small, the amount of time spent in the kernel is
simply too small for nvprof to measure. If this is the case, then the cost of
864ns for transferring memory does not pay off.

\section{Conclusion}

Our efforts to create a parallel version of the BayesC algorithm used in
GenSel have not been very successful so far.

Creating an entirely new kernel for the algorithm has been especially time
consuming and difficult. The complexity of the single-threaded application and
its many dependencies on third-party libraries that simply do not interface
well with the CUDA kernel-launching mechanisms have made this task difficult
to accomplish.

The cuBLAS libraries offered a simple-to-use way of making linear algebra
(such as the dot product example we explored) run in parallel on the many
cores of a GPU. These libraries, although useful, may not be applicable to our
project. As far as we can tell there are no large matrix multiplications, and
most of the calculation consists of dot products and element-wise array
multiplication. These calculations can be done in parallel quite easily and
very quickly; however, they are not very computationally expensive whether
they are done in parallel or not. In any case, it would appear that the cost
of moving memory from the CPU to the GPU (and back) was greater than the
benefit we received from doing the calculation on the GPU.

While we hoped that a kernel computing effects for each marker in parallel
would offer a substantial speedup, so far the results have not been there.
Similar to the problems with the linear algebra library, it may be that
copying the data to the GPU takes more time than computing each locus
sequentially. But there is definite room for optimization, and the algorithm
is not yet fully functioning, so it may be too soon to tell.

\end{document}