\subsection{Bayesian Estimation of SNP Effects}

Marker effects were estimated using the BayesC algorithm presented by
Kizilkaya et al.\ \cite{kizilkaya}. The inputs for the algorithm are a list of
marker loci and the potential phenotypes that they represent. A matrix of
daughter yield deviation vectors is formed in which each row is the vector
for animal $i$: the effect corrections (the adjusted effects caused by each
locus for animal $i$) plus the background gene effects, the environmental
effects, and the QTL effect sizes for each parent. The goal is to calculate
the joint posterior probability density of the unknown parameters mentioned
previously (every vector except the per-locus effects for each animal). The
posterior probability is the probability that an event occurs given the
evidence observed so far; BayesC estimates how probable it is that the
parameters affect the given phenotypes by combining prior information about
how the phenotypes would be affected with the likelihood of the parameters
occurring. The difficulty with computing these probabilities is that there is
often no closed form for them, or evaluating one would be infeasible because
it requires integrating over many dimensions. Instead, samples are drawn from
the probability distribution. The sampling starts from a valid starting point
and draws samples from the distribution one parameter (or locus) at a time,
each draw conditioned on the current values of the others. This is called
Gibbs sampling, and by iterating a large number of times it forms a Markov
Chain whose stationary distribution estimates the probability distribution of
the parameters' effects.
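In more formal terms, if $\theta$ collects the unknown parameters (marker
effects, variances, and so on) and $y$ denotes the phenotypes, Bayes' rule gives
\[
p(\theta \mid y) \;=\; \frac{p(y \mid \theta)\,p(\theta)}{p(y)}
\;\propto\; p(y \mid \theta)\,p(\theta),
\]
and the normalizing integral $p(y)$ is exactly the quantity that has no
convenient closed form here. Gibbs sampling avoids it by repeatedly drawing
each parameter from its full conditional $p(\theta_j \mid \theta_{-j}, y)$;
the sequence of draws is the Markov Chain referred to above.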

\section{CUDA Implementation}

We chose two different tactics for our CUDA implementation.

\subsection{Parallelizing across Loci}

The first tactic was to write our own kernel. This consists of identifying
sections of code (often loops) where multiple sets of data are used with the
same code. To effectively utilize the multi-thread model, each thread should
execute without dependencies on the others. We believed the area with the
greatest potential for speedup on the GPU was the loop sampling the effects
for each marker locus, which might run with 50 thousand to 700 thousand
markers. The code for this loop follows:

\begin{verbatim}
void sampleEffects() {
  Eigen::Map<Eigen::VectorXf> currentDataLocus(genotypeDataPointer, numObs);
  Eigen::Map<Eigen::VectorXf> currentTestLocus(genotypeTestPointer, testObs);
  for (unsigned i = 0; i < numberMarkers; i++)
  {
    new (&currentDataLocus) Eigen::Map<Eigen::VectorXf>(&dataPtr->X(0, i), numObs);
    new (&currentTestLocus) Eigen::Map<Eigen::VectorXf>(&testPtr->X(0, i), testObs);
    float oldSamplei = solSample[i];
    // ISSUE A: rhsModeli depends on the current RInverseY
    float rhsModeli = (currentDataLocus.dot(RInverseY)) + diagLhs(i) * oldSamplei;
    rhsModeli *= invVarResidual;
    float Lhs = diagLhs[i] * invVarResidual + invVarEffects;
    float probDeltaOne = 1.0f / (1.0f + exp(logDeltaZero + 0.5f * (log(Lhs)
                         + logVarEffects - rhsModeli*rhsModeli/Lhs) - logPiComp));
    if (matvec::ranf() < probDeltaOne) {
      sigma[i] = varEffects; // sigma[i] is used in sampleCommonSigma
      solSample[i] = matvec::snorm() * sqrt(1.0f/Lhs) + rhsModeli/Lhs;
      meanEffects[i]  += (solSample[i] - meanEffects[i]) * invThisIter;
      sdDelta1[i]     += (solSample[i] * solSample[i] - sdDelta1[i]) * invThisIter;
      mlFreq[i]       += (1.0f - mlFreq[i]) * invThisIter;
      estLocusVars[i] += (varEffects - estLocusVars[i]) * invThisIter;
      // ISSUE B: RInverseY is adjusted when the locus gets a sample
      RInverseY += ((currentDataLocus.array() * dataPtr->vectorRInverse.array())
                   * (oldSamplei - solSample[i])).matrix();
      genotValueVec += currentDataLocus * solSample[i];
      if (iteration % outputFreq == 0) {
        testGenotValueVec += currentTestLocus * solSample[i];
        if (wantWindowBV) {
          windowBV.col(dataPtr->WindowcM[i]) += currentDataLocus * solSample[i];
          cumSample[numberEffects].icolumn = i; // store samples
          cumSample[numberEffects].fSample = solSample[i];
        }
      }
      int istart = max(0, int(i - windowWidth));
      unsigned iend = min(i + windowWidth + 1, numberMarkers);
      for (unsigned j = istart; j < iend; j++)
        windowStatus[j] = 1.f; // record this window had a SNP fitted in the model
      numberEffects++;
    }
    else {
      // ISSUE C: RInverseY is still adjusted if the locus was sampled last iteration
      if (oldSamplei)
        RInverseY += ((currentDataLocus.array() * dataPtr->vectorRInverse.array())
                     * oldSamplei).matrix();
      sigma[i] = 0.0;
      solSample[i] = 0.0f;
      meanEffects[i]  -= meanEffects[i] * invThisIter;
      sdDelta1[i]     -= sdDelta1[i] * invThisIter;
      mlFreq[i]       -= mlFreq[i] * invThisIter;
      estLocusVars[i] -= estLocusVars[i] * invThisIter;
    }
  } // end of loop over markers (i is the loop variable)
  /*float*/    genVar = calc_variance(genotValueVec);
  /*float*/    meanGV += (genVar - meanGV) * invThisIter;
  /*VectorXf*/ mlFreqWindow += (windowStatus - mlFreqWindow) * invThisIter;
  /*VectorXf*/ windowStatus.setZero();
}
\end{verbatim}
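A note on the accumulation lines: assuming invThisIter is the reciprocal of
the current chain iteration count, updates of the form
\verb|meanEffects[i] += (solSample[i] - meanEffects[i]) * invThisIter| are the
usual incremental form of a running average,
\[
\bar{x}_n = \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n},
\]
so the posterior means are accumulated without storing the whole chain.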
The code is fairly straightforward; it is mainly a series of floating point
calculations, and it samples from a uniform distribution to determine whether
the locus effect gets a sample or not. The main barrier against
parallelization occurs where the comments in the code are marked ``ISSUE A'',
B, and C. At the beginning of each iteration the variable rhsModeli is
calculated from a dot product between the current locus data and RInverseY
(ISSUE A). But at ISSUE B, RInverseY is adjusted if the locus gets a sample.
Even if it does not get a sample, ISSUE C shows that RInverseY may still be
adjusted if the locus got a sample on the last iteration of the Markov Chain.

We spoke with Dorian Garrick, a contributor to the BayesC algorithm and
codebase, who told us that marker loci will only be sampled in about 1-5\% of
the iterations of the sample effects loop we are targeting with 50 thousand
loci, and that percentage drops to 0.01-0.05\% for 700 thousand. Our revised
algorithm to skirt around this issue follows:

\begin{verbatim}
sampleEffects() {
  // copies data from Eigen objects to the GPU
  copyDataToGPU();
  int sampleIndex = -1;
  bool adjusted = true;
  while (adjusted) {
    if (notFirstTime)
      copyRInverseYToGPU();
    // sampleIndex is the last locus that got a sample;
    // the kernel only processes loci after it
    launchSampleEffectsKernel(sampleIndex);
    copyDataFromGPU();
    adjusted = false;
    for (loci in [sampleIndex + 1 .. numberMarkers]) {
      if (hasSample(loci)) {
        adjustRInverseY();
        numberOfEffects++;
        sampleIndex = loci;
        adjusted = true;
        break;
      }
      else if (hasPreviousSample(loci)) {
        adjustRInverseY();
        sampleIndex = loci;
        adjusted = true;
        break;
      }
    }
  }
  copyDataFromGPU();
}
\end{verbatim}

The idea of the algorithm is still to launch a kernel to compute the effects
of each marker locus in parallel, but upon completion of each kernel it checks
whether RInverseY should have been adjusted for any of the loci. If so, it
adjusts RInverseY and then re-launches the kernel with the previously sampled
marker index so that earlier marker loci do not run again. This presents
obvious overhead: to calculate the effects for each marker, the algorithm
potentially does as many kernel launches as there are marker loci.
Additionally, time is wasted copying RInverseY to the GPU whenever it needs to
be adjusted. While the algorithm may run in parallel for some marker loci, the
cost is high for those iterations of the Markov Chain with more than a few
samples in the set, since each time the kernel returns all remaining marker
loci have to be checked for samples.
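As a rough sketch of the kind of per-locus kernel this host loop launches
(this is an illustration, not the actual GenSel kernel: the names
sampleEffectsKernel, fitted, and uniforms are hypothetical, the uniform random
numbers are assumed to be generated before the launch, and the statistics
updates from sampleEffects() are omitted), each thread handles one locus and
only records whether that locus would receive a sample:

\begin{verbatim}
// firstLocus corresponds to sampleIndex + 1 in the host loop above.
// X holds the genotype matrix column-major, numObs values per locus.
__global__ void sampleEffectsKernel(const float* X, const float* RInverseY,
                                    const float* diagLhs, const float* oldSample,
                                    const float* uniforms, int* fitted,
                                    int numObs, int numberMarkers, int firstLocus,
                                    float invVarResidual, float invVarEffects,
                                    float logVarEffects, float logDeltaZero,
                                    float logPiComp)
{
    int i = firstLocus + blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numberMarkers) return;

    // ISSUE A: every thread sees the RInverseY that was current at launch time
    const float* xi = X + (size_t)i * numObs;
    float rhs = 0.0f;
    for (int k = 0; k < numObs; ++k)
        rhs += xi[k] * RInverseY[k];
    rhs = (rhs + diagLhs[i] * oldSample[i]) * invVarResidual;

    float lhs = diagLhs[i] * invVarResidual + invVarEffects;
    float probDeltaOne = 1.0f / (1.0f + expf(logDeltaZero
                         + 0.5f * (logf(lhs) + logVarEffects - rhs * rhs / lhs)
                         - logPiComp));

    // flag the locus so the host can adjust RInverseY and re-launch from here
    fitted[i] = (uniforms[i] < probDeltaOne) ? 1 : 0;
}
\end{verbatim}

Because every thread reads the launch-time copy of RInverseY, the host loop
above is what detects the first locus whose sample invalidates that copy and
re-launches from there.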

\subsection{Parallelizing Linear Algebra using cuBLAS}

Our second tactic was that, rather than using a CPU-bound library for linear
algebra called Eigen, we instead used cuBLAS to do all of the data
manipulation. Using cuBLAS is fairly simple. For example, to parallelize a dot
product on two single-precision floating point arrays you can use the
cublasSdot() function. This function takes the two floating point arrays, the
size of the vectors, and a pointer for the result, as well as some helper
parameters. The obvious benefit of using functions such as this is that it
takes out the tedious work of defining a kernel for what would be a library
call on the CPU. For our working set we used cuBLAS to parallelize the dot
product in the sampleEffectsBayesC function, where the currentDataLocus needs
to be combined with the current RInverseY. The original code base did this
like so:

\begin{verbatim}
float rhsModeli = (currentDataLocus.dot(RInverseY)) + diagLhs(i) * oldSamplei;
\end{verbatim}

To do this in cuBLAS you can do the following:

\begin{verbatim}
cudaMalloc((void**)&devCurrentDataLocus, numObs*sizeof(float));
cudaMalloc((void**)&devRInverseY, numObs*sizeof(float));
cudaMemcpy(devCurrentDataLocus, currentDataLocus.data(), numObs*sizeof(float),
           cudaMemcpyHostToDevice);
cudaMemcpy(devRInverseY, RInverseY.data(), numObs*sizeof(float),
           cudaMemcpyHostToDevice);
float rhsModeli;
cublasSdot(cublasHandle, numObs, devCurrentDataLocus, 1, devRInverseY, 1, &rhsModeli);
rhsModeli += (diagLhs(i) * oldSamplei);
\end{verbatim}

You must also set up the handle that cuBLAS uses to access the GPU, which can
be done like so:

\begin{verbatim}
cublasHandle_t cublasHandle = 0;
cublasStatus_t cublasStatus;
cublasStatus = cublasCreate(&cublasHandle);
\end{verbatim}

Although in this example using cuBLAS looks more complicated, normally you
could not simply call dot() on a float array; that is only allowed above
because currentDataLocus and RInverseY are matvec objects. Similarly, in the
cuBLAS code we use the .data() function of the matvec vector to extract the
underlying float array.
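For completeness, the snippets above can be combined into one self-contained
helper. The function below is only a sketch of how such a wrapper might look
(the name cublasDotProduct is ours and error checking is omitted); unlike the
snippets above it also frees the device buffers and destroys the handle:

\begin{verbatim}
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper: dot product of two host float arrays via cuBLAS.
float cublasDotProduct(const float* hostX, const float* hostY, int n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    float *devX, *devY;
    cudaMalloc((void**)&devX, n * sizeof(float));
    cudaMalloc((void**)&devY, n * sizeof(float));
    cudaMemcpy(devX, hostX, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(devY, hostY, n * sizeof(float), cudaMemcpyHostToDevice);

    float result = 0.0f;
    // stride of 1 walks both vectors element by element
    cublasSdot(handle, n, devX, 1, devY, 1, &result);

    cudaFree(devX);
    cudaFree(devY);
    cublasDestroy(handle);
    return result;
}
\end{verbatim}

With such a helper the original line could be written as
\verb|rhsModeli = cublasDotProduct(currentDataLocus.data(), RInverseY.data(), numObs) + diagLhs(i)*oldSamplei;|
although, as the Results section shows, doing one small dot product this way
pays a heavy memory-transfer cost.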
\section{Results}

For testing purposes we used a system with an EVGA 560 Ti running CentOS 6.3.
The average run time for the linear BayesC algorithm was 20.689s.

\subsection{Sequential Algorithm}

The five most used methods of the sequential GenSel program follow:

\begin{verbatim}
Each sample counts as 0.01 seconds.
  %   cumulative    self               self     total
 time    seconds   seconds    calls  Ks/call  Ks/call  name
97.88    1396.70   1396.70    41000     0.00     0.00  Bayes::sampleEffectsBayesC()
 1.00    1411.00     14.30   137046     0.00     0.00  Eigen::internal::solve_retval<...>::evalTo(...) const
 0.42    1417.04      6.04        1     0.01     1.42  Bayes::singleSiteBayesC()
 0.32    1421.59      4.55        1     0.00     1.43  Bayes::setup(Options&, Data&, Data&)
 0.22    1424.78      3.19    41000     0.00     0.00  Bayes::sampleOther()
\end{verbatim}

As expected, the majority of the computation time is spent in
sampleEffectsBayesC, which takes over 23 minutes. The next most expensive
items are a matrix solve inside Eigen, the loop that comprises the Markov
Chain, setup, and a sampling loop similar to the main effect-sampling loop.

\subsection{Parallelizing across Loci}

Unfortunately our kernel parallelizing across loci does not yet work all the
way through. Currently it can complete seven iterations of the Markov Chain in
roughly 3 minutes on the CSC lab machines, which is orders of magnitude slower
than the sequential version, which can finish over 100 iterations of the
Markov Chain in under a second. We believe the main barrier to performance is
the numerous kernel launches, which slow the program down between checking for
samples, readjusting the RInverseY vector, and copying data across to the GPU
for another kernel launch.

\subsection{Parallelizing Linear Algebra}

When we attempted to make a dot product parallel using cublasSdot(), the
result was actually a significant slowdown. When changing the aforementioned
dot product to use the cuBLAS API, our execution time increased to 164.972s.
We analyzed our changes using nvprof to determine where the extra time was
coming from and discovered that 100\% of the measured GPU time was spent in
the host-to-device memory copy:

\begin{verbatim}
======== Profiling result:
Time(%)     Time  Calls    Avg    Min    Max  Name
 100.00    864ns      1  864ns  864ns  864ns  [CUDA memcpy HtoD]
   0.00      0ns      1    0ns    0ns    0ns  void dot_kernel(cublasDotParams)
   0.00      0ns      1    0ns    0ns    0ns  void reduce_1Block_kernel(float*, int, float*)
\end{verbatim}

This is an unusual result and must be investigated further. One possibility is
that, since the dot product is not the most computationally expensive
operation and the data set is small, the amount of time spent in the kernel is
simply too small for nvprof to measure. If this is the case, then the cost of
864ns for transferring memory does not pay off.

\section{Conclusion}

Our efforts to create a parallel version of the BayesC algorithm used in
GenSel have not been very successful so far.

Creating an entirely new kernel for the algorithm has been especially time
consuming and difficult. The complexity of the single-threaded application and
its many dependencies on third-party libraries that simply do not interface
well with the CUDA kernel-launching mechanisms have made this task difficult
to accomplish.

The cuBLAS libraries offered a simple-to-use way of making linear algebra
(such as the dot product example we explored) run in parallel on the many
cores of a GPU. These libraries, although useful, may not be applicable to our
project. As far as we can tell there are no large matrix multiplications, and
most of the calculation consists of dot products and element-wise array
multiplication. These calculations can be done in parallel quite easily and
very quickly; however, they are not very computationally expensive whether
they are done in parallel or not. In any case, it would appear that the cost
of moving memory from the CPU to the GPU (and back) was greater than the
benefit we received from doing the calculation on the GPU.

While we hoped that a kernel computing effects for each marker in parallel
would offer a substantial speedup, so far the results have not been there.
Similar to the problems with the linear algebra library, it may be that
copying the data to the GPU takes more time than computing each locus
sequentially. But there is definite room for optimization, and the algorithm
is not yet fully functioning, so it may be too soon to tell.

\end{document}