}
\end{verbatim}

The code is fairly straightforward; it is mainly a series of floating point calculations, and it samples from a random function to determine whether the locus gets a sample or not. The main barrier against parallelization occurs where the comments in the code are marked ``ISSUE A,'' ``ISSUE B,'' and ``ISSUE C.'' At the beginning of each iteration, rhsModeli is calculated from a dot product between the current locus data and RInverseY (ISSUE A). But at ``ISSUE B,'' RInverseY is adjusted if the locus gets a sample. Even if it does not get a sample, ``ISSUE C'' highlights that RInverseY may still be adjusted if the locus got a sample on the previous iteration of the Markov Chain.

We spoke with Dorian Garrick, a contributor to the Bayes C algorithm and codebase, who told us that the marker loci will only be sampled in about 1--5\% of the iterations of the sample effects loop we are targeting when there are 50 thousand loci, and that the percentage drops to 0.01--0.05\% for 700 thousand. Our revised pseudocode algorithm to skirt around this issue follows:

\begin{verbatim}
sampleEffects() {
    // copy data from Eigen objects to the GPU
    copyDataToGPU();
    int sampleIndex = -1;

    while (true) {
        if (notFirstTime)
            copyRInverseYToGPU();
        // sampleIndex is the index of the locus that last got a sample;
        // the kernel only runs threads for loci after it
        launchSampleEffectsKernel(sampleIndex);
        copyDataFromGPU();

        bool adjusted = false;
        for (loci in [sampleIndex + 1 .. numberMarkers]) {
            if (hasSample(loci)) {
                adjustRInverseY();
                numberOfEffects++;
                sampleIndex = loci;
                adjusted = true;
                break;
            }
            else if (hasPreviousSample(loci)) {
                adjustRInverseY();
                sampleIndex = loci;
                adjusted = true;
                break;
            }
        }
        // no remaining locus needed RInverseY adjusted, so every
        // effect has been sampled and the loop can end
        if (!adjusted)
            break;
    }
}
\end{verbatim}

The idea of the algorithm is still to launch a kernel that computes the effects of the marker loci in parallel, but upon completion of each kernel the host checks whether RInverseY should have been adjusted for any of the loci. If so, it adjusts RInverseY and re-launches the kernel with the index of the previously sampled marker so that earlier marker loci do not run again. This presents obvious overhead: to calculate the effects for every marker, the algorithm potentially performs as many kernel launches as there are marker loci. Additional time is wasted copying RInverseY back to the GPU whenever it is adjusted. So while the algorithm may run in parallel for some marker loci, the cost is high for those iterations of the Markov Chain with more than a few samples in the set.
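To make the relaunch scheme concrete, below is a minimal CUDA sketch of how the host-side loop and the kernel launch could fit together. It is not the project's implementation: the kernel body is stubbed out, and names such as sampleEffectsKernel, d\_RInverseY, d\_hasSample, and nRecords are hypothetical placeholders for the real Bayes C data structures.

\begin{verbatim}
// Minimal sketch of the relaunch scheme (hypothetical names, not the
// project's actual code).
#include <cuda_runtime.h>

__global__ void sampleEffectsKernel(int firstLocus, int numberMarkers,
                                    const double* d_RInverseY,
                                    int* d_hasSample)
{
    int locus = firstLocus + blockIdx.x * blockDim.x + threadIdx.x;
    if (locus >= numberMarkers) return;
    // per-locus floating point work and the accept/reject draw would go
    // here; d_hasSample[locus] records whether the locus gets a sample
}

void sampleEffectsHost(int numberMarkers, int nRecords,
                       double* h_RInverseY, double* d_RInverseY,
                       int* h_hasSample, int* d_hasSample)
{
    int sampleIndex = -1;
    bool firstLaunch = true;

    while (sampleIndex + 1 < numberMarkers) {
        if (!firstLaunch)   // RInverseY was adjusted on the host
            cudaMemcpy(d_RInverseY, h_RInverseY,
                       nRecords * sizeof(double), cudaMemcpyHostToDevice);
        firstLaunch = false;

        int remaining = numberMarkers - (sampleIndex + 1);
        int threads = 256;
        int blocks = (remaining + threads - 1) / threads;
        sampleEffectsKernel<<<blocks, threads>>>(sampleIndex + 1,
                                                 numberMarkers,
                                                 d_RInverseY, d_hasSample);
        cudaMemcpy(h_hasSample, d_hasSample,
                   numberMarkers * sizeof(int), cudaMemcpyDeviceToHost);

        // scan the loci the kernel just processed for the first one that
        // needs RInverseY adjusted, then relaunch after that locus
        int next = -1;
        for (int locus = sampleIndex + 1; locus < numberMarkers; ++locus)
            if (h_hasSample[locus]) { next = locus; break; }

        if (next < 0) break;   // nothing to adjust: all effects are done
        // adjustRInverseY(h_RInverseY, next) would run here on the host
        sampleIndex = next;
    }
}
\end{verbatim}

The scan over h\_hasSample mirrors the for loop in the pseudocode: each relaunch starts one locus past the most recent sample, so loci that already have their effects are never recomputed.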
Each time the kernel returns, all remaining marker loci have to be checked for samples.

\subsection{Parallelizing Linear Algebra using cuBLAS}
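As a point of reference for this subsection, the following is a small illustrative sketch, not taken from the codebase, of how a per-locus dot product such as the one behind rhsModeli (ISSUE A) can be expressed with cuBLAS; the cublasHandle\_t and the device pointers d\_locusData and d\_RInverseY are assumed to be set up elsewhere.

\begin{verbatim}
// Hypothetical sketch: computing one locus's rhsModel term with cuBLAS.
#include <cublas_v2.h>

double rhsModelFor(cublasHandle_t handle, const double* d_locusData,
                   const double* d_RInverseY, int nRecords)
{
    double rhsModeli = 0.0;
    // dot product of the current locus column with RInverseY, computed
    // on the GPU; the scalar result is written back to host memory
    cublasDdot(handle, nRecords, d_locusData, 1, d_RInverseY, 1, &rhsModeli);
    return rhsModeli;
}
\end{verbatim}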