}
\end{verbatim}

The code is fairly straightforward; it is mainly a series of floating point calculations, and it samples from a random function to determine whether the locus gets a sample or not. The main barrier against parallelization occurs where the comments in the code are marked ``ISSUE A,'' ``ISSUE B,'' and ``ISSUE C.'' At the beginning of each iteration, rhsModeli is calculated from a dot product between the current locus data and RInverseY (ISSUE A). But at ``ISSUE B,'' RInverseY is adjusted if the locus gets a sample. Even if it does not get a sample, ``ISSUE C'' highlights that RInverseY may still be adjusted if the locus got a sample on the previous iteration of the Markov Chain.

We spoke with Dorian Garrick, a contributor to the Bayes C algorithm and codebase, who told us that the marker loci will only be sampled in about 1--5\% of the iterations of the sample effects loop we are targeting when there are 50 thousand loci, and that the percentage drops to 0.01--0.05\% for 700 thousand. Our revised pseudocode algorithm to skirt around this issue follows:

\begin{verbatim}
sampleEffects() {
    // copy data from Eigen objects to the GPU
    copyDataToGPU();
    int sampleIndex = -1;

    while (true) {
        if (notFirstTime)
            copyRInverseYToGPU();
        // sampleIndex is the index of the locus that last got a sample;
        // the kernel only runs threads for loci after it
        launchSampleEffectsKernel(sampleIndex);
        copyDataFromGPU();

        bool adjusted = false;
        for (loci in [sampleIndex + 1 .. numberMarkers]) {
            if (hasSample(loci)) {
                adjustRInverseY();
                numberOfEffects++;
                sampleIndex = loci;
                adjusted = true;
                break;
            }
            else if (hasPreviousSample(loci)) {
                adjustRInverseY();
                sampleIndex = loci;
                adjusted = true;
                break;
            }
        }
        // no remaining locus needed RInverseY adjusted, so every
        // effect has been sampled and the loop can end
        if (!adjusted)
            break;
    }
}
\end{verbatim}

The idea of the algorithm is still to launch a kernel that computes the effects of the marker loci in parallel, but upon completion of each kernel the host checks whether RInverseY should have been adjusted for any of the loci. If so, it adjusts RInverseY and re-launches the kernel with the index of the previously sampled marker so that earlier marker loci do not run again. This presents obvious overhead: to calculate the effects for every marker, the algorithm potentially performs as many kernel launches as there are marker loci. Additional time is wasted copying RInverseY back to the GPU whenever it is adjusted. So while the algorithm may run in parallel for some marker loci, the cost is high for those iterations of the Markov Chain with more than a few samples in the set.
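To make the relaunch scheme concrete, below is a minimal CUDA sketch of how the host-side loop and the kernel launch could fit together. It is not the project's implementation: the kernel body is stubbed out, and names such as sampleEffectsKernel, d\_RInverseY, d\_hasSample, and nRecords are hypothetical placeholders for the real Bayes C data structures.

\begin{verbatim}
// Minimal sketch of the relaunch scheme (hypothetical names, not the
// project's actual code).
#include <cuda_runtime.h>

__global__ void sampleEffectsKernel(int firstLocus, int numberMarkers,
                                    const double* d_RInverseY,
                                    int* d_hasSample)
{
    int locus = firstLocus + blockIdx.x * blockDim.x + threadIdx.x;
    if (locus >= numberMarkers) return;
    // per-locus floating point work and the accept/reject draw would go
    // here; d_hasSample[locus] records whether the locus gets a sample
}

void sampleEffectsHost(int numberMarkers, int nRecords,
                       double* h_RInverseY, double* d_RInverseY,
                       int* h_hasSample, int* d_hasSample)
{
    int sampleIndex = -1;
    bool firstLaunch = true;

    while (sampleIndex + 1 < numberMarkers) {
        if (!firstLaunch)   // RInverseY was adjusted on the host
            cudaMemcpy(d_RInverseY, h_RInverseY,
                       nRecords * sizeof(double), cudaMemcpyHostToDevice);
        firstLaunch = false;

        int remaining = numberMarkers - (sampleIndex + 1);
        int threads = 256;
        int blocks = (remaining + threads - 1) / threads;
        sampleEffectsKernel<<<blocks, threads>>>(sampleIndex + 1,
                                                 numberMarkers,
                                                 d_RInverseY, d_hasSample);
        cudaMemcpy(h_hasSample, d_hasSample,
                   numberMarkers * sizeof(int), cudaMemcpyDeviceToHost);

        // scan the loci the kernel just processed for the first one that
        // needs RInverseY adjusted, then relaunch after that locus
        int next = -1;
        for (int locus = sampleIndex + 1; locus < numberMarkers; ++locus)
            if (h_hasSample[locus]) { next = locus; break; }

        if (next < 0) break;   // nothing to adjust: all effects are done
        // adjustRInverseY(h_RInverseY, next) would run here on the host
        sampleIndex = next;
    }
}
\end{verbatim}

The scan over h\_hasSample mirrors the for loop in the pseudocode: each relaunch starts one locus past the most recent sample, so loci that already have their effects are never recomputed.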
Each time the kernel returns, all remaining marker loci have to be checked for samples.

\subsection{Parallelizing Linear Algebra using cuBLAS}
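As a point of reference for this subsection, the following is a small illustrative sketch, not taken from the codebase, of how a per-locus dot product such as the one behind rhsModeli (ISSUE A) can be expressed with cuBLAS; the cublasHandle\_t and the device pointers d\_locusData and d\_RInverseY are assumed to be set up elsewhere.

\begin{verbatim}
// Hypothetical sketch: computing one locus's rhsModel term with cuBLAS.
#include <cublas_v2.h>

double rhsModelFor(cublasHandle_t handle, const double* d_locusData,
                   const double* d_RInverseY, int nRecords)
{
    double rhsModeli = 0.0;
    // dot product of the current locus column with RInverseY, computed
    // on the GPU; the scalar result is written back to host memory
    cublasDdot(handle, nRecords, d_locusData, 1, d_RInverseY, 1, &rhsModeli);
    return rhsModeli;
}
\end{verbatim}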