Ben Hirsch edited CUDA\_Implementation.tex about 11 years ago (commit e77f2311d7d5387001bb46c7daaba65615e87155).
...
}
\end{verbatim}
The code is fairly straightforward; it is mainly a series of floating-point calculations, and it samples from a random function to determine whether or not the locus gets a sample. The main barrier against parallelization takes place where the comments in the code are marked ``ISSUE A,'' ``ISSUE B,'' and ``ISSUE C.'' At the beginning of each iteration, the variable rhsModeli is calculated from a dot product between the current locus effect and RInverseY (ISSUE A). But at ``ISSUE B,'' RInverseY is adjusted if the locus gets a sample. Even if it does not get a sample, ``ISSUE C'' highlights that RInverseY may still be adjusted if the locus previously got a sample on the last iteration of the Markov Chain.
We spoke with Dorian Garrick, a contributor to the Bayes C algorithm and codebase, who told us that the marker loci will only be sampled in about 1--5\% of the iterations of the sample effects loop we are targeting with 50 thousand loci, and that percentage drops to 0.01--0.05\% for 700 thousand. Our revised pseudocode algorithm to skirt around this issue follows:
\begin{verbatim}
sampleEffects() {
    // copy data from Eigen objects to the GPU
    copyDataToGPU();
    // sampleIndex is the last locus that got a sample (-1: none yet)
    int sampleIndex = -1;
    while (true) {
        if (notFirstTime)
            copyRInverseYToGPU();
        launchSampleEffectsKernel(sampleIndex);
        copyDataFromGPU();
        foundSample = false;
        for (loci in [sampleIndex + 1 .. numberMarkers]) {
            if (hasSample(loci)) {
                adjustRInverseY();
                numberOfEffects++;
                sampleIndex = loci;
                foundSample = true;
                break;
            }
            else if (hasPreviousSample(loci)) {
                adjustRInverseY();
                sampleIndex = loci;
                foundSample = true;
                break;
            }
        }
        if (!foundSample)
            break;  // no remaining locus needs an adjustment; done
    }
    copyDataFromGPU();
}
\end{verbatim}
The idea of the algorithm is still to launch a kernel to compute the effects of the marker loci in parallel, but upon completion of each kernel it checks to see whether RInverseY should have been adjusted for any of the loci. If so, it adjusts RInverseY and re-launches the kernel with the previously sampled marker index so that earlier marker loci do not run again. This presents obvious overhead: to calculate the effects for every marker, the algorithm potentially performs as many kernel launches as there are marker loci. Additionally, more time is wasted copying RInverseY to the GPU whenever it needs to be adjusted. While the algorithm may run in parallel for some marker loci, the cost is high for those iterations of the Markov Chain with more than a few samples in the set, since each time the kernel returns all remaining marker loci have to be checked for samples.
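The launch count can be seen with a toy host-side simulation of the re-launch pattern. This is a hypothetical sketch with made-up names, not the project's code: with $s$ loci forcing an RInverseY adjustment, the kernel is launched $s + 1$ times.

```cpp
#include <vector>

// Toy simulation of the re-launch loop: counts how many kernel launches
// the revised algorithm performs when needsAdjust[j] marks the loci whose
// samples force an RInverseY adjustment. Hypothetical sketch only.
int countKernelLaunches(const std::vector<bool>& needsAdjust) {
    int launches = 0;
    int sampleIndex = -1;  // last locus that forced an adjustment
    const int n = static_cast<int>(needsAdjust.size());
    while (true) {
        ++launches;        // kernel covers loci sampleIndex+1 .. n-1
        bool found = false;
        for (int j = sampleIndex + 1; j < n; ++j) {
            if (needsAdjust[j]) {  // adjust RInverseY, then re-launch
                sampleIndex = j;
                found = true;
                break;
            }
        }
        if (!found)
            break;         // no adjustments left: done
    }
    return launches;
}
```

At a 1--5\% sample rate over 50 thousand loci, that is on the order of 500--2,500 extra launches (and RInverseY copies) per Markov Chain iteration, which is the overhead described above.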
\subsection{Parallelizing Linear Algebra using cuBLAS}