Ben Hirsch generating latex version of article
Commit id: 124d82820deddefcaad755e0770fb0bd40df9b5a
diff --git a/Final Paper.tex b/Final Paper.tex
index 08e23d8..b89eb99 100644
--- a/Final Paper.tex
+++ b/Final Paper.tex
...
\subsection{Bayesian Estimation of SNP Effects}
Marker effects were estimated using the BayesC algorithm presented by
Kizilkaya et al.~\cite{kizilkaya}.
A high-level overview of the algorithm and its inputs follows.
\subsection{Algorithm Overview}
The inputs to the algorithm are a list of marker loci and the potential
phenotypes that they represent. A matrix of daughter yield deviation
vectors is formed, where the row for animal i is the vector of effect
corrections (the adjusted effects caused by each locus for animal i),
plus the background gene effects, plus the environmental effects, plus
the size of the QTL effects for each parent. The algorithm then proceeds
as follows:
\begin{enumerate}
\item
For each i in {[}1..chainLength{]}:
\begin{enumerate}
\item
Sample the residual variance
\item
Sample the intercept
\item
For each j in {[}0..numberOfLoci{]}:
\begin{enumerate}
\item
Adjust the phenotypes for the current locus j
\item
Calculate the variance for the current locus
\item
Sample from a uniform distribution
\item
If the sampled value is less than the probability that the locus has an
effect: the locus had a significant effect; calculate its variance
parameters and record that it was fitted
\item
Else: record that the locus does not fit
\end{enumerate}
\item
Sample the locus effect variance
\item
Accumulate the posterior mean of the probability distribution
\end{enumerate}
\end{enumerate}
The goal is to calculate the joint posterior probability density of the
unknown parameters mentioned in the algorithm above (every vector except
the effects caused by each locus for each animal). The posterior
probability is the probability that an event occurs given recent
evidence; BayesC estimates the probability that the parameters affect
the given phenotypes by combining prior information about how the
phenotypes would be affected with the likelihood of the parameters
occurring. The issue with computing these probabilities is that there is
often no closed form for them, or computing them would be infeasible
because it requires integrating over multiple dimensions. So instead,
samples are drawn from the probability distribution. This sampling
starts from a valid starting point and draws each parameter (or locus
effect) in turn from its distribution conditional on the current values
of the others, so each draw takes into account the parameters drawn
before it. This is called Gibbs sampling, and by iterating a large
number of times it forms a Markov Chain with a stationary distribution
estimating the probability function of the parameters' effects.
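To make the sampling flow concrete, the sketch below restates the steps
above as code. It is only an illustration of the Gibbs structure, not
the GenSel implementation; the helper functions (sampleResidualVariance,
probLocusHasEffect, and so on) are hypothetical stand-ins for the full
conditional draws.
\begin{verbatim}
// Illustrative sketch of the Gibbs sampling loop described above.
// Every helper function here is a hypothetical placeholder.
for (int iter = 1; iter <= chainLength; iter++) {
  varResidual = sampleResidualVariance();
  intercept   = sampleIntercept();
  for (int j = 0; j < numberOfLoci; j++) {
    adjustPhenotypesFor(j);              // remove locus j's current effect
    float u = uniform01();               // draw from a uniform distribution
    if (u < probLocusHasEffect(j)) {
      effects[j] = sampleLocusEffect(j); // locus fitted this iteration
      recordFitted(j);
    } else {
      effects[j] = 0.0f;                 // locus not fitted this iteration
    }
  }
  varEffects = sampleLocusEffectVariance();
  accumulatePosteriorMeans();            // running means over the chain
}
\end{verbatim}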
\section{CUDA Implementation}
We chose two different tactics for our CUDA implementation. On one hand,
...
will consist of identifying sections of code (often loops) where
multiple sets of data are used with the same code. To effectively
utilize the multi-thread model each thread should execute without
dependencies on the others. We believed the area with the greatest
potential for speedup on the GPU was the loop sampling the effects for
each marker locus, which might run over 50 thousand to 700 thousand
markers. Because the process behind sampling each marker's effects is
largely independent of the others, this loop is a prime target for
parallelization: where the sequential approach is an iterative loop with
a counter ranging over each marker, CUDA's thread model allows the
problem to scale by identifying each marker uniquely with a thread
index. The code for this loop follows:
\begin{verbatim}
void sampleEffects() {
  Eigen::Map<VectorXf> currentDataLocus(genotypeDataPointer, numObs);
  Eigen::Map<VectorXf> currentTestLocus(genotypeTestPointer, testObs);
  for (unsigned i = 0; i < numberMarkers; i++)
  {
    new (&currentDataLocus) Eigen::Map<VectorXf>(&dataPtr->X(0, i), numObs);
    new (&currentTestLocus) Eigen::Map<VectorXf>(&testPtr->X(0, i), testObs);
    float oldSamplei = solSample[i];
    // ISSUE A: rhsModeli is computed from the current value of RInverseY
    float rhsModeli = (currentDataLocus.dot(RInverseY)) + diagLhs(i) * oldSamplei;
    rhsModeli *= invVarResidual;
    float Lhs = diagLhs[i] * invVarResidual + invVarEffects;
    float probDeltaOne = 1.0f / (1.0f + exp(logDeltaZero + 0.5f * (log(Lhs) + logVarEffects - rhsModeli*rhsModeli/Lhs) - logPiComp));
    if (matvec::ranf() < probDeltaOne) {
      sigma[i] = varEffects; // sigma[i] is used in sampleCommonSigma
      solSample[i] = matvec::snorm() * sqrt(1.0f/Lhs) + rhsModeli/Lhs;
      meanEffects[i] += (solSample[i] - meanEffects[i]) * invThisIter;
      sdDelta1[i] += (solSample[i] * solSample[i] - sdDelta1[i]) * invThisIter;
      mlFreq[i] += (1.0f - mlFreq[i]) * invThisIter;
      estLocusVars[i] += (varEffects - estLocusVars[i]) * invThisIter;
      // ISSUE B: RInverseY is adjusted when the locus gets a sample
      RInverseY += ((currentDataLocus.array() * dataPtr->vectorRInverse.array()) * (oldSamplei - solSample[i])).matrix();
      genotValueVec += currentDataLocus * solSample[i];
      if (iteration % outputFreq == 0) {
        testGenotValueVec += currentTestLocus * solSample[i];
        if (wantWindowBV) {
          windowBV.col(dataPtr->WindowcM[i]) += currentDataLocus * solSample[i];
          cumSample[numberEffects].icolumn = i; // store samples
          cumSample[numberEffects].fSample = solSample[i];
        }
      }
      int istart = max(0, int(i - windowWidth));
      unsigned iend = min(i + windowWidth + 1, numberMarkers);
      for (unsigned j = istart; j < iend; j++)
        windowStatus[j] = 1.f; // record this window had a SNP fitted in the model
      numberEffects++;
    }
    else {
      // ISSUE C: RInverseY may still be adjusted if the locus had a
      // sample on the previous iteration of the Markov Chain
      if (oldSamplei)
        RInverseY += ((currentDataLocus.array() * dataPtr->vectorRInverse.array()) * oldSamplei).matrix();
      sigma[i] = 0.0;
      solSample[i] = 0.0f;
      meanEffects[i] -= meanEffects[i] * invThisIter;
      sdDelta1[i] -= sdDelta1[i] * invThisIter;
      mlFreq[i] -= mlFreq[i] * invThisIter;
      estLocusVars[i] -= estLocusVars[i] * invThisIter;
    }
  } // end of loop over markers (i is the loop variable)
  /*float*/ genVar = calc_variance(genotValueVec);
  /*float*/ meanGV += (genVar - meanGV) * invThisIter;
  /*VectorXf*/ mlFreqWindow += (windowStatus - mlFreqWindow) * invThisIter;
  /*VectorXf*/ windowStatus.setZero();
}
\end{verbatim}
The code is fairly straightforward; it is mainly a series of floating
point calculations, and it samples from a uniform distribution to
determine whether the locus effect gets a sample or not. The main
barrier against parallelization takes place where the comments in the
code are marked "ISSUE A", B, and C. At the beginning of each iteration
the variable rhsModeli is calculated from a dot product between the
current locus data and RInverseY (ISSUE A). But at "ISSUE B", RInverseY
is adjusted if the locus gets a sample. Even if it does not get a
sample, "ISSUE C" highlights that RInverseY may still be adjusted if the
locus previously got a sample on the last iteration of the Markov Chain.
A straightforward port, which for each iteration of the chain would move
the genotype and phenotype data to the GPU, launch a kernel with one
thread per locus, and move the results back before sampling the locus
effect variance and accumulating the posterior means, therefore cannot
sample the loci completely independently: each locus may need an
RInverseY that an earlier locus should already have updated.
We spoke with Dorian Garrick, a contributor to the Bayes C algorithm and
codebase, who told us that a marker locus will only be sampled in about
1-5\% of the iterations in the sample effects loop we are targeting with
50 thousand loci, and that percentage drops to 0.01-0.05\% for 700
thousand. Our revised pseudocode algorithm to skirt around this issue
follows:
\begin{verbatim}
sampleEffects() {
  // copies data from Eigen objects to the GPU
  copyDataToGPU();
  int sampleIndex = -1;
  bool done = false;
  while (!done) {
    if (notFirstTime)
      copyRInverseYToGPU();
    // sampleIndex is the last locus that got a sample; the kernel only
    // runs the loci after it
    launchSampleEffectsKernel(sampleIndex);
    copyDataFromGPU();
    done = true;
    for (loci in [ sampleIndex + 1 .. numberMarkers]) {
      if (hasSample(loci)) {
        adjustRInverseY();
        numberOfEffects++;
        sampleIndex = loci;
        done = false;
        break;
      }
      else if (hasPreviousSample(loci)) {
        adjustRInverseY();
        sampleIndex = loci;
        done = false;
        break;
      }
    }
  }
  copyDataFromGPU();
}
\end{verbatim}
The idea of the algorithm is still to launch a kernel to compute the
effects of each marker locus in parallel, but upon completion of each
kernel the host checks whether RInverseY should have been adjusted for
any of the loci. If so, it adjusts RInverseY and re-launches the kernel
with the previously sampled marker index so that earlier marker loci
don't run again. This presents obvious overhead in that, to calculate
the effects for every marker, the algorithm potentially does as many
kernel launches as there are marker loci. Additionally, there is more
time wasted copying RInverseY to the GPU whenever it needs to be
adjusted. While the algorithm may run in parallel for some marker loci,
the cost is high for those iterations of the Markov Chain with more than
a few samples in the set: each time the kernel returns, all remaining
marker loci have to be checked for samples.
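To make the kernel side of this concrete, a minimal sketch of what the
per-locus kernel could look like follows. This is our reading of the
approach rather than code from the project: the buffer names, the
pre-generated random number arrays randUniform and randNormal (since
matvec::ranf() and matvec::snorm() are host routines), and the fitted
flag array are all assumptions. The host scans fitted after each launch
and adjusts RInverseY exactly as in the pseudocode above.
\begin{verbatim}
// Hypothetical per-locus kernel sketch; X is the numObs x numberMarkers
// genotype matrix in column-major order, and all pointers are device
// buffers prepared by copyDataToGPU(). One thread samples one locus.
__global__ void sampleEffectsKernel(const float* X, const float* RInverseY,
                                    const float* diagLhs,
                                    const float* randUniform,
                                    const float* randNormal,
                                    const float* oldSample,
                                    float* newSample, int* fitted,
                                    int numObs, int numberMarkers,
                                    int firstLocus, float invVarResidual,
                                    float invVarEffects, float logVarEffects,
                                    float logDeltaZero, float logPiComp)
{
  int i = firstLocus + blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= numberMarkers) return;

  // dot product of this locus's genotype column with RInverseY (ISSUE A)
  float rhs = 0.0f;
  for (int k = 0; k < numObs; k++)
    rhs += X[i * numObs + k] * RInverseY[k];
  rhs = (rhs + diagLhs[i] * oldSample[i]) * invVarResidual;

  float lhs = diagLhs[i] * invVarResidual + invVarEffects;
  float probDeltaOne = 1.0f / (1.0f + expf(logDeltaZero
                       + 0.5f * (logf(lhs) + logVarEffects - rhs * rhs / lhs)
                       - logPiComp));

  if (randUniform[i] < probDeltaOne) {
    newSample[i] = randNormal[i] * sqrtf(1.0f / lhs) + rhs / lhs;
    fitted[i] = 1;  // host sees this, adjusts RInverseY, and re-launches
  } else {
    newSample[i] = 0.0f;
    fitted[i] = 0;
  }
}
\end{verbatim}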
\subsection{Parallelizing Linear Algebra using cuBLAS}
...
that rather than using a CPU-bound library for Linear Algebra math
called Eigen, we instead used cuBLAS to do all of the data manipulation.
Using cuBLAS is fairly simple. For example, to parallelize a dot product
on two single-precision floating point arrays you can use the cublasSdot()
function. This function takes the cuBLAS handle, the length of the
vectors, the two floating point arrays (each with a stride), and a
pointer in which to return the result. The obvious benefit of using
functions such as this is that it takes out the tedious work of defining
a kernel for what would be a library call in CPU space. For our working
set we have used cuBLAS to parallelize the dot product in the
sampleEffectsBayesC function where the currentDataLocus needs to be
combined with the current RInverseY. The original code base did this
like so:
\begin{verbatim}
float rhsModeli = (currentDataLocus.dot(RInverseY)) + diagLhs(i) * oldSamplei;
\end{verbatim}
To do this in cuBLAS you can do the following:
\begin{verbatim}
cudaMalloc((void**)&devCurrentDataLocus, numObs*sizeof(float));
cudaMalloc((void**)&devRInverseY, numObs*sizeof(float));
cudaMemcpy(devCurrentDataLocus, currentDataLocus.data(), numObs*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(devRInverseY, RInverseY.data(), numObs*sizeof(float), cudaMemcpyHostToDevice);
float rhsModeli;
// the device copies are passed to cublasSdot with unit strides
cublasSdot(cublasHandle, numObs, devCurrentDataLocus, 1, devRInverseY, 1, &rhsModeli);
rhsModeli += (diagLhs(i) * oldSamplei);
\end{verbatim}
You must also set up the handle that cuBLAS uses to access the GPU; this
can be done like so:
\begin{verbatim}
cublasHandle_t cublasHandle = 0;
cublasStatus_t cublasStatus;
cublasStatus = cublasCreate(&cublasHandle);
\end{verbatim}
Although in this example it looks like using cuBLAS is more complicated,
note that normally you could not simply call dot() on a float array;
that is only allowed above because currentDataLocus and RInverseY are
matvec objects. Similarly, in the cuBLAS code we use the .data() function
of the matvec vector to extract the underlying float array.
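Putting the pieces together, a small self-contained example of the same
cuBLAS dot product (with dummy data standing in for the genotype
vectors, and with the handle creation, copies, and cleanup in one place)
might look like this:
\begin{verbatim}
// Stand-alone sketch of the cublasSdot() call used above; the vectors
// are filled with dummy values rather than real genotype data.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
  const int numObs = 1024;
  std::vector<float> x(numObs, 1.0f), y(numObs, 2.0f);

  float *devX = 0, *devY = 0;
  cudaMalloc((void**)&devX, numObs * sizeof(float));
  cudaMalloc((void**)&devY, numObs * sizeof(float));
  cudaMemcpy(devX, x.data(), numObs * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(devY, y.data(), numObs * sizeof(float), cudaMemcpyHostToDevice);

  cublasHandle_t handle = 0;
  if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) {
    printf("cublasCreate failed\n");
    return 1;
  }

  // result is returned to host memory (the default pointer mode)
  float result = 0.0f;
  cublasSdot(handle, numObs, devX, 1, devY, 1, &result);
  printf("dot = %f (expected %f)\n", result, 2.0f * numObs);

  cublasDestroy(handle);
  cudaFree(devX);
  cudaFree(devY);
  return 0;
}
\end{verbatim}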
\section{Results}
For testing purposes we used a system with an EVGA 560 Ti running CentOS
6.3. The average run time for the linear BayesC algorithm was 20.689s.
\subsection{Sequential Algorithm}
The five most heavily used methods of the sequential GenSel program follow:
\begin{verbatim}
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ks/call Ks/call name
97.88 1396.70 1396.70 41000 0.00 0.00 Bayes::sampleEffectsBayesC()
1.00 1411.00 14.30 137046 0.00 0.00 void Eigen::internal::solve_retval >, Eigen::CwiseNullaryOp, Eigen::Matrix > >::evalTo >(Eigen::Matrix&) const
0.42 1417.04 6.04 1 0.01 1.42 Bayes::singleSiteBayesC()
0.32 1421.59 4.55 1 0.00 1.43 Bayes::setup(Options&, Data&, Data&)
0.22 1424.78 3.19 41000 0.00 0.00 Bayes::sampleOther()
\end{verbatim}
As expected, the majority of the computation time is spent in
sampleEffectsBayesC, which takes over 23 minutes. The next most
expensive are a matrix operation, the loop that comprises the Markov
Chain, setup, and a sampling loop similar to the main effect-sampling
loop.
\subsection{Parallelizing across Loci}
Unfortunately our kernel parallelizing across loci does not work all the
way through right now. Currently it can complete seven iterations of the
Markov Chain in roughly 3 minutes on the CSC Lab Machines, which is
orders of magnitude slower than the sequential version, which can finish
over 100 iterations of the Markov Chain in under a second. We believe
the main barrier to performance is the numerous kernel launches, which
slow the program down between checking for samples, readjusting the
RInverseY vector, and copying data across to the GPU for another kernel
launch.
\subsection{Parallelizing Linear Algebra}
When we attempted to make a dot product parallel using cublasSdot() our
result was actually a significant slowdown. When changing the
aforementioned dot product to use the cuBLAS API our execution time
increased to 164.972s.
We analyzed our changes using nvprof to determine where the extra time
was coming from and discovered that 100\% of the measured GPU time was
spent in the host-to-device memory copy:
\begin{verbatim}
======== Profiling result:
Time(%) Time Calls Avg Min Max Name
100.00 864ns 1 864ns 864ns 864ns [CUDA memcpy HtoD]
0.00 0ns 1 0ns 0ns 0ns void dot_kernel(cublasDotParams)
0.00 0ns 1 0ns 0ns 0ns void reduce_1Block_kernel(float*, int, float*)
\end{verbatim}
This is an unusual result and must be further investigated. One
possibility is that, since the dot product is not the most
computationally expensive operation and the data set is small, the
amount of time spent in the kernel is just too small for nvprof to
measure. If this is the case then the cost of 864ns for transferring
memory does not pay off.
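One way to check this would be to time the transfer and the dot product
separately with CUDA events rather than relying on nvprof's resolution.
A rough sketch follows, assuming devX, devY, hostX, numObs, and
cublasHandle are set up as in the earlier examples:
\begin{verbatim}
// Sketch: time the host-to-device copy and the cublasSdot() call
// separately to see whether the transfer really dominates.
cudaEvent_t start, afterCopy, afterDot;
cudaEventCreate(&start);
cudaEventCreate(&afterCopy);
cudaEventCreate(&afterDot);

cudaEventRecord(start);
cudaMemcpy(devX, hostX, numObs * sizeof(float), cudaMemcpyHostToDevice);
cudaEventRecord(afterCopy);

float result = 0.0f;
cublasSdot(cublasHandle, numObs, devX, 1, devY, 1, &result);
cudaEventRecord(afterDot);
cudaEventSynchronize(afterDot);

float copyMs = 0.0f, dotMs = 0.0f;
cudaEventElapsedTime(&copyMs, start, afterCopy);
cudaEventElapsedTime(&dotMs, afterCopy, afterDot);
printf("copy: %.3f ms, dot: %.3f ms\n", copyMs, dotMs);

cudaEventDestroy(start);
cudaEventDestroy(afterCopy);
cudaEventDestroy(afterDot);
\end{verbatim}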
\section{Conclusion}
Our efforts to create a parallel version of the BayesC algorithm used in
GenSel have not been very successful so far.
Creating an entirely new kernel for the algorithm has been especially
time-consuming and difficult. The complexity of the single-threaded
application, and its many dependencies on third-party libraries that
simply do not interface well with the CUDA kernel launching mechanisms,
have made this task hard to accomplish.
The cuBLAS libraries offered a simple-to-use way of making linear
algebra (such as the dot product example we explored) run in parallel on
the many cores of a GPU. These libraries, although useful, may not be
applicable to our project. As far as we can tell there are no large
matrix multiplications; the most common calculations are dot products
and element-wise array multiplications. These calculations can be done
in parallel quite easily and very quickly; however, they are not very
computationally expensive whether they are done in parallel or not. In
any case, it would appear that the cost of moving memory from the CPU to
the GPU (and back) was greater than the benefit we received from doing
the calculation on the GPU.
While we hoped that a kernel computing the effects for each marker in
parallel would offer a substantial speedup, so far the results have not
been there. As with the linear algebra library, it may be that copying
the data to the GPU takes more time than computing each locus
sequentially. But there is definite room for optimization, and the
algorithm is not fully functioning yet, so it may be too soon to tell.
\end{document}