
James Shirley edited results.tex about 11 years ago

Commit id: 3e361398793a597b24fd8c1200b00b3366a33d89


\section{Results}
For testing purposes we used a system with a Quadro 5000 GPU running CentOS 6.3.
\subsection{Sequential Algorithm}

Unfortunately, our kernel that parallelizes across loci does not yet work from start to finish. Currently it completes seven iterations of the Markov chain in roughly 3 minutes on the CSC Lab Machines, which is orders of magnitude slower than the sequential version, which finishes over 100 iterations of the Markov chain in under a second. We believe the main barrier to performance is the large number of kernel launches: between launches the program checks for samples, readjusts the RInverseY vector, and copies data to the GPU for the next kernel launch.

\subsection{Parallelizing Linear Algebra}
When we attempted to parallelize a dot product using cublasSdot(), the result was a significant slowdown: with the dot product changed to use the cuBLAS API, our execution time increased to 283.672 s. We analyzed our changes using nvprof (on a different machine, a GTX 560 Ti running Arch) to determine where the extra time was coming from and discovered that all of it was due to moving memory to the GPU.
\begin{verbatim} ======== Profiling result: