\documentclass[runningheads,a4paper]{llncs}
%\documentclass{llncs}
%\documentclass[preprint,a4paper]{elsarticle}
\usepackage{fullpage}
\usepackage{epstopdf}
\usepackage{amssymb}
\setcounter{tocdepth}{3}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{float}
\usepackage{subfig}
\usepackage{chngcntr}
\usepackage{placeins}
\usepackage[english]{babel}
\usepackage{url}
\usepackage{algorithmicx}
\usepackage{algorithm}
\usepackage{setspace}
\urldef{\mailsa}\path|[email protected],|
\urldef{\mailsb}\path|[email protected],|
\urldef{\mailsc}\path|[email protected]|
\newcommand{\keywords}[1]{\par\addvspace\baselineskip
\noindent\keywordname\enspace\ignorespaces#1}
\usepackage{listings}
\usepackage{cite}
\usepackage[labelfont=bf]{caption}
%\onehalfspacing

\begin{document}
%\begin{frontmatter}
\mainmatter % start of an individual contribution

\title{GPU-accelerated direct numerical simulations of decaying compressible turbulence employing a GKM-based solver}
\author{Nishant Parashar \and Balaji Srinivasan \and Sawan Suman \and Manish Agarwal}
\institute{Department of Applied Mechanics, Indian Institute of Technology, Delhi \\
%\mailsa
%\mailsb
%\mailsc\\
\url{http://www.am.iitd.ac.in}}
\maketitle

\section{Introduction}
The Gas Kinetic Method (GKM) proposed by Xu \cite{xu1998} has evolved into a viable method for simulating compressible flows in recent years. The main advantages of GKM over conventional Navier-Stokes-based solvers are its robustness over a wide range of Mach numbers and its ability to accurately capture shocks and the related fluid physics. Furthermore, if required, the GKM framework naturally accommodates deviations from equilibrium physics and the effects of extremely rarefied flows.
\par
GKM-based algorithms were originally proposed by Xu \cite{xu1998} for laminar flows. Later the scheme was used for simulating decaying turbulence as well. Kerimo et al. \cite{kerimo2007} successfully employed the scheme for simulating weakly compressible turbulence at a turbulent Mach number of 0.052. Liao et al. \cite{luo2009} compared the multidimensional gas kinetic scheme \cite{xu2004} with Xu's original quasi-one-dimensional GKM scheme \cite{xu1998} for simulating compressible turbulent flows up to a turbulent Mach number of 0.5. They showed that the accuracy and stability of the original scheme \cite{xu1998} deteriorate as the turbulent Mach number increases, whereas the multidimensional gas kinetic scheme \cite{xu2004}, simplified for smooth flows, remains accurate and stable when compared against pseudo-spectral direct numerical simulations. Kumar et al. \cite{kumar2012} employed the multidimensional GKM scheme \cite{xu2004}, enhanced with the weighted essentially non-oscillatory (WENO) method, to simulate compressible turbulent flows at a high turbulent Mach number of 1.75, establishing a benchmark for high Mach number direct numerical simulations.
\par
To the best of the authors' knowledge, GKM simulations have so far been parallelized only on conventional central processing unit (CPU) platforms. In recent years viable graphics processing unit (GPU) technology has become available for non-graphics computations. GPU platforms are purported to offer significantly higher accelerations than traditional CPUs, either on their own or in a suitable combination with traditional CPU platforms.
Such acceleration can be of tremendous help in performing DNS of compressible turbulence at high Reynolds and Mach numbers.
\par
In the recent past there have been several successful attempts at simulating fluid flows on GPUs. Traditionally, the reported speedups are measured against serial computation on a single CPU core. Shinn et al. \cite{shinn2010} performed DNS of turbulent flow in a square duct on a Tesla C1060 with a reported speedup of 15x. Tanno et al. \cite{tanno2013} performed turbulent flow simulations with both the lattice Boltzmann method and a conventional method on an NVIDIA C2050, with reported speedups of up to 40x. Francesco et al. \cite{francesco2013} performed DNS of a spatially evolving compressible mixing layer on a Tesla S2070 with a 22x speedup. However, all these GPU-based simulations were performed with either the Navier-Stokes (NS) equations or the lattice Boltzmann method (LBM).
\par
GPUs have a single instruction multiple thread (SIMT) architecture; hence not all simulation methodologies can be ported to GPUs efficiently. In general, Navier-Stokes-based finite difference and finite volume methods map well onto GPUs. Since GKM is emerging as the preferred method for high Mach number and high Reynolds number simulations, it is imperative to implement the GKM scheme on GPU platforms and systematically evaluate its performance.
\par
The objective of this work is to evaluate the performance of GKM on GPUs for simulating compressible turbulence. Towards this goal we perform direct numerical simulation of compressible decaying turbulence employing the analytical GKM (AGKM) scheme recently developed by Xuan and Xu \cite{li2013}. We extensively compare and evaluate the DNS results against the high-order accurate Navier-Stokes-based DNS results of Honein and Moin \cite{moin2004} and Samtaney et al. \cite{samtaney2001} over a range of turbulent Mach numbers (0.3-0.5) and Reynolds numbers (30-175), and we evaluate the speed-up and scalability of AGKM on the GPU platform.
\par
This paper is organized into five sections. In section 2, we review the AGKM scheme proposed by Xuan and Xu \cite{li2013} and discuss various aspects of its implementation on a GPU. In section 3, we discuss our plan of study. In section 4 we present the validation results and evaluate the performance of the AGKM scheme on the GPU platform in terms of speed-up and scalability. Section 5 concludes the work with a summary.

\section{Implementation of AGKM on GPU}
Recently, Xuan and Xu \cite{li2013} proposed a novel improved GKM scheme (henceforth called AGKM) based on an exact analytical solution of an arbitrary $n^{th}$-order Chapman-Enskog expansion of the BGK equation \cite{bhatnagar}. The use of a time-accurate analytical solution avoids the modelling of the temporal derivative that is otherwise required in the BGK-NS method. This makes AGKM potentially superior to higher-order schemes that rely on Runge-Kutta methods for time accuracy. The new scheme uses a continuous initial reconstruction, which significantly reduces the computational cost associated with the time evolution from discontinuous initial data. Xuan and Xu \cite{li2013} demonstrate that while the scheme preserves the stability, high resolution and fairly good shock-capturing performance of the earlier BGK-NS method, it has the additional advantage of enhanced efficiency, leading to shorter computational times.
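For orientation, we recall the standard form of the BGK model \cite{bhatnagar} on which all of the above schemes are built,
$$\frac{\partial f}{\partial t} + \mathbf{u}\cdot\nabla_{\mathbf{x}} f = \frac{g - f}{\tau},$$
where $f$ is the gas distribution function, $g$ the local equilibrium (Maxwellian) distribution and $\tau$ the relaxation time, together with its formal integral solution along a particle trajectory,
$$f(\mathbf{x},t,\mathbf{u}) = \frac{1}{\tau}\int_{0}^{t} g\big(\mathbf{x}-\mathbf{u}(t-t'),\,t',\,\mathbf{u}\big)\, e^{-(t-t')/\tau}\, dt' \;+\; e^{-t/\tau}\, f_{0}(\mathbf{x}-\mathbf{u}t,\mathbf{u}),$$
which is the common starting point of gas kinetic schemes \cite{xu1998}. These standard relations are reproduced here only for context; the exact analytical $n^{th}$-order solution actually used by AGKM is derived in \cite{li2013}.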
The superiority of the new method is demonstrated by Xuan and Xu \cite{li2013} through several stability, accuracy, time-efficiency and viscous/heat-conduction tests for 1-D laminar flows. From the point of view of implementation on a GPU platform, AGKM is expected to have several advantages. Hence, in this work we implement AGKM on a GPU platform. The reader is referred to Xuan and Xu \cite{li2013} for a full description of the AGKM scheme.
\par
GPUs have a single instruction multiple thread (SIMT) architecture and hence are very efficient at exploiting data parallelism. Finite volume and finite difference solvers generally expose data parallelism in their flux calculation: the derivatives over the whole grid can be computed by invoking a simple convolution (stencil) kernel, and the flux is easily obtained from the derivatives of the flow field. This makes such methods well suited to GPU parallelization. Although GKM-based solvers have a core finite volume structure, their flux calculation does not map to data parallelism as directly. The flux calculation is tedious: the macroscopic field is first processed to obtain Maxwellian distribution functions, and the flux is then computed from the BGK model. This process involves substantial computation that requires task parallelism.
\par
The GKM scheme proposed by Xu \cite{xu2004} involves discontinuous reconstruction, leading to the calculation of two Maxwellians, one on each side of a cell interface. From these two distribution functions an equilibrium distribution function is found, and from the three Maxwellians the flux is calculated. This flux calculation accounts for the bulk of the computation time and puts a heavy burden on the limited local thread memory of the GPU device, making it difficult to parallelize the scheme effectively for high speedups.
\par
On the other hand, the recently proposed AGKM scheme \cite{li2013} uses a simplified and time-accurate flux calculation based on the analytical solution of the BGK model. The continuous reconstruction at the cell interface reduces the flux calculations compared with the BGK-NS scheme \cite{xu2004}, and the time-accurate analytical solution avoids modelling of the temporal term. This makes AGKM suitable for GPU implementation.
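To make the data-parallel building block concrete, the following is a minimal illustrative sketch (not our production solver) of a derivative kernel of the kind discussed above: a second-order central difference in the $x$-direction on a periodic $N^3$ grid, using flattened 1-D thread indexing as adopted in section 4. The kernel and variable names are ours, chosen for illustration only.
\begin{lstlisting}[language=C]
// Sketch: second-order central difference in x on a periodic N^3 grid,
// one thread per cell, flattened 1-D indexing (illustrative only).
__global__ void deriv_x(const double *f, double *dfdx, int N, double dx)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N * N * N) return;

    // Recover 3-D coordinates from the flat index idx = (k*N + j)*N + i.
    int i = idx % N;
    int j = (idx / N) % N;
    int k = idx / (N * N);

    int ip = (i + 1) % N;        // periodic neighbours in x
    int im = (i - 1 + N) % N;

    dfdx[idx] = (f[(k * N + j) * N + ip]
               - f[(k * N + j) * N + im]) / (2.0 * dx);
}

// Typical launch: deriv_x<<<(N*N*N + 127)/128, 128>>>(d_f, d_dfdx, N, dx);
\end{lstlisting}
The actual solver uses a higher-order (5th-order) discretization and further kernels for the moment and flux evaluations, but the thread-to-cell mapping follows the same pattern.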
\section{Plan of study}
In the first part we validate AGKM as implemented on the GPU. Simulations of decaying isotropic compressible turbulence are performed. The computational domain is a cube with sides of length $2\pi$, and periodic boundary conditions are imposed on opposite faces. The initial velocity field is chosen to have a spectrum of the form:
\begin{equation}
E(k) = A k^{4}e^{-2k^{2}/k_{0}^{2}}
\end{equation}
where $k$ is the wavenumber, $k_{0}$ is the most energetic scale in the initial field and $A$ is a constant determining the peak of the spectrum. The initial pressure and density fields are uniform. Note that $A$ and $k_{0}$ together set the initial turbulence intensity through $\frac{3}{2}u'^{2} = \int_{0}^{\infty}E(k)\,dk = \frac{3\sqrt{2\pi}}{64}\,A\,k_{0}^{5}$.
The initial turbulent Mach number $M_{t0}$ and Taylor microscale Reynolds number $Re_{\lambda}$ are defined as:
$$M_{t0} = \sqrt{\frac{<u_{i}u_{i}>}{\gamma p_{0}/\rho_{0}}}$$
$$Re_{\lambda} = \frac{\rho_{0} \lambda u^{'}}{\mu_{0}}$$
where $u_{i}$ represents the velocity field and $<\hspace{.1cm}>$ denotes volume averaging. $p_{0}$, $T_{0}$ and $\rho_{0}$ represent the initial mean pressure, temperature and density respectively, $\mu_{0}$ is the mean viscosity, and $u^{'}$ is the fluctuating turbulent velocity, defined as $u^{'} = \sqrt{<\frac{u_{1}^2+u_{2}^2+u_{3}^2}{3}>}$.
\par
At low Mach number we validate our results by comparison against the high-order accurate NS-based results of Honein and Moin \cite{moin2004}, who performed DNS of decaying isotropic turbulence at an initial turbulent Mach number of 0.3 and a Reynolds number of 30. The authors used a sixth-order compact scheme with third-order Runge-Kutta time stepping, implemented with a conservative skew-symmetric splitting of the nonlinear terms. The comparison is performed in terms of normalized turbulent kinetic energy ($k^{'}$) and normalized root-mean-square pressure ($p^{'}$), specific volume ($V^{'}$) and temperature ($T^{'}$), defined as:
$$k^{'} = \frac{u_{1,rms}^{2} + u_{2,rms}^{2} + u_{3,rms}^{2}}{(c_{0}M_{t0})^{2}}$$
$$p^{'} = \frac{p_{rms}}{\gamma p_{0} M_{t0}^{2}}$$
$$V^{'} = \frac{V_{rms}}{M_{t0}^{2}/\rho_{0}}$$
$$T^{'} = \frac{T_{rms}}{(\gamma-1) T_{0} M_{t0}^{2}}$$
where $p_{rms}$, $V_{rms}$ and $T_{rms}$ represent the root-mean-square pressure, specific volume and temperature respectively.
\par
At high Mach numbers, the evaluation is performed by comparing AGKM results against those of Samtaney et al. \cite{samtaney2001}, who used a tenth-order Pade scheme with third-order Runge-Kutta time stepping. We validate AGKM results against two simulation cases from that work:
\begin{enumerate}
\item \textbf{Case 1:} $M_{t0} = 0.5$, $Re_{\lambda} = 72$, $k_{0} = 8$
\item \textbf{Case 2:} $M_{t0} = 0.488$, $Re_{\lambda} = 30$, $k_{0} = 4$
\end{enumerate}
The comparison is performed in terms of normalized turbulent kinetic energy ($k^{'}$), normalized root-mean-square density ($\rho^{'}$) and local Mach number ($M_{loc}$), which the authors define as:
$$k^{'} = \frac{<\rho u_{i} u_{i}>}{(c_{0}M_{t0})^{2}}$$
$$\rho^{'} = \frac{\rho_{rms}}{M_{t0}^{2}}$$
$$M_{loc} = \sqrt{\frac{u_{1}^{2} + u_{2}^{2} + u_{3}^{2}}{\gamma p / \rho}}$$
where $u_{i}$, $p$ and $\rho$ are the local velocity, pressure and density respectively, and $M_{t0}$ is the initial turbulent Mach number.
\par
In the second part of the study we evaluate the speed-up and scalability of the GPU implementation of AGKM. In scientific computing, the speed-up of a GPU algorithm is generally measured as the ratio of the simulation run time on a single CPU core to that on the GPU \cite{francesco2013, shinn2010, tanno2013}; we follow the same measure for our AGKM GPU implementation. GPU simulations on an NVIDIA Tesla K20m card are compared against single-core CPU computations on an Intel Xeon E5-2687W (3.1 GHz) processor. The average iteration time is computed by averaging the run time of 100 iterations over 10 runs of the solver.
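The following sketch illustrates how such per-iteration timings can be collected on the device using CUDA events; \texttt{step()} is a hypothetical stand-in for one AGKM iteration, not an actual routine of our solver.
\begin{lstlisting}[language=C]
// Sketch: measuring the average iteration time with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
for (int it = 0; it < 100; ++it)
    step();                        // one solver iteration (stand-in)
cudaEventRecord(stop);
cudaEventSynchronize(stop);        // wait until all iterations finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("avg. iteration time: %.3f ms\n", ms / 100.0f);

cudaEventDestroy(start);
cudaEventDestroy(stop);
\end{lstlisting}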
It is desirable for any parallel algorithm to exploit the hardware resources with equal efficiency as the problem size grows, within the memory constraints of the hardware; that is, a good parallel algorithm should handle increasing problem sizes while maintaining the speed-up achieved at small problem sizes (scalability). To evaluate the accuracy and performance of AGKM on the GPU platform, several simulation cases were performed; Table \ref{Table1} gives a brief description. For the speed-up study in section 4.3, all tests use identical initial conditions but different grid sizes.
\begin{table}[!htb]
\caption{Simulation cases}
\begin{center}
\begin{tabular*}{400pt}{ @{\extracolsep{\fill}} c c c c c c c}
\hline
Case & $M_{t0}$ & $Re_{\lambda}$ & $A$ & $k_{0}$ & grid & end time \\ \hline
$A$ & $0.3$ & $30$ & $0.000374$ & $4$ & $64^3$ & $3t_{0}$ \\
$B$ & $0.5$ & $72$ & $0.00013$ & $8$ & $128^3$ & $8t_{0}$ \\
$C$ & $0.488$ & $175$ & $0.011$ & $4$ & $256^3$ & $7t_{0}$ \\
\hline
\end{tabular*}
\end{center}
\label{Table1}
\end{table}

\section{Results and Discussion}
In this section we present our AGKM simulation results computed on a GPU platform. Several simulations have been performed over a range of Mach and Reynolds numbers. In section 4.1 we compare low Mach number simulations against the results of Honein and Moin \cite{moin2004}. In section 4.2 we compare high Mach number simulations against the results of Samtaney et al. \cite{samtaney2001}. Finally, in section 4.3 we evaluate the performance of AGKM on the GPU platform in terms of speed-up and scalability.

\subsection{Low Mach number}
\begin{figure}[hp!]
\centering
\subfloat[Turbulent kinetic energy decay]{\includegraphics[scale=.37]{TKE3-crop.pdf}}
\subfloat[RMS pressure]{\includegraphics[scale=.37]{VRMS3-crop.pdf}}\\
\subfloat[RMS temperature]{\includegraphics[scale=.37]{TRMS3-crop.pdf}}
\subfloat[RMS specific volume]{\includegraphics[scale=.37]{PRMS3-crop.pdf}}
\caption{DNS using the AGKM scheme on a $64^{3}$ grid, $Re_{\lambda} = 30$, $M_{t0} = 0.3$ (case A, Table \ref{Table1}).\newline Symbol $\textbf{o}$ represents data from Honein and Moin \cite{moin2004}; the solid line represents AGKM simulation results.}
\label{M3}
\end{figure}
\FloatBarrier
\par
In figure \ref{M3} we present AGKM simulation results for case A of Table \ref{Table1}. Turbulent kinetic energy $(k^{'})$, RMS pressure $(p^{'})$, temperature $(T^{'})$ and specific volume $(V^{'})$ are plotted against eddy turnover time ($t_{0}$). The agreement for each quantity is excellent. Based on these comparisons we conclude that AGKM simulates low Mach number turbulent flows quite well.

\subsection{High Mach number}
In figure \ref{M5} we present AGKM simulation results for case B of Table \ref{Table1}. Turbulent kinetic energy $(k^{'})$ and RMS density $(\rho^{'})$ are plotted against eddy turnover time ($t_{0}$). The results closely match the simulation results of Samtaney et al. \cite{samtaney2001}.
Similar agreement is observed at the higher Reynolds number of 175: in figure \ref{M488} we present AGKM simulation results for case C of Table \ref{Table1}, where turbulent kinetic energy $(k^{'})$ and the PDF of the local Mach number $(M_{loc})$ are plotted against eddy turnover time ($t_{0}$). The agreement for each quantity is again excellent. Based on these comparisons we conclude that AGKM simulates high Mach number and high Reynolds number turbulent flows quite well.
\begin{figure}
\centering
\subfloat[Turbulent KE decay]{\includegraphics[scale=.38]{TKE5-crop.pdf}}
\subfloat[RMS density]{\includegraphics[scale=.38]{DRMS5-crop.pdf}}\\
\caption{DNS using the AGKM scheme on a $128^{3}$ grid, $Re_{\lambda} = 72$, $M_{t0} = 0.5$ (case B, Table \ref{Table1}).\newline Symbol $\textbf{o}$ represents data from Samtaney et al. \cite{samtaney2001}; the solid line represents AGKM simulation results.}
\label{M5}
\end{figure}
\FloatBarrier
\begin{figure}[h]
\centering
\subfloat[Turbulent KE decay]{\includegraphics[scale=.38]{TKE488-crop.pdf}}
\subfloat[PDF of local Mach number]{\includegraphics[scale=.38]{MachPDF-crop.pdf}}
\caption{DNS using AGKM on a $256^3$ grid, $M_{t0} = 0.488$, $Re_{\lambda} = 175$ (case C, Table \ref{Table1}).\newline Symbol $\textbf{o}$ represents data from Samtaney et al. \cite{samtaney2001}; the solid line represents AGKM simulation results.}
\label{M488}
\end{figure}
\FloatBarrier

\subsection{GPU Speedup and Scalability}
The speedups recorded for the 5th-order scheme are given in Table \ref{Table2}. All tests are run with the same initial conditions (case B, Table \ref{Table1}) on different grid sizes.
\begin{table}[!htb]
\caption{GPU speedup; initial conditions as in case B, Table \ref{Table1}}
\begin{center}
\begin{tabular*}{400pt}{ @{\extracolsep{\fill}} c c c c }
\hline
grid & avg. iteration time, CPU (ms) & avg. iteration time, GPU (ms) & speed-up \\ \hline
$64^3$ & 827 & 20 & 41 \\
$128^3$ & 6910 & 151 & 46 \\
$192^3$ & 25430 & 510 & 50 \\
$256^3$ & 57482 & 1128 & 51 \\ \hline
\end{tabular*}
\end{center}
\label{Table2}
\end{table}
All computations in the scheme are performed on the GPU device. As already discussed, the solver has a structure similar to a general NS-based finite volume scheme. The first implementation of the algorithm used the device's global memory exclusively, with 3-dimensional thread indexing, and yielded a maximum speedup of 10-14x.
\par
For better control and usage of GPU resources, 1-dimensional thread indexing is used instead. Since the derivatives of density, momentum and pressure are independent of each other, they are computed concurrently using streams. Similarly, the other kernels, viz. the flow field update and the periodic boundary condition implementation, are also executed concurrently using streams. Implementing these alterations boosts the speedup to over 36x across all the problem sizes discussed in this paper. The block size has a significant effect on the speedup: while working with multiple streams, it is desirable to reduce the block size. The optimum block size while executing streams is 128 threads, and that for the flux kernel is 64 threads. This fine tuning results in a further performance boost of $\scriptstyle\mathtt{\sim}$3-4x, reaching a total speedup of 40-50x across problem sizes.
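The stream-based concurrency pattern described above is sketched below. The kernels \texttt{deriv\_rho}, \texttt{deriv\_mom} and \texttt{deriv\_p} are illustrative stand-ins for the independent derivative computations of density, momentum and pressure; the names and launch configuration are ours, not the production solver's.
\begin{lstlisting}[language=C]
// Sketch: independent derivative kernels issued on separate streams,
// allowing the device to overlap their execution.
cudaStream_t s[3];
for (int i = 0; i < 3; ++i)
    cudaStreamCreate(&s[i]);

int threads = 128;                          // tuned block size for streams
int blocks  = (N * N * N + threads - 1) / threads;

deriv_rho<<<blocks, threads, 0, s[0]>>>(d_rho, d_drho, N, dx);
deriv_mom<<<blocks, threads, 0, s[1]>>>(d_mom, d_dmom, N, dx);
deriv_p  <<<blocks, threads, 0, s[2]>>>(d_p,   d_dp,   N, dx);

cudaDeviceSynchronize();                    // join before the flux kernel

for (int i = 0; i < 3; ++i)
    cudaStreamDestroy(s[i]);
\end{lstlisting}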
\par
We also observe that the current scheme is scalable: the speedup {\it increases} with increasing grid size, as shown in Table \ref{Table2}. The power usage also increases, from $\scriptstyle\mathtt{\sim}$95 W at the $64^3$ grid size to $\scriptstyle\mathtt{\sim}$130 W at the $256^{3}$ grid size, confirming that more GPU cores are being utilized concurrently.

\section{Conclusions}
We evaluate the performance of the recently proposed analytical gas kinetic method (AGKM) in simulating fully resolved compressible decaying turbulence over a range of Mach and Reynolds numbers. We have successfully implemented the AGKM scheme on GPUs and demonstrate that it shows excellent agreement with high-order accurate NS-based DNS results. We conclude that AGKM is far better suited to GPU implementation than the original multidimensional GKM. The current memory limit of the GPU constrains grid sizes to at most $256^3$; hence a data-division scheme over multiple GPUs would be necessary to keep the total run time tractable when larger grids are required.

\bibliographystyle{plain}
%\onehalfspacing
\bibliography{reference}
\end{document}