\documentclass[review]{elsarticle}
%\documentclass[preprint,12pt]{elsarticle}
\usepackage{fullpage}
\usepackage{amssymb}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{float}
\usepackage{subfig}
\usepackage{placeins}
\usepackage[english]{babel}
\usepackage{url}
\usepackage{algorithmicx}
\usepackage{algorithm}
\usepackage{setspace}
\urldef{\mailsa}\path|parasharnishant9@gmail.com,|
\urldef{\mailsb}\path|balaji@am.iitd.ac.in,|
\urldef{\mailsc}\path|sawan@am.iitd.ac.in|
\usepackage[labelfont=bf]{caption}
\journal{Computers \& Fluids}
%%%%%%%%%%%%%%%%%%%%%%%
%% Elsevier bibliography styles
%%%%%%%%%%%%%%%%%%%%%%%
%% To change the style, put a % in front of the second line of the current style and
%% remove the % from the second line of the style you would like to use.
%%%%%%%%%%%%%%%%%%%%%%%
%% Numbered
%\bibliographystyle{model1-num-names}
%% Numbered without titles
%\bibliographystyle{model1a-num-names}
%% Harvard
%\bibliographystyle{model2-names.bst}\biboptions{authoryear}
%% Vancouver numbered
%\usepackage{numcompress}\bibliographystyle{model3-num-names}
%% Vancouver name/year
%\usepackage{numcompress}\bibliographystyle{model4-names}\biboptions{authoryear}
%% APA style
%\bibliographystyle{model5-names}\biboptions{authoryear}
%% AMA style
%\usepackage{numcompress}\bibliographystyle{model6-num-names}
%% `Elsevier LaTeX' style
\bibliographystyle{elsarticle-num}
%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
\begin{frontmatter}
\title{GPU-accelerated direct numerical simulations of decaying compressible turbulence employing a GKM-based solver}
\author{Nishant Parashar} %\corref{cor1}}%\fnref{myfootnote}}
%\cortext[cor1]{Corresponding author. Tel: +919810392938}
\ead{nishantparashar14@gmail.com}
\author{Balaji Srinivasan}
\author{Sawan Suman}
\author{Manish Agarwal}
\address{Indian Institute of Technology, Hauz Khas, New Delhi 110016, India}
%\fntext[myfootnote]{Since 1880.}
%% or include affiliations in footnotes:
%\author[mymainaddress]{Department of Applied Mechanics, IIT Delhi, Hauz Khas, New Delhi, Delhi 110016, India}
%\author[mysecondaryaddress]{Global Customer Service\corref{mycorrespondingauthor}}
\cortext[mycorrespondingauthor]{Corresponding author. Tel: +919810392938}
%\address[mymainaddress]{1600 John F Kennedy Boulevard, Philadelphia}
%\address[mysecondaryaddress]{360 Park Avenue South, New York}
\begin{abstract}
.............................................................
\end{abstract}
\begin{keyword}
GPU, Direct numerical simulation, Compressible turbulence, Gas Kinetic Method
\end{keyword}
\end{frontmatter}
%\linenumbers
\section{Introduction}
The Gas Kinetic Method (GKM) proposed by Xu \cite{xu1998} has evolved into a viable method for simulating compressible flows in recent years. The main advantages of GKM over conventional Navier-Stokes (NS) based solvers are its robustness over a wide range of Mach numbers and its ability to accurately capture shocks and the related fluid physics. Furthermore, if required, the GKM framework allows for a natural accommodation of deviations from equilibrium and the effects of extremely rarefied flow physics.
\par
GKM-based algorithms were originally proposed by Xu \cite{xu1998} for laminar flows. The Chapman-Enskog expansion of the BGK model \cite{bhatnagar} constrains the Prandtl number to unity; Prandtl number corrections are suggested by Xu \cite{xu1998} and May et al. \cite{balaji2007}. Later the scheme was used for simulating decaying turbulence as well. Kerimo et al. \cite{kerimo2007} successfully employed the scheme for simulating weakly compressible turbulence at a turbulent Mach number of 0.052. Later Liao et al. \cite{luo2009} compared the multidimensional GKM \cite{xu2004} with the quasi one-dimensional GKM scheme \cite{Xu2001} for simulating compressible turbulent flows at turbulent Mach numbers up to 0.5. They showed that the accuracy and stability of the quasi one-dimensional scheme \cite{Xu2001} deteriorate as the turbulent Mach number increases, whereas the multidimensional GKM scheme \cite{xu2004}, simplified for smooth flows, remains accurate and stable compared with high-order NS-based direct numerical simulations.
Kumar et al. \cite{kumar2012} employed the multidimensional GKM scheme \cite{xu2004}, enhanced with the weighted essentially non-oscillatory (WENO) method, to simulate compressible turbulent flows at a high turbulent Mach number of 1.75, establishing a benchmark for high Mach number direct numerical simulations.
\par
In recent years viable graphics processing unit (GPU) technology has become available for general-purpose non-graphics computations. GPU platforms can offer significantly higher acceleration than traditional CPUs, either on their own or in a suitable combination with traditional CPU platforms. Such acceleration can be of tremendous help in performing DNS of compressible turbulence at high Reynolds and Mach numbers.
In the recent past there have been several successful attempts at simulating fluid flow on GPUs. Traditionally, these speedups are reported against single-core serial CPU computation. Shinn et al. \cite{shinn2010} performed direct numerical simulation (DNS) of turbulent flow in a square duct on an NVIDIA Tesla C1060 GPU with a reported speedup of 15x. Tanno et al. \cite{tanno2013} performed turbulent flow simulations with the lattice Boltzmann method (LBM) on an NVIDIA C2050 GPU with a reported speedup of up to 40x. Francesco et al. \cite{francesco2013} performed DNS of a spatially evolving compressible mixing layer on an NVIDIA Tesla S2070 with a 22x speedup. However, all these GPU-based simulations have been performed either with the Navier-Stokes (NS) equations or with LBM.
\par
So far, to the best of the authors' knowledge, GKM simulations have been parallelized only on conventional central processing unit (CPU) platforms. GPUs have a single instruction multiple thread (SIMT) architecture, and hence not all simulation methodologies can be ported efficiently to GPUs. In general, Navier-Stokes based finite difference and finite volume methods map well onto GPUs. Since GKM is emerging as a preferred method for high Mach number and high Reynolds number simulations, it is imperative to implement the GKM scheme on GPU platforms and systematically evaluate its performance.
\par
The objective of this work is to evaluate the performance of GKM on GPUs for simulating compressible turbulence. Towards this goal we perform direct numerical simulations of compressible decaying turbulence employing the Analytical GKM (AGKM) scheme recently developed by Xuan et al. \cite{li2013}. We extensively compare and evaluate the DNS results against the high-order accurate Navier-Stokes based DNS results of Honein et al. \cite{moin2004} and Samtaney et al. \cite{samtaney2001} over a range of turbulent Mach numbers (0.3-0.5) and Reynolds numbers (30-175).
We evaluate the speed-up and scalability of AGKM on GPU platform.
\par
This paper is organized into five sections. In section 2, we review the AGKM scheme proposed by Xuan et al. \cite{li2013} and discuss various aspects of its implementation on GPUs. In section 3, we discuss our plan of study. In section 4 we evaluate the performance of the AGKM scheme on the GPU platform in terms of speed-up and scalability. Section 5 concludes the work with a summary.
\section{Implementation of AGKM on GPU}
Recently Xuan et al. \cite{li2013} have proposed a novel improved GKM scheme (henceforth called AGKM) based on the exact analytical solution of an arbitrary $n^{th}$ order Chapman-Enskog expansion of the Bhatnagar-Gross-Krook (BGK) equation \cite{bhatnagar}. From the point of view of implementation on a GPU platform, AGKM is expected to have several advantages. The reader is referred to Xuan et al. \cite{li2013} for a full description of the AGKM scheme.
\par
GPUs have a single instruction multiple thread (SIMT) architecture and hence are very efficient at exploiting data parallelism. Finite volume and finite difference based solvers in general expose data parallelism in the flux calculation: the derivatives of the flow field can be calculated over the whole grid at once, and the flux is easily obtained from these derivatives. This makes such methods well suited to GPU parallelization. Although GKM-based solvers have a core finite volume structure, their flux calculation does not expose the same degree of data parallelism.
The flux calculation is computationally involved: the macroscopic field is first processed to construct Maxwellian distribution functions, and then the flux is calculated using the BGK model. This whole process involves a large amount of computation that requires task parallelism.
\par
The GKM scheme proposed by Xu \cite{xu2004} involves a discontinuous reconstruction, leading to the calculation of two Maxwellians, one on each side of the cell interface. From these two Maxwellian distribution functions, an equilibrium distribution function is found, and the flux is calculated from all three distribution functions. The flux calculation accounts for the bulk of the computation time and places a heavy burden on the limited local thread memory of the GPU device, making it difficult to parallelize the scheme effectively for high speedups.
\par
On the other hand, the recently proposed AGKM scheme of Xuan et al. \cite{li2013} uses a simplified and time-accurate flux calculation based on the analytical solution of the BGK model. The use of the time-accurate analytical solution avoids the modelling of the temporal derivative that is otherwise required in the original GKM \cite{Xu2001}. This makes AGKM potentially superior to higher-order schemes that rely on Runge-Kutta methods for time accuracy. The new scheme uses a continuous initial reconstruction, which significantly reduces the computational cost associated with time evolution from discontinuous initial data. Xuan et al. \cite{li2013} demonstrate that while the scheme preserves the stability, high resolution and fairly good discontinuity-capturing performance of the earlier GKM method, it has the additional advantage of enhanced efficiency leading to faster computation. The superiority of the new method is demonstrated by Xuan et al. \cite{li2013} through several stability, accuracy, time-efficiency and viscous/heat-conduction tests for 1-D laminar flows. These properties make AGKM suitable for GPU implementation, and thus in this work we implement AGKM on a GPU platform.
\section{Plan of study}
In the first part of this study, we perform an extensive validation of AGKM implemented on a GPU. We perform simulations of decaying isotropic compressible turbulence. The computational domain is cubical with sides of length $2\pi$. Periodic boundary conditions are imposed on the opposite sides of the cube. The initial velocity field is chosen to have a spectrum of the following form:
\begin{equation}
E(k) = A k^{4}e^{-2k^{2}/k_{0}^{2}}
\end{equation}
where $k$ represents the wavenumber, $k_{0}$ is the wavenumber corresponding to the peak of the energy spectrum, and $A$ is a constant chosen to obtain the desired initial kinetic energy. The initial pressure and density fields are kept constant.
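The identification of $k_{0}$ as the spectral peak follows directly from the model spectrum: differentiating and setting the derivative to zero gives

```latex
\frac{dE}{dk} = A\,k^{3}e^{-2k^{2}/k_{0}^{2}}\left(4 - \frac{4k^{2}}{k_{0}^{2}}\right) = 0
\quad \Rightarrow \quad k = k_{0}.
```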
The initial turbulent Mach number ($M_{t0}$) and the Taylor-microscale Reynolds number $(Re_{\lambda})$ are defined as:
\begin{equation}
M_{t0} = \frac{\sqrt{\langle u_{i} u_{i} \rangle}}{\langle c \rangle}
\end{equation}
\begin{equation}
Re_{\lambda} = \frac{\langle \rho \rangle \lambda u_{rms}}{\langle \mu \rangle}
\end{equation}
where $u_{i}$ represents the velocity field and $\langle \cdot \rangle$ denotes volume averaging.
$\rho$, $c$ and $\mu$ are the fluid density, local acoustic speed and fluid viscosity respectively, and $\lambda$ is the Taylor microscale. $u_{rms}$ is the root mean square (rms) turbulent velocity, defined as $u_{rms} = \sqrt{\langle u_{i} u_{i} \rangle / 3}$.
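For concreteness, these definitions can be sketched in a few lines of Python (an illustrative, self-contained snippet, not part of the solver; the function names are ours and the fields are stored as flat lists):

```python
import math

def volume_average(field):
    # <.> : average over all grid points (field stored as a flat list)
    return sum(field) / len(field)

def turbulence_stats(u, v, w, c, rho, mu, lam):
    # Turbulent Mach number and Taylor-microscale Reynolds number,
    # following the definitions in the text.
    uiui = volume_average([a * a + b * b + d * d for a, b, d in zip(u, v, w)])
    u_rms = math.sqrt(uiui / 3.0)               # rms turbulent velocity
    M_t = math.sqrt(uiui) / volume_average(c)   # turbulent Mach number
    Re_lam = volume_average(rho) * lam * u_rms / volume_average(mu)
    return M_t, Re_lam
```

For a uniform field with $u_i = (1,1,1)$ and $\langle c \rangle = 10$, for example, the snippet returns $M_t = \sqrt{3}/10$.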
\par
At low Mach number we validate our results by making comparisons against the high-order accurate NS-based results of Honein et al. \cite{moin2004} with an initial turbulent Mach number of 0.3 and Reynolds number of 30. Honein et al. \cite{moin2004} employed a sixth-order compact scheme along with a third-order Runge-Kutta scheme for time stepping. Their scheme was implemented using a conservative skew-symmetric splitting of the nonlinear terms. We perform our comparisons in terms of the normalized turbulent kinetic energy ($K^{'}$) and the normalized root mean square pressure ($P^{'}$), specific volume ($V^{'}$) and temperature ($T^{'}$), defined as:
\begin{equation}
K^{'} = \frac{u_{1,rms}^{2} + u_{2,rms}^{2} + u_{3,rms}^{2}}{(c_{0}M_{t0})^{2}}
\end{equation}
\begin{equation}
P^{'} = \frac{p_{rms}}{\gamma p_{0} M_{t0}^{2}}
\end{equation}
\begin{equation}
V^{'} = \frac{v_{rms}}{M_{t0}^{2}/\rho_{0}}
\end{equation}
\begin{equation}
T^{'} = \frac{T_{rms}}{(\gamma-1) T_{0} M_{t0}^{2}}
\end{equation}
where $p_{rms}$, $v_{rms}$ and $T_{rms}$ represent the root mean square pressure, specific volume and temperature respectively, and $p_{0}$, $T_{0}$, $\rho_{0}$ and $c_{0}$ represent the initial mean pressure, temperature, density and acoustic speed respectively. %The simulation time is scaled with eddy turnover time ($t_{0}$) defined as $t_{0} = 2\sqrt{3} / (k_{0}M_{t0})$.
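These normalizations are straightforward to compute from the rms fluctuations; the following Python sketch illustrates them (helper names are ours, not part of the solver):

```python
import math

def rms(field):
    # Root-mean-square fluctuation about the volume mean.
    mean = sum(field) / len(field)
    return math.sqrt(sum((x - mean) ** 2 for x in field) / len(field))

def normalized_stats(p, T, v, p0, T0, rho0, M_t0, gamma=1.4):
    # P', T', V' following the normalizations defined above.
    P_n = rms(p) / (gamma * p0 * M_t0 ** 2)
    T_n = rms(T) / ((gamma - 1.0) * T0 * M_t0 ** 2)
    V_n = rms(v) / (M_t0 ** 2 / rho0)
    return P_n, T_n, V_n
```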
\par
At high Mach numbers, the evaluation is performed by comparing AGKM results against those of Samtaney et al. \cite{samtaney2001}, who used a tenth-order Pade scheme along with a third-order Runge-Kutta scheme for time stepping.
We validate AGKM against two cases simulated by Samtaney et al. \cite{samtaney2001}: one with initial conditions $M_{t0} = 0.5$, $Re_{\lambda} = 72$ and the other with $M_{t0} = 0.488$, $Re_{\lambda} = 175$.
The comparison is performed in terms of the normalized turbulent kinetic energy $(K^{'})$, normalized root mean square density $(\rho^{'})$ and local Mach number $(M_{loc})$. The authors define these quantities as:
\begin{equation}
K^{'} = \frac{\frac{1}{2} \langle \rho u_{i} u_{i} \rangle}{K_{0}}
\end{equation}
\begin{equation}
\rho^{'} = \frac{\rho_{rms}}{M_{t0}^{2}}
\end{equation}
\begin{equation}
M_{loc} = \sqrt{\frac{u_{1}^{2} + u_{2}^{2} + u_{3}^{2}}{(\gamma p / \rho)}}
\end{equation}
where $u_{i}$, $p$ and $\rho$ are the local velocity, pressure and density respectively, $M_{t0}$ is the initial turbulent Mach number and $K_{0}$ is the initial turbulent kinetic energy. %The eddy turnover time ($t_{0}$) is defined as
%$t_{0} = \sqrt{\frac{32}{A}} (2\pi)^{1/4} k_{0}^{-7/2} $.
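The local Mach number above is a pointwise quantity; a minimal sketch of its evaluation (illustrative only, the helper name is ours):

```python
import math

def local_mach(u1, u2, u3, p, rho, gamma=1.4):
    # |u| divided by the local acoustic speed sqrt(gamma * p / rho).
    speed = math.sqrt(u1 * u1 + u2 * u2 + u3 * u3)
    return speed / math.sqrt(gamma * p / rho)
```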
In Table \ref{Table1} we present a summary of all the simulation cases that we employ for validation of the AGKM scheme for compressible decaying isotropic turbulence.
\par
In the second part of the study we evaluate the speed-up and scalability of the GPU-implemented AGKM scheme.
We define the speed-up of a GPU algorithm as the ratio of the simulation run time of a single-core CPU computation to that of the GPU computation. This methodology is accepted as a general measure of speedup in the realm of scientific computation \cite{francesco2013, shinn2010, tanno2013}, and hence we follow the same measure for the speed-up of the AGKM GPU implementation.
GPU simulations on an NVIDIA Tesla K20m card are compared against single-core CPU computations on an Intel Xeon E5-2687W, 3.1 GHz processor. The average iteration time is calculated by averaging the run time of 100 iterations over 10 runs of the solver.
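As a quick illustration of this measure, the speed-up column of Table \ref{Table2} follows directly from the ratio of the averaged iteration times (a sketch; the values are taken from the table):

```python
def speedup(cpu_ms, gpu_ms):
    # Speed-up = average single-core CPU iteration time / average GPU iteration time.
    return cpu_ms / gpu_ms

# The 64^3 and 256^3 rows of Table 2:
s_small = speedup(827, 20)       # about 41
s_large = speedup(57482, 1128)   # about 51
```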
\par
We also examine the scalability of our AGKM GPU solver. Scalability of an algorithm is its ability to handle increasing problem sizes with equal efficiency, maintaining the speed-up achieved at small problem sizes.
It is desirable for any parallel algorithm to exploit the hardware resources with equal efficiency for increasing problem sizes within the memory constraint of the hardware. To evaluate the performance of AGKM on the GPU platform, several simulation cases were performed. All these test cases have identical initial conditions but different grid sizes.
\begin{table}
\caption{Simulation cases employed for validation of AGKM}
\begin{center}
\begin{tabular*}{400pt}{ @{\extracolsep{\fill}} c c c c c c c c} \hline
Case & $M_{t0}$ & $Re_{\lambda}$ &A &$k_{0}$ &Prandtl No. & grid & end time \\ \hline
$A$ & $0.3$ & 30 & 0.000374 &4 &0.7 & $64^3$ & $3t_{0}$ \\ %\hline
$B$ & $0.5$ & 72 & 0.00013 &8 &0.7 & $128^3$ & $8t_{0}$ \\ %\hline
$C$ & $0.488$ & 175 & 0.011 &4 &0.7 & $256^3$ & $7t_{0}$ \\ %\hline
\hline
\end{tabular*}
\end{center}
\label{Table1}
\end{table}
\section{Results and Discussion}
In this section we present our AGKM simulation results computed on a GPU platform. Several simulations have been performed over a range of Mach and Reynolds numbers. In section 4.1.1 we validate low Mach number simulations against the results of Honein et al. \cite{moin2004}. In section 4.1.2 validation is performed at high Mach number against the results of Samtaney et al. \cite{samtaney2001}. Finally, in section 4.2 we evaluate the performance of AGKM in terms of speed-up and scalability on the GPU platform.
\subsection{Validation}
\subsubsection{Low Mach number}
In figure \ref{M3} we present AGKM simulation results for case A of Table \ref{Table1}. The turbulent kinetic energy $(K^{'})$ and the RMS pressure $(P^{'})$, temperature $(T^{'})$ and specific volume $(V^{'})$ are plotted against normalized simulation time ($t / t_{0}$). Clearly the agreement for each quantity is excellent. Based on these comparisons we conclude that AGKM simulates low Mach number turbulent flows quite well.
\begin{figure}[htb!]
\centering
\subfloat[Turbulent kinetic energy decay]{\includegraphics[scale=.37]{TKE3-crop.pdf}}
\subfloat[RMS pressure]{\includegraphics[scale=.37]{VRMS3-crop.pdf}}\\
\subfloat[RMS temperature]{\includegraphics[scale=.37]{TRMS3-crop.pdf}}
\subfloat[RMS specific volume]{\includegraphics[scale=.37]{PRMS3-crop.pdf}}
\caption{DNS simulations using the AGKM scheme on a $64^{3}$ grid, $Re_{\lambda}$ = 30, $M_{t} = 0.3$ (case A, Table \ref{Table1}).\newline Symbol $\textbf{o}$ represents data from Honein et al. \cite{moin2004}; solid line $\textbf{|}$ represents AGKM simulation results}
\label{M3}
\end{figure}
\FloatBarrier
\subsubsection{High Mach number}
In figure \ref{M5} we present AGKM simulation results for case B of Table \ref{Table1}. The turbulent kinetic energy $(K^{'})$ and RMS density $(\rho^{'})$ are plotted against normalized time ($t / t_{0}$). The results are in excellent agreement with the simulation results of Samtaney et al. \cite{samtaney2001}.
In figure \ref{M488} we present AGKM simulation results for case C of Table \ref{Table1}. The time evolution of the turbulent kinetic energy $(K^{'})$ and the probability density function (PDF) of the local Mach number, $P(M_{loc})$, at normalized time $t/t_{0} = 1.56$ are plotted. Clearly the agreement for each quantity is excellent.
Based on these comparisons we conclude that AGKM simulates high Mach number and high Reynolds number turbulent flows quite well.
\begin{figure}[ht]
\centering
\subfloat[Turbulent KE decay]{\includegraphics[scale=.38]{TKE5-crop.pdf}}
\subfloat[RMS Density]{\includegraphics[scale=.38]{DRMS5-crop.pdf}}\\
\caption{DNS simulations using the AGKM scheme on a $128^{3}$ grid, $Re_{\lambda}$ = 72, $M_{t}$ = 0.5 (case B, Table \ref{Table1}).\newline Symbol $\textbf{o}$ represents data from Samtaney et al. \cite{samtaney2001}; solid line $\textbf{|}$ represents AGKM simulation results}
\label{M5}
\end{figure}
\FloatBarrier
\begin{figure}[h]
\centering
\subfloat[Turbulent KE decay]{\includegraphics[scale=.38]{TKE488-crop.pdf}}
\subfloat[PDF of local Mach number at $t/t_{0} = 1.56$]{\includegraphics[scale=.38]{MachPDF-crop.pdf}}
\caption{DNS simulations using AGKM on a $256^3$ grid, $M_{t}$ = 0.488, $Re_{\lambda}$ = 175 (case C, Table \ref{Table1}).\newline Symbol $\textbf{o}$ represents data from Samtaney et al. \cite{samtaney2001}; solid line $\textbf{|}$ represents AGKM simulation results}
\label{M488}
\end{figure}
\FloatBarrier
\subsection{GPU Speedup and Scalability}
The speedups recorded for the 5th-order scheme are listed in Table \ref{Table2}. All tests are run with the same initial conditions (case B, Table \ref{Table1}) on different grid sizes.
\begin{table}[!htb]
\caption{GPU speedup: initial conditions (case B, Table \ref{Table1})}
\begin{center}
\begin{tabular*}{400pt}{ @{\extracolsep{\fill}} c c c c }
\hline
grid & avg. iteration time CPU (ms) & avg. iteration time GPU (ms) & speed-up \\ \hline
$64^3$ & 827 & 20 & 41 \\ %\hline
$128^3$ & 6910 & 151 & 46 \\ %\hline
$192^3$ & 25430 & 510 & 50 \\ %\hline
$256^3$ & 57482 & 1128 & 51 \\ \hline
\end{tabular*}
\end{center}
\label{Table2}
\end{table}
All the computations in the scheme are performed on the GPU device. A first implementation of the algorithm, with exclusive usage of the device's global memory and 3-dimensional thread indexing, yields a speedup of 10-14x.
\par
For better control and usage of GPU resources, 1-dimensional thread indexing is
used. Since the derivatives of density, momentum and pressure are independent
of each other, they are computed simultaneously using streams. Similarly, other
kernels, viz. updating the flow field and implementing the periodic boundary
conditions, are also executed concurrently using streams. Implementing the
above-mentioned alterations boosts the speedup significantly across all the
problem sizes discussed in the paper. The block size has a significant effect on
the speedup. While working with multiple streams, it is desirable to reduce the
block size. The optimum block size while executing streams is 128 threads,
while that for the flux kernel is 64 threads. This fine tuning results in a
further performance boost, reaching a total speed-up of 40-50x across
problem sizes as shown in Table \ref{Table2}.
\par
From Table \ref{Table2} we observe that the current scheme is scalable on the GPU platform: the speed-up {\it increases} with increasing grid size. This shows that the GPU resources are more fully utilized at larger grid sizes. This is evident from the fact that the power usage increases from $\scriptstyle\mathtt{\sim}$95 W at a $64^3$ grid size to $\scriptstyle\mathtt{\sim}$130 W at a $256^{3}$ grid size, confirming that more GPU cores are being utilized concurrently.
\section{Conclusions}
The purpose of this study is to demonstrate the efficiency of the newly developed Analytical GKM (AGKM) scheme
of Xuan et al. \cite{li2013} for simulating compressible turbulence on GPUs.
The analysis is performed in terms of the accuracy of the DNS results compared with high-order NS-based DNS simulations,
and in terms of the speedup and scalability of the algorithm on GPUs.
We evaluate the performance of the analytical gas kinetic
method in simulating fully resolved compressible decaying turbulence over a
range of Mach and Reynolds numbers. We demonstrate that the AGKM scheme shows excellent agreement with
high-order accurate NS-based DNS results. We conclude that AGKM is far better suited for implementation on GPUs than the original multidimensional GKM. The AGKM scheme was successfully implemented on an NVIDIA Tesla K20m GPU, with a speedup of 40-50x across problem sizes, and we observe that the scheme is scalable on GPU platforms.
The current memory available on the GPU restricts single-GPU simulations to grid sizes no greater than $256^3$. Hence a data-division scheme over multiple GPUs would be
necessary in order to keep the total run time tractable when larger grid sizes
are required.
%\bibliographystyle{plain}
%\onehalfspacing
\section*{References}
\bibliography{reference}
\end{document}
