Xavier Andrade edited Parallelization.tex  over 9 years ago

Commit id: 38854903601dfc6de81eb10002eba860a43261bd

\section{Parallelization, optimizations and GPUs}

Computational cost has been, and still is, a fundamental factor in the development of electronic structure methods, as the small spatial dimensions and fast movement of electrons severely limit the size of the systems that can be studied. In order to study systems of interest as realistically and accurately as possible, electronic structure codes must execute efficiently on modern computational platforms. This implies support for massively parallel platforms and modern parallel processors, including graphics processing units (GPUs).

This is not only important for established methods, but also for the implementation in code of new theories. Ideally, developers should be isolated as much as possible from the optimization and parallelization requirements. The simplicity of real-space grids allows us to provide developers of new methods with building blocks that they can use to produce highly efficient code without caring about the details of the implementation. In most cases these blocks allow developers to write code that is automatically parallel and efficient, and that transparently runs on massively parallel processors like GPUs. The operations available range from simple ones, like integration, linear algebra, and differential operators, to more sophisticated ones, like the application of a Hamiltonian or even solvers for differential equations.
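As an illustration of such a building block (a minimal sketch, not actual Octopus code, with all names invented for this example), the following snippet applies a second-order finite-difference Laplacian to a field stored as a linear array over the grid points, using a precomputed neighbor-index table. This is the pattern that lets a developer write grid operations without knowing how the grid is laid out:

```python
import numpy as np

def build_neighbor_table(nx, ny):
    """Map each point of an nx-by-ny grid (stored as a linear array)
    to its 4 nearest neighbors; -1 marks a missing (boundary) neighbor."""
    idx = np.arange(nx * ny).reshape(nx, ny)
    nbr = -np.ones((nx * ny, 4), dtype=int)
    nbr[idx[1:, :].ravel(), 0] = idx[:-1, :].ravel()   # neighbor at i-1
    nbr[idx[:-1, :].ravel(), 1] = idx[1:, :].ravel()   # neighbor at i+1
    nbr[idx[:, 1:].ravel(), 2] = idx[:, :-1].ravel()   # neighbor at j-1
    nbr[idx[:, :-1].ravel(), 3] = idx[:, 1:].ravel()   # neighbor at j+1
    return nbr

def laplacian(field, nbr, h):
    """Second-order finite-difference Laplacian on the linear array.
    Points outside the grid are treated as zero (Dirichlet boundary)."""
    acc = -4.0 * field
    for k in range(4):
        mask = nbr[:, k] >= 0
        acc[mask] += field[nbr[mask, k]]
    return acc / h**2
```

The caller never indexes the grid in two dimensions; the same loop structure works for any grid shape once the neighbor table is built.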
However, it is critical to expose an adequate level of abstraction, one that hides the performance details while still giving enough flexibility to the developers. For example, we have found that the traditional picture of a state as the basic object is not adequate for optimal performance, as it does not expose enough data parallelism~\cite{Andrade_2012_gpus}. In Octopus we use a higher-level interface where the basic object is a group of several states.

The developers work with a linear array that contains the values of the field for each grid point. Additional data structures provide information about each grid point and the structure of the grid. This level of abstraction makes it simple for developers to write code that works for different problem dimensionalities and for different kinds and shapes of grids. In terms of performance, by hiding the structure of the grid we can use sophisticated methods to control how the grid points are stored in memory, with the objective of using processor caches more efficiently in finite-difference operators. For this purpose we use space-filling curves~\cite{Peano_1890}, in particular the Hilbert curve~\cite{Hilbert_1891,Skilling2004}, as shown in Fig.~\ref{fig:gridmaps}.

Parallelization in Octopus is performed on different levels.
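To sketch why a space-filling curve improves cache behavior, the snippet below implements the standard iterative distance-to-coordinate conversion for the two-dimensional Hilbert curve (Octopus works on three-dimensional grids; the 2D case is shown for brevity, and this is an illustration, not the Octopus implementation). Storing grid points in the order given by their Hilbert distance keeps points that are close in space close in memory, so a finite-difference stencil touches nearby cache lines:

```python
def hilbert_d2xy(order, d):
    """Convert a distance d along the Hilbert curve into (x, y)
    coordinates on a 2**order by 2**order grid (classic iterative
    algorithm with quadrant rotations)."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:            # rotate the quadrant if needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y
```

Each unit step along the curve moves to a grid-adjacent cell, which is exactly the locality property the memory layout exploits.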
The most basic one is domain decomposition, where the grid is divided into different regions that are assigned to different processors. For most operations, only the boundaries of the regions need to be communicated among processors. Since the grid can have a complicated shape dictated by the shape of the molecule, we use the third-party library {\sc ParMETIS}~\cite{Karypis_1996} to partition the grid among processors. This library provides routines that divide the grid while balancing the number of points per processor and minimizing the size of the boundary regions, and hence the communication costs. An example of grid partitioning is shown in Fig.~\ref{fig:partitioning}.

To calculate physical observables numerically, quantities such as the charge density and the electrostatic potential are represented as discrete functions, with values defined over a finite number of mesh points. Two parameters define the adaptive mesh (Fig.~\ref{fig:partitioning}): the radius (the distance from the atoms within which points are kept) and the spacing (the distance between two consecutive points). The shape of the mesh can also be a parallelepiped or a sphere. Once the mesh is constructed, it is partitioned over the processes (6 in the example of the figure) using {\sc ParMETIS}. The real-space mesh approach is very efficient in terms of performance: {\sc Octopus} has shown high performance on current architectures~\cite{Andrade_2012,Alberdi_2014}, not only because of the mesh partitioning but also because of other levels of parallelization. A natural one is parallelization over states: every state (or wave-function) of the system is represented over the discrete mesh, and each state is independent of the others. In addition to the general mesh, another mesh is used for the Poisson solver.
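The construction of the adaptive mesh from the radius and spacing parameters can be sketched as follows (a toy illustration of the union-of-spheres shape, not the Octopus implementation; all names are invented): keep the points of a regular grid that lie within the given radius of at least one atom.

```python
import itertools
import numpy as np

def build_adaptive_mesh(atoms, radius, spacing):
    """Return the (N, 3) coordinates of the points of a regular grid
    with the given spacing that lie within `radius` of at least one
    atom, i.e. the union-of-spheres mesh shape."""
    atoms = np.asarray(atoms, dtype=float)
    lo = atoms.min(axis=0) - radius          # bounding box of the mesh
    hi = atoms.max(axis=0) + radius
    axes = [np.arange(lo[k], hi[k] + spacing, spacing) for k in range(3)]
    pts = np.array(list(itertools.product(*axes)))
    # Distance from every candidate point to every atom.
    dist = np.linalg.norm(pts[:, None, :] - atoms[None, :, :], axis=2)
    return pts[(dist <= radius).any(axis=1)]
```

The resulting list of points is what gets handed to the partitioner; its irregular shape is the reason a general graph-partitioning library is needed rather than a simple slab decomposition.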
This latter mesh is necessarily a parallelepiped and will be denoted as the \emph{cube}. The number of MPI processes involved in the general mesh and in the \emph{cube} mesh might differ, so a mapping between both is computed and stored. This map enables a rapid transfer between both representations.
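The idea of the precomputed map can be sketched in serial form as an index array (a hypothetical illustration that ignores the MPI redistribution between the two process sets, which is the delicate part in practice; all names are invented): each mesh point is assigned, once, the linear index of the cell it occupies in the enclosing parallelepiped, and subsequent transfers are plain gather/scatter operations.

```python
import numpy as np

def build_mesh_to_cube_map(mesh_points, spacing, origin):
    """For each mesh point, precompute the linear index of its cell in
    an enclosing regular 'cube' grid. Returns (map, cube_shape)."""
    ijk = np.rint((mesh_points - origin) / spacing).astype(int)
    shape = tuple(ijk.max(axis=0) + 1)
    return np.ravel_multi_index(tuple(ijk.T), shape), shape

def mesh_to_cube(field, cmap, shape):
    """Scatter mesh values into the cube; cells outside the mesh are 0."""
    cube = np.zeros(shape)
    cube.flat[cmap] = field
    return cube

def cube_to_mesh(cube, cmap):
    """Gather the mesh values back out of the cube."""
    return cube.flat[cmap]
```

Because the map is computed once and reused, each transfer costs a single indexed copy, which is what makes switching representations for the Poisson solver cheap.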