Xavier Andrade added Parallelization2.tex  over 9 years ago


Parallelization in Octopus is performed at different levels. The most basic one is domain decomposition, where the grid is divided into different regions that are assigned to each processor. For most operations, only the boundaries of the regions need to be communicated among processors. Since the grid can have a complicated shape dictated by the shape of the molecule, it is far from trivial to distribute the grid points among processors. For this task we use a third-party library called {\sc ParMETIS}~\cite{Karypis_1996}. This library provides routines to partition the grid, balancing the number of points per processor while minimizing the size of the boundary regions, and hence the communication costs. An example of grid partitioning is shown in Fig.~\ref{fig:partitioning}.

Additional parallelization is provided by other data-decomposition approaches that are combined with domain decomposition. These include parallelization over \(k\)-points and spin, and over Kohn-Sham states. The first strategy is quite efficient, since for each \(k\)-point or spin component the operations are independent. However, it is limited by the size of the system, and often cannot be used at all (as in the case of closed-shell molecules, for example). The efficiency of the parallelization over Kohn-Sham states depends on the type of calculation being performed. For ground-state calculations, the orthogonalization and subspace-diagonalization routines~\cite{Kresse_1996} require the communication of states. In Octopus this is handled by parallel dense linear-algebra operations provided by the ScaLAPACK library~\cite{scalapack}. For real-time propagation, on the other hand, orthogonality is preserved by the propagation itself~\cite{Castro_2006}, so there is no need to communicate Kohn-Sham states between processors. This makes real-time TDDFT extremely efficient on massively parallel computers~\cite{Andrade_2012,Schleife_2014}.
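The domain-decomposition idea can be illustrated with a toy sketch (this is not Octopus or ParMETIS code; all names are hypothetical, and the naive stripe partition stands in for the balanced, boundary-minimizing partition that ParMETIS computes): partition the points of an irregularly shaped 2-D grid among processes and identify the boundary points whose values would have to be communicated.

```python
from itertools import product

def disk_grid(radius):
    """Points of a finite grid with a molecule-like (here: circular) shape."""
    return [(x, y) for x, y in product(range(-radius, radius + 1), repeat=2)
            if x * x + y * y <= radius * radius]

def partition_by_stripes(points, nproc):
    """Naive partition: sort points by x and cut into equal chunks.
    This balances the point count per process, but unlike a graph
    partitioner it makes no attempt to minimize the boundary size."""
    pts = sorted(points)
    chunk = -(-len(pts) // nproc)  # ceiling division
    return {p: i // chunk for i, p in enumerate(pts)}

def boundary_points(points, owner):
    """Points whose 4-point neighbourhood crosses a partition boundary;
    these are exactly the values that must be exchanged between
    processes when applying a finite-difference stencil."""
    pset = set(points)
    halo = set()
    for (x, y) in points:
        for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nb in pset and owner[nb] != owner[(x, y)]:
                halo.add((x, y))
                break
    return halo

grid = disk_grid(8)
owner = partition_by_stripes(grid, 4)
halo = boundary_points(grid, owner)
print(len(grid), len(halo))
```

A real partitioner earns its keep precisely by making the `halo` set small relative to the total point count, since communication volume scales with the boundary rather than the volume of each region.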
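The independence of the \(k\)-point/spin level, and its size limitation, can be seen in a minimal sketch (hypothetical helper, not Octopus code):

```python
def distribute_kpoints(num_kpoints, num_procs):
    """Round-robin assignment of k-points to processes.  Operations for
    different k-points are independent, so this level needs no
    communication -- but it can keep at most num_kpoints processes busy."""
    return {k: k % num_procs for k in range(num_kpoints)}

# A closed-shell molecule has a single (Gamma) k-point and one spin
# channel, so this level degenerates to a single worker:
print(distribute_kpoints(1, 8))   # {0: 0}
```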
An operation that needs special care in parallel is the solution of the Poisson equation. Otherwise it constitutes a parallelization bottleneck, as a single Poisson solution is required independently of the number of states in the system. A considerable effort has been devoted to the problem of finding efficient parallel Poisson solvers that can keep up with the rest of the code~\cite{Garc_a_Risue_o_2013}. We have found that the most efficient methods are based on FFTs, which require a different domain decomposition to perform efficiently. This introduces the additional problem of transferring the data between the two different data partitions. In Octopus this is overcome by creating a mapping at the initialization stage and using it during execution to efficiently communicate only the data that is strictly necessary between processes~\cite{Alberdi_2014}.
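The idea of precomputing such a mapping can be sketched as follows (a simplified illustration under assumed 1-D partitions, not the actual Octopus implementation): given the owner of each point in the real-space partition and in the FFT partition, build once, at initialization, the list of points each process pair must exchange, and reuse it for every Poisson solution.

```python
def build_transfer_map(owner_grid, owner_fft):
    """Precompute which points each process must send to which other
    process when redistributing data from the real-space partition to
    the FFT partition.  Points that stay on the same process need no
    communication at all."""
    sends = {}
    for point, src in owner_grid.items():
        dst = owner_fft[point]
        if src != dst:
            sends.setdefault((src, dst), []).append(point)
    return sends

# Hypothetical 1-D example with 8 points and 2 processes:
owner_grid = {p: p // 4 for p in range(8)}  # block partition: 0-3 on rank 0, 4-7 on rank 1
owner_fft  = {p: p % 2 for p in range(8)}   # interleaved partition for the FFT
schedule = build_transfer_map(owner_grid, owner_fft)
print(schedule)   # {(0, 1): [1, 3], (1, 0): [4, 6]}
```

Because the schedule is built once, each redistribution during the run only moves the points listed in it, rather than performing an all-to-all exchange of the full grid.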