Xavier Andrade added Parallelization2.tex  over 9 years ago


Parallelization in Octopus is performed at different levels. The most basic one is domain decomposition, where the grid is divided into different regions that are assigned to each processor. For most operations, only the boundaries of the regions need to be communicated among processors. Since the grid can have a complicated shape dictated by the shape of the molecule, it is far from trivial to distribute the grid points among processors. For this task we use a third-party library called {\sc ParMETIS}~\cite{Karypis_1996}. This library provides routines to partition the grid, balancing the number of points per processor while minimizing the size of the boundary regions, and hence the communication costs. An example of grid partitioning is shown in Fig.~\ref{fig:partitioning}.

Additional parallelization is provided by other data-decomposition approaches that are combined with domain decomposition. These include parallelization over \(k\)-points and spin, and over Kohn-Sham states. The first strategy is quite efficient, since for each \(k\)-point or spin component the operations are independent. However, it is limited by the size of the system, and often cannot be used at all (as in the case of closed-shell molecules, for example). The efficiency of the parallelization over Kohn-Sham states depends on the type of calculation being performed. For ground-state calculations, the orthogonalization and subspace-diagonalization routines~\cite{Kresse_1996} require the communication of states. In Octopus this is handled by parallel dense linear-algebra operations provided by the ScaLAPACK library~\cite{scalapack}. For real-time propagation, on the other hand, orthogonality is preserved by the propagation itself~\cite{Castro_2006}, so there is no need to communicate Kohn-Sham states between processors. This makes real-time TDDFT extremely efficient on massively parallel computers~\cite{Andrade_2012,Schleife_2014}.
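The domain-decomposition idea can be illustrated with a toy sketch (this is not Octopus or ParMETIS code; all names are hypothetical, and the naive stripe partition stands in for the balanced, boundary-minimizing partition that ParMETIS computes): partition the points of an irregularly shaped 2-D grid among processes and identify the boundary points whose values would have to be communicated.

```python
from itertools import product

def disk_grid(radius):
    """Points of a finite grid with a molecule-like (here: circular) shape."""
    return [(x, y) for x, y in product(range(-radius, radius + 1), repeat=2)
            if x * x + y * y <= radius * radius]

def partition_by_stripes(points, nproc):
    """Naive partition: sort points by x and cut into equal chunks.
    This balances the point count per process, but unlike a graph
    partitioner it makes no attempt to minimize the boundary size."""
    pts = sorted(points)
    chunk = -(-len(pts) // nproc)  # ceiling division
    return {p: i // chunk for i, p in enumerate(pts)}

def boundary_points(points, owner):
    """Points whose 4-point neighbourhood crosses a partition boundary;
    these are exactly the values that must be exchanged between
    processes when applying a finite-difference stencil."""
    pset = set(points)
    halo = set()
    for (x, y) in points:
        for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nb in pset and owner[nb] != owner[(x, y)]:
                halo.add((x, y))
                break
    return halo

grid = disk_grid(8)
owner = partition_by_stripes(grid, 4)
halo = boundary_points(grid, owner)
print(len(grid), len(halo))
```

A real partitioner earns its keep precisely by making the `halo` set small relative to the total point count, since communication volume scales with the boundary rather than the volume of each region.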
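The independence of the \(k\)-point/spin level, and its size limitation, can be seen in a minimal sketch (hypothetical helper, not Octopus code):

```python
def distribute_kpoints(num_kpoints, num_procs):
    """Round-robin assignment of k-points to processes.  Operations for
    different k-points are independent, so this level needs no
    communication -- but it can keep at most num_kpoints processes busy."""
    return {k: k % num_procs for k in range(num_kpoints)}

# A closed-shell molecule has a single (Gamma) k-point and one spin
# channel, so this level degenerates to a single worker:
print(distribute_kpoints(1, 8))   # {0: 0}
```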
An operation that needs special care in parallel is the solution of the Poisson equation. Otherwise it constitutes a parallelization bottleneck, as a single Poisson solution is required independently of the number of states in the system. A considerable effort has been devoted to the problem of finding efficient parallel Poisson solvers that can keep up with the rest of the code~\cite{Garc_a_Risue_o_2013}. We have found that the most efficient methods are based on FFTs, which require a different domain decomposition to perform efficiently. This introduces the additional problem of transferring the data between the two different data partitions. In Octopus this is overcome by creating a mapping at the initialization stage and using it during execution to efficiently communicate only the data that is strictly necessary between processes~\cite{Alberdi_2014}.
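The idea of precomputing such a mapping can be sketched as follows (a simplified illustration under assumed 1-D partitions, not the actual Octopus implementation): given the owner of each point in the real-space partition and in the FFT partition, build once, at initialization, the list of points each process pair must exchange, and reuse it for every Poisson solution.

```python
def build_transfer_map(owner_grid, owner_fft):
    """Precompute which points each process must send to which other
    process when redistributing data from the real-space partition to
    the FFT partition.  Points that stay on the same process need no
    communication at all."""
    sends = {}
    for point, src in owner_grid.items():
        dst = owner_fft[point]
        if src != dst:
            sends.setdefault((src, dst), []).append(point)
    return sends

# Hypothetical 1-D example with 8 points and 2 processes:
owner_grid = {p: p // 4 for p in range(8)}  # block partition: 0-3 on rank 0, 4-7 on rank 1
owner_fft  = {p: p % 2 for p in range(8)}   # interleaved partition for the FFT
schedule = build_transfer_map(owner_grid, owner_fft)
print(schedule)   # {(0, 1): [1, 3], (1, 0): [4, 6]}
```

Because the schedule is built once, each redistribution during the run only moves the points listed in it, rather than performing an all-to-all exchange of the full grid.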