ROUGH DRAFT authorea.com/9002

# A fast GPU-based approach to identifying atomic vacancies in solid materials

Abstract

The identification of vacancies in a structure plays a crucial role in the characterization of a material, from its structural to its dynamical properties. In this work we introduce a computationally improved vacancy recognition algorithm, based on a previously developed method. The procedure runs on a Graphics Processing Unit (GPU) instead of a Central Processing Unit (CPU), taking advantage of random number generation and of the large number of simultaneous threads available in the GPU architecture. This improves both the spatial mapping of the sample and the speed of the vacancy identification process. The results show that the technique gains efficiency while also reducing the number of required parameters in comparison with the original algorithm. We show that the lattice constant and the network structure suffice as input parameters, and that the two are closely related. A study of those parameters is presented, suggesting how their choice should be addressed. Timing comparisons between the original code and the present work were made using one standard CPU and one GPU, and they show an improvement in execution time.

# Introduction

At present there is no well-established procedure for identifying a vacancy in a structure snapshot provided by computational simulation techniques. It has been shown that the identification of vacancies is fundamental to understanding different material properties, such as: i) the electronic and mechanical behaviour arising from the presence of vacancies in oxide/intermetallic alloy interfaces (Maurice 2004, Badura-Gergen 1997); ii) their migration near the melting temperature (Zhang 2013, Davis 2011), which provides relevant information on the melting process; and iii) the collapse of crystals, where simulations suggest a strong connection with ring-like atomic movement due to vacancies (Delogu 2005, Bai 2008), among others. A previous work (Davis 2011) provides a complete and well-guided procedure for identifying a vacancy in a crystalline or amorphous structure by means of virtual spheres (VS).

This original method (Davis 2011) requires at least three different parameters to be fitted, and for a set of $$N$$ atoms its running time scales as $$\sim\mathcal{O}(N^2)$$. This work proposes a redesign of the algorithm for the GPU architecture, which reduces the number of parameters and improves the speed of the process. The GPU architecture is a scheme based on multi-threading in which each thread may be slower than a regular CPU, so the advantage lies in the number of simultaneous threads that a GPU is able to execute. Although CPUs have been the basis of the highest-performance computers in the last decades, this is clearly changing (Ciżnicki 2012). The change is mainly due to cost: commodity-scale processors in supercomputing clusters are considerably more expensive than a GPU (or a hybrid CPU+GPU system), which is more usable for general-purpose computations than it was 10 years ago. Along with this transition came the need to adapt computer codes to these cache-based systems, because their different memory access scheme would otherwise make vector-machine codes perform poorly. A considerable number of codes are now migrating to GPU or hybrid systems (Danovaro 2014). To do so, the codes usually had to be modified or written from scratch. This is beginning to change with the unified memory scheme that NVIDIA has incorporated in CUDA 6.0, which simplifies the migration of regular CPU code to GPU code (footnote 1).

In this work we use NVIDIA CUDA (Nickolls 2008) version 6.0 to generate the GPU code. Good performance scaling is expected when the calculation can be split into independent sub-tasks, each assigned to a GPU thread. In this particular case the algorithm is parallelizable mainly because different regions of space can be analyzed simultaneously, so vacancies are found faster. Our technique will allow us to reduce the analysis time on the structure to a time scale of $$\sim\mathcal{O}(N^2)$$ ?? .... In addition, we discard one fitting parameter and relate the others in a general way, based entirely on the unit cell of the material under study.

The work is organized as follows: Section \ref{computational-procedure} gives detailed information on the material used to test the algorithm and on the steps of the algorithm; Section \ref{results} presents the results for the different stages of the process: parameter fitting, comparison of results, and speed-up tests; a final discussion of the results and the scope of the present work is given in Section \ref{discussion}.

1. http://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/

# Computational Procedure

\label{computational-procedure} As a test model for the algorithm we use a tungsten BCC structure at different temperatures. The samples were prepared from crystalline structures with a lattice constant of $$a=3.561$$ Å. The chosen temperatures are 300 K, 1000 K, 2000 K, 3000 K, 3500 K and 6000 K. Each sample is simulated for 50000 $$\Delta$$t with $$\Delta$$t = 1 fs. During the first 30000 $$\Delta$$t the temperature was controlled by rescaling velocities; the last 20000 $$\Delta$$t were simulated without any disturbance. The simulations were performed using a standard molecular dynamics code (Davis 2010), and the interatomic potential used for tungsten was the Finnis-Sinclair potential (Ackland 2009). A final structure, as shown in Figure \ref{tungsten-1000K}, was chosen for each temperature and used to generate random vacancies and test the code. We will refer to this structure later on as the initial structure associated with each temperature.
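The temperature control by velocity rescaling is not detailed further here. As a minimal illustration, assuming the simplest scheme (every velocity multiplied by $$\sqrt{T_{\mathrm{target}}/T}$$, equal atomic masses, and $$3N$$ degrees of freedom), one rescaling step could look like the sketch below. The names `kineticTemperature` and `rescaleVelocities` are hypothetical and are not taken from the molecular dynamics code cited above.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

using Vec3 = std::array<double, 3>;

// Instantaneous kinetic temperature, T = m * sum(v^2) / (3 N kB).
// Assumes equal masses and 3N degrees of freedom (hypothetical helper).
double kineticTemperature(const std::vector<Vec3>& velocities, double mass, double kB) {
    assert(!velocities.empty() && mass > 0.0 && kB > 0.0);
    double sumV2 = 0.0;
    for (const Vec3& v : velocities)
        sumV2 += v[0] * v[0] + v[1] * v[1] + v[2] * v[2];
    return mass * sumV2 / (3.0 * static_cast<double>(velocities.size()) * kB);
}

// Multiply every velocity by sqrt(targetT / currentT), so that the kinetic
// temperature equals targetT after the call.
void rescaleVelocities(std::vector<Vec3>& velocities, double mass, double kB, double targetT) {
    const double currentT = kineticTemperature(velocities, mass, kB);
    if (currentT <= 0.0) return;  // nothing to rescale if all atoms are at rest
    const double factor = std::sqrt(targetT / currentT);
    for (Vec3& v : velocities)
        for (double& component : v) component *= factor;
}
```

In a run such as the one described above, a step of this kind would be applied only during the first 30000 $$\Delta$$t and switched off for the remaining steps.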

In this work we present the results of the algorithm using an NVIDIA Quadro K6000 card, with compute capability 3.5 and driver version 6.0. The CUDA version used is 6.0, which supports the new unified memory scheme. This scheme is used in the management of vacancies and atoms, taking advantage of the C++ classes in the algorithm.

## The algorithm

A recent method (Davis 2011) presents a procedure that recognizes atomic vacancies by introducing virtual spheres (VS), which overlap with the atomic neighborhood. Based on this overlap value, the VS is randomly displaced a finite number of times using a fictitious temperature (an annealing minimization method). Finally, the VS may or may not be considered a vacancy, depending on the final overlap value. In this work we keep the VS as the main component of the atomic vacancy search. However, the spatial mapping of the neighborhood of this VS is analyzed using GPU-based techniques. This yields a fast analysis for a considerably large number of VS (and possible vacancies) in space. The technique allows us to avoid additional methods in the procedure, such as annealing minimization adjustments, which bring additional parameters to be fitted. The overlap function used in this work is given by:

$f(r/R_0) = 1 - 0.75 (r/R_0) + 0.0625 (r/R_0)^3,$

where $$R_0$$ corresponds to the VS radius and is a parameter related to the problem. The function $$f(r/R_0)$$ is restricted to $$0 \leq r \leq 2R_0$$, which implies that the overlap contribution of each neighbor atom is between 0 and 1. With this definition of the overlap function in mind, we proceed to describe the algorithm below.
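As a quick check of these bounds, the overlap contribution of a single neighbor can be written as a small function. The sketch below is a CPU-side C++ illustration, not the authors' GPU code. It assumes the normalized overlap volume of two equal spheres, $$1 - 0.75x + 0.0625x^3$$ with $$x = r/R_0$$, which equals 1 at $$r=0$$ and 0 at $$r=2R_0$$. The name `pairOverlap` is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Overlap contribution of one neighbor at distance r from a virtual sphere
// of radius R0. Assumed form: 1 - 0.75 x + 0.0625 x^3 with x = r / R0,
// set to zero outside the range 0 <= r < 2 R0.
double pairOverlap(double r, double R0) {
    assert(R0 > 0.0);
    const double x = r / R0;
    if (x < 0.0 || x >= 2.0) return 0.0;
    return 1.0 - 0.75 * x + 0.0625 * x * x * x;
}
```

With this form, a neighbor sitting exactly at the sphere center contributes 1, a neighbor at $$r = R_0$$ contributes 0.3125, and any neighbor at $$r \geq 2R_0$$ contributes nothing.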

As a first step, the algorithm determines the overlap of each atom in the structure with its neighborhood, defined by $$r<2R_0$$, using a single GPU thread for each atom. The atoms are then sorted from lower to higher overlap so that they can be analyzed in this order. For each atom, we build a three-dimensional cube of side $$2R_0$$ around it. A large number of random points is generated homogeneously inside that cube, using the function curandGenerateUniformDouble of the CUDA CURAND library (footnote 1). Increasing the number of random points improves the precision with which the vacancy is located, but also reduces the performance of the search process, as we will discuss later.
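This first stage can be illustrated with a serial C++ sketch, in which each outer loop iteration stands in for the work of one GPU thread. It is not the authors' implementation. It assumes no periodic boundary conditions, excludes an atom's own contribution from its overlap, uses the C++ `<random>` library in place of `curandGenerateUniformDouble`, and assumes the sphere-sphere form $$1 - 0.75x + 0.0625x^3$$ of the overlap function. All function names are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

using Vec3 = std::array<double, 3>;

// Assumed pair overlap: 1 - 0.75 x + 0.0625 x^3 for x = r / R0 in [0, 2), else 0.
double pairOverlap(double r, double R0) {
    assert(R0 > 0.0);
    const double x = r / R0;
    return (x < 2.0) ? 1.0 - 0.75 * x + 0.0625 * x * x * x : 0.0;
}

// Euclidean distance between two points (no periodic images).
double dist(const Vec3& a, const Vec3& b) {
    const double dx = a[0] - b[0], dy = a[1] - b[1], dz = a[2] - b[2];
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Overlap of each atom with its neighborhood (r < 2 R0), skipping the atom itself.
std::vector<double> atomOverlaps(const std::vector<Vec3>& atoms, double R0) {
    std::vector<double> total(atoms.size(), 0.0);
    for (std::size_t i = 0; i < atoms.size(); ++i)  // one GPU thread per atom
        for (std::size_t j = 0; j < atoms.size(); ++j)
            if (i != j) total[i] += pairOverlap(dist(atoms[i], atoms[j]), R0);
    return total;
}

// Atom indices ordered from lowest to highest overlap.
std::vector<std::size_t> sortByOverlap(const std::vector<double>& overlaps) {
    std::vector<std::size_t> order(overlaps.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return overlaps[a] < overlaps[b]; });
    return order;
}

// n uniformly distributed points inside a cube of side 2 R0 centered on `center`.
std::vector<Vec3> randomPointsInCube(const Vec3& center, double R0, std::size_t n,
                                     std::mt19937_64& rng) {
    std::uniform_real_distribution<double> offset(-R0, R0);
    std::vector<Vec3> points(n);
    for (Vec3& p : points)
        for (std::size_t k = 0; k < 3; ++k) p[k] = center[k] + offset(rng);
    return points;
}
```

The double loop in `atomOverlaps` is the part that scales as $$N^2$$ on a single core. On the GPU the outer loop disappears, because each atom is handled by its own thread.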

Once the overlap has been determined for each random point (one point per GPU thread), we search for the minimum overlap. Based on a threshold value $$f_{ov}$$, the point of minimum overlap may or may not be designated as a vacancy. Once the point is recognized as a vacancy, a single atom is placed at that position to prevent several vacancies from overlapping. The overlap value of every atom in the structure is updated once a vacancy is found.
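The search and decision step can be sketched in serial C++, with one loop iteration per random point standing in for one GPU thread. This is an illustration under stated assumptions rather than the authors' code: it assumes the sphere-sphere form $$1 - 0.75x + 0.0625x^3$$ of the overlap function and no periodic boundaries, and it omits the update of the per-atom overlaps after a vacancy is found. `Candidate`, `overlapAt` and `findCandidate` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Vec3 = std::array<double, 3>;

struct Candidate {
    Vec3 position;   // random point with the lowest overlap
    double overlap;  // its total overlap with the surrounding atoms
    bool isVacancy;  // true when overlap < fOv
};

// Assumed pair overlap: 1 - 0.75 x + 0.0625 x^3 for x = r / R0 in [0, 2), else 0.
double pairOverlap(double r, double R0) {
    assert(R0 > 0.0);
    const double x = r / R0;
    return (x < 2.0) ? 1.0 - 0.75 * x + 0.0625 * x * x * x : 0.0;
}

// Total overlap between a virtual sphere centered at p and all atoms within 2 R0.
double overlapAt(const Vec3& p, const std::vector<Vec3>& atoms, double R0) {
    double total = 0.0;
    for (const Vec3& a : atoms) {
        const double dx = p[0] - a[0], dy = p[1] - a[1], dz = p[2] - a[2];
        total += pairOverlap(std::sqrt(dx * dx + dy * dy + dz * dz), R0);
    }
    return total;
}

// Evaluate every random point, keep the one with the minimum overlap,
// and flag it as a vacancy if that minimum lies below the threshold fOv.
Candidate findCandidate(const std::vector<Vec3>& points, const std::vector<Vec3>& atoms,
                        double R0, double fOv) {
    assert(!points.empty());
    Candidate best{points.front(), std::numeric_limits<double>::infinity(), false};
    for (const Vec3& p : points) {  // one GPU thread per random point
        const double value = overlapAt(p, atoms, R0);
        if (value < best.overlap) {
            best.position = p;
            best.overlap = value;
        }
    }
    best.isVacancy = best.overlap < fOv;
    return best;
}
```

When `isVacancy` is true, the caller would append `position` to the atom list, so that later searches treat the site as occupied and do not report a second vacancy overlapping the first.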

Finally, whether a VS is recognized as a vacancy depends directly on the choice of $$f_{ov}$$ as well as on the value of $$R_0$$. This removes the need for a fictitious temperature or any other additional parameter, as required in the original algorithm (Davis 2011). The new technique is summarized as:

1. Sort original structure atoms by their overlap.

2. Build a cubic structure around an atom and generate uniformly distributed random points inside.

3. Using a large number of GPU threads, evaluate the overlap for each random point.

4. Search for the minimum overlap value in the set of random points.

5. Decide whether the point is a vacancy using the condition overlap < $$f_{ov}$$.

In what follows, we present guidelines that help to determine the parameters needed to search for (and find) vacancies in a solid material.

1. https://www.clear.rice.edu/comp422/resources/cuda/html/curand/index.html