authorea.com/20581

PyMKS Package Paper

- Current practices for developing Materials Data Analytics Tool-Sets are highly localized within a few individual groups resulting in major inefficiency (unnecessary duplication of codes, inadequate verification and validation of multiple instantiations of code, not engaging the right talent for the right task, etc.)
- Community development and sharing of code repositories has been successful in certain science communities. Advantages of this approach include increased e-teaming and e-collaborations, vastly improved code hygiene, promotion of open science, rapid verification and validation, and a dramatic increase in overall productivity.
- Community development of a materials data analytics tool-set can significantly change the landscape of the emerging cross-disciplinary field of Materials Informatics and address the critical needs outlined in MGI and ICME. This is probably the only practical way to get materials scientists and computer scientists to establish meaningful, mutually beneficial, and highly productive collaborations. Current efforts in materials science are:
- PyMatgen http://pymatgen.org/
- Other projects http://materials-informatics-lab.github.io/material-hammers/

- PyMKS aims to seed and nurture an emergent user group in materials data analytics for establishing homogenization and localization linkages by leveraging open source scientific and machine learning packages in Python. The approach used to develop PyMKS and several examples are presented. This paper is a call to others interested in participating in this open science activity.

This section outlines our approach to development and how we hope to engage the wider community in its use. None of this should be specific to MKS.

- Use abstractions from other libraries, don't invent our own
- Use open source dependencies
- Have a permissive licence
- Have interactive online notebooks (**)
- Use VMs with the full stack (**)
- Test suite integrated with the documentation and the examples (**)
- Use Python due to its large scientific code base, high-level syntax, and wide adoption in machine learning
- Use continuous integration tool (**)
- integration of docs and tests
- Work with MGI practitioners' code bases (I'm doing this with Shengyen) and integrate parts into MKS (**)
- (**) Needs to be addressed before publication


Very short section on theory primarily referencing other papers.

- Digital Microstructure function
- Basis selection for local states and FFT for space
- Homogenization (2-Point Statistics Equation)
- Workflow: Spatial Correlations -> Dimensionality Reduction -> Regression

- Localization (MKS equation) with consistent notation to the online documentation
- Assumptions/Boundary Conditions
- Explain how they go together and, in theory, improve materials design


The first step in all of the MKS work-flows is to discretize the microstructures. To do this, we introduce a probabilistic description of the microstructure using the continuous local state variable $h$, the local state space $H$ and the microstructure function $m(h, x)$. The local state space $H$ can be thought of as all of the thermodynamic state variables that are needed to uniquely define the material at a given location. The local state variable $h$ is one instance of the local state space, or one configuration of state variables. The microstructure function $m(h, x)$ is the probability density of finding the local state $h$ at location $x$. For instance, let $\mu(x)$ be a microstructure that we plan to discretize; then $\mu(x)$ is the expected value of the local state $h$ under the microstructure function.

$$ \mu(x) = \int_H h m(h, x) dh $$

Now we discretize the microstructure in space by averaging the microstructure function over small cubic domains. The local state can be discretized in two ways. One is to bin the local state using the primitive (or hat) basis $\Lambda$

$$ \frac{1}{\Delta x} \int_{H} \int_{s} \Lambda(h - l) m(h, x) dx dh = m[l, s] $$

The other is to use a spectral representation with an orthogonal basis function $\xi$

$$ \frac{1}{\Delta x} \int_{s} m(h, x) dx = \sum_{l=0}^{L-1} m[l, s] \xi_l (h) $$

In the notation above, round brackets indicate continuous variables and square brackets indicate discrete variables. The variables $s$ and $S$ represent a discrete position (a spatial bin) and the total number of spatial bins, while $l$ and $L$ represent the discrete versions of $h$ and $H$. In PyMKS, the Legendre polynomials are currently the only orthogonal basis functions available.
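As a rough illustration of the primitive basis, the sketch below discretizes a toy continuous field with hat functions in NumPy. It is not the PyMKS implementation: the function name `primitive_basis`, the `[0, 1]` domain and the random field are assumptions made for this example, and the array is stored as `m[s, l]` (space first, local state last). The discrete analogue of the expectation, $\sum_l h_l\, m[s, l]$, recovers the original field.

```python
import numpy as np


def primitive_basis(mu, n_states, domain=(0.0, 1.0)):
    """Discretize a continuous field mu(x) with hat functions (hypothetical helper).

    Returns m with shape mu.shape + (n_states,), i.e. m[s, l], and the
    local-state values h_l on which the hats are centered.
    """
    lo, hi = domain
    # Evenly spaced local-state values h_l spanning the domain.
    h = np.linspace(lo, hi, n_states)
    dh = h[1] - h[0]
    # Hat centered on each h_l: 1 at h_l, falling linearly to 0 at its neighbors.
    m = np.maximum(0.0, 1.0 - np.abs(mu[..., None] - h) / dh)
    return m, h


rng = np.random.default_rng(0)
mu = rng.random((4, 4))              # toy microstructure with values in [0, 1]
m, h = primitive_basis(mu, n_states=5)

# Each spatial bin carries a probability distribution over the local states,
# and its expectation reproduces the continuous value.
mu_recovered = m @ h
```

With this basis each bin has at most two nonzero local states, so the representation stays sparse as `n_states` grows.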

n-point spatial correlations provide a way to rigorously quantify material structure using statistics. As an introduction to n-point spatial correlations, let's first discuss 1-point statistics. 1-point statistics are the probability that a specified local state will be found in any randomly selected spatial bin in a microstructure [1][2][3]. In other words, 1-point statistics are the volume fractions of the local states in the microstructure. 1-point statistics are computed as

$$ f[l] = \frac{1}{S} \sum_s m[s,l] $$

In this equation, $f[l]$ is the probability of finding the local state $l$ in any randomly selected spatial bin in the microstructure, $m[s, l]$ is the microstructure function (the digital representation of the microstructure), $S$ is the total number of spatial bins in the microstructure and $s$ refers to a specific spatial bin.

While 1-point statistics provide information on the relative amounts of the different local states, they do not provide any information about how those local states are spatially arranged in the microstructure. Therefore, 1-point statistics are a limited set of metrics to describe the structure of materials.
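A minimal NumPy sketch of the 1-point statistics for a small two-phase microstructure follows. The label array is invented for illustration, and the indicator microstructure function is stored as `m[s, l]`.

```python
import numpy as np

# Toy 2-phase microstructure on a 3 x 4 grid; entries are local-state labels.
labels = np.array([[0, 0, 1, 1],
                   [0, 1, 1, 1],
                   [0, 0, 0, 1]])
n_states = 2

# Indicator microstructure function m[s, l]: 1 where bin s holds state l.
m = np.eye(n_states)[labels]          # shape (3, 4, 2)

# f[l] = (1/S) * sum_s m[s, l], averaging over all spatial axes.
f = m.mean(axis=(0, 1))
```

Here each phase occupies 6 of the 12 bins, so both volume fractions equal 0.5, and any rearrangement of the same labels gives the same `f`.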

2-point spatial correlations (also known as 2-point statistics) contain information about the fractions of local states as well as the first order information on how the different local states are distributed in the microstructure.

2-point statistics can be thought of as the probability that a vector placed randomly in the microstructure has one end on one specified local state and the other end on another specified local state. This vector can have any length or orientation that the discrete microstructure allows. The equation for 2-point statistics is given below.

$$ f[r \vert l, l'] = \frac{1}{S} \sum_s m[s, l] m[s + r, l'] $$

In this equation, $f[r \vert l, l']$ is the conditional probability of finding the local states $l$ and $l'$ separated by the vector $r$, which defines both a distance and an orientation. All other variables are the same as those in the 1-point statistics equation. If we have an eigen microstructure function (it only contains values of 0 or 1) and we are using an indicator basis, the $r=0$ vector with $l = l'$ recovers the 1-point statistics.

When the 2 local states are the same, $l = l'$, the correlation is referred to as an autocorrelation. If the 2 local states are not the same, it is referred to as a cross-correlation.
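The 2-point statistics equation can be evaluated for all vectors $r$ at once with FFTs if periodic wrap-around of $s + r$ is assumed. The sketch below is one way to do this in NumPy, not the PyMKS code itself; the helper name `two_point_stats` and the random microstructure are assumptions for illustration. It also evaluates the sum directly for a single vector as a check.

```python
import numpy as np


def two_point_stats(m_l, m_lp):
    """f[r | l, l'] = (1/S) sum_s m[s, l] m[s + r, l'] with periodic wrap-around."""
    S = m_l.size
    F_l = np.fft.fftn(m_l)
    F_lp = np.fft.fftn(m_lp)
    # Correlation theorem: the correlation is the inverse FFT of conj(F_l) * F_lp.
    return np.fft.ifftn(np.conj(F_l) * F_lp).real / S


rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=(5, 5))
m = np.eye(2)[labels]                            # m[s, l], indicator basis

auto = two_point_stats(m[..., 0], m[..., 0])     # l = l' = 0
cross = two_point_stats(m[..., 0], m[..., 1])    # l = 0, l' = 1

# Direct evaluation of the sum for one vector r, as a check on the FFT route.
r = (1, 2)
shifted = np.roll(m[..., 1], shift=(-r[0], -r[1]), axis=(0, 1))  # m[s + r, l']
direct = (m[..., 0] * shifted).mean()
```

For this indicator basis, `auto[0, 0]` equals the volume fraction of state 0 and `cross[0, 0]` is zero, because a bin cannot hold two states at once.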

Higher order spatial statistics are similar to 2-point statistics, in that they can be thought of in terms of conditional probabilities of finding specified local states separated by a prescribed set of vectors. 3-point statistics are the probability of finding three specified local states at the corners of a triangle (defined by 2 vectors) placed randomly in the material structure. 4-point statistics describe the probability of finding 4 local states at 4 locations (defined using 3 vectors), and so on.

Although higher order statistics capture more detail of the material structure, 2-point statistics can be computed much faster and still provide information about how the local states are distributed. For this reason, only 2-point statistics are implemented in PyMKS.

Homogenization can be used to determine effective or homogenized properties of a material, and provides a way to pass information between length scales from the bottom up. Below is some technical information on the MKS Homogenization work flow.

The first step in MKS homogenization is to compute the 2-point statistics for each of the microstructures in the calibration dataset. For more information about 2-point statistics see the section above.

Once we have computed the 2-point statistics for every microstructure in our calibration dataset, we need to determine which of these microstructure features are most important and need to be passed to the higher length scale. We can do this using dimensionality reduction techniques from machine learning.

Principal component analysis (PCA) is the most common dimensionality reduction technique used in the MKS Homogenization work flow, and provides an efficient way to find a small number of microstructure descriptors that capture most of the variance in the microstructures.

Once we have the low dimensional microstructure descriptors, we can use regression methods to map an effective property into the low dimensional space.

Multivariate Polynomial Regression has been the most common regression technique used to connect the low dimensional microstructure descriptors to an effective property.

An effective property for a new microstructure is predicted by computing its 2-point statistics, projecting them into the low dimensional space, and evaluating the regression model.
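The three steps above (2-point statistics, dimensionality reduction, regression) can be sketched end to end with NumPy alone. Everything specific here is an assumption for illustration: the microstructures are random, the "effective property" is a made-up function of volume fraction rather than a physical model, PCA is done with a plain SVD, and the regression is an ordinary least-squares fit on degree-2 polynomial terms. In practice a machine learning library would typically supply the PCA and regression steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic calibration set: random 2-phase microstructures of varying volume fraction.
n_samples, shape = 40, (15, 15)
target_vf = np.linspace(0.2, 0.8, n_samples)
X = (rng.random((n_samples,) + shape) < target_vf[:, None, None]).astype(float)
vf = X.mean(axis=(1, 2))
# Made-up "effective property" depending only on volume fraction (not physical).
y = 1.0 + 2.0 * vf + vf ** 2


def autocorrelation(x):
    """Periodic 2-point autocorrelation of one phase, flattened to a feature vector."""
    F = np.fft.fftn(x)
    return (np.fft.ifftn(np.conj(F) * F).real / x.size).ravel()


def poly_features(z):
    """Degree-2 polynomial terms of two PC scores: 1, z1, z2, z1^2, z1*z2, z2^2."""
    z1, z2 = z[:, 0], z[:, 1]
    return np.column_stack([np.ones_like(z1), z1, z2, z1 ** 2, z1 * z2, z2 ** 2])


# Step 1: 2-point statistics for every calibration microstructure.
stats = np.array([autocorrelation(x) for x in X])

# Step 2: PCA via SVD of the mean-centered statistics; keep 2 components.
mean = stats.mean(axis=0)
_, _, Vt = np.linalg.svd(stats - mean, full_matrices=False)
components = Vt[:2]
scores = (stats - mean) @ components.T

# Step 3: least-squares polynomial regression from PC scores to the property.
coef, *_ = np.linalg.lstsq(poly_features(scores), y, rcond=None)
y_fit = poly_features(scores) @ coef


def predict(x_new):
    """Project a new microstructure into PC space and evaluate the regression."""
    z = (autocorrelation(x_new) - mean) @ components.T
    return (poly_features(z[None, :]) @ coef).item()
```

Calling `predict` on a microstructure outside the calibration set follows the same path: statistics, projection with the stored `mean` and `components`, then the fitted polynomial.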

Localization can be used to determine how an applied boundary condition is locally distributed within a microstructure, and provides a method to pass information between length scales from the top down. Below is some technical information on the MKS Localization work flow.

Once the state space is discretized, the relationship between the response field $p$ and microstructure function $m$ can be written as,

$$ p\left[s\right] = \sum_{r=0}^{S-1} \sum_{l=0}^{L-1} \alpha\left[l, r\right] m \left[l, s - r\right] + \ldots $$

where the $\alpha$ are known as the influence coefficients and describe the relationship between $p$ and $m$. The localization requires periodic boundary conditions.
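Because the boundary conditions are periodic, the first-order term of the localization equation is a circular convolution and can be evaluated with FFTs. The sketch below shows this on a 1-D grid and checks it against the explicit double sum. It is illustrative only: the helper name `localize` is hypothetical, the influence coefficients are random placeholders rather than calibrated values, and the higher-order terms are dropped.

```python
import numpy as np


def localize(alpha, m):
    """p[s] = sum_r sum_l alpha[l, r] m[l, s - r] on a periodic 1-D grid.

    alpha and m both have shape (L, S): local state first, space second.
    """
    # Convolution theorem: circular convolution in s becomes a product in Fourier space.
    P = (np.fft.fft(alpha, axis=1) * np.fft.fft(m, axis=1)).sum(axis=0)
    return np.fft.ifft(P).real


rng = np.random.default_rng(0)
L, S = 2, 8
labels = rng.integers(0, L, size=S)
m = np.eye(L)[labels].T                  # m[l, s], indicator basis
alpha = rng.normal(size=(L, S))          # placeholder coefficients, not calibrated

p_fft = localize(alpha, m)

# Direct evaluation of the double sum, with s - r wrapped periodically.
p_direct = np.array([
    sum(alpha[l, r] * m[l, (s - r) % S] for r in range(S) for l in range(L))
    for s in range(S)
])
```

The FFT route costs on the order of $S \log S$ operations per local state, compared with $S^2$ for the direct sum.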
