## Quasi-random sampling (N-D)

Quasi-random sampling is a technique that attempts to combine the statistical advantages of (pseudo-)random sampling with a fully deterministic algorithm. (Determinism is a desirable property from a reproducibility perspective because it ensures that, in principle, users with the same version of the software will always converge to exactly the same solution.) A quasi-random sequence is a deterministic sequence of "random-like" values that can be generated by an algorithm. What makes quasi-random sequences superior to pseudo-random numbers for sampling is that quasi-random sequences are biased toward "spreading out" and covering more of the domain. For example, the Halton sequence is a quasi-random sequence of numbers generated over a fixed interval by using a prime number as a base \cite{Halton_1964}. For \(N \lt 5\), the \(N\)-dimensional Halton sequence is low-discrepancy, meaning it is very similar to the uniform distribution in terms of domain coverage over \((0,1)^N\). For large \(N\), a scheme for permuting the coefficients must be used to prevent linear correlations between dimensions [shuffled/scrambled Halton ref, Hess and Polak, 2003]. Note that the Halton sequence can also be calculated over arbitrary sub-regions of \((0,1)^N\), making it suitable for refinement schemes. While pycalphad makes heavy use of Halton sequences, the current version does not implement refinement with quasi-random sequences.

The Halton sequence alone does not satisfy our purposes for sampling composition space because it samples the \(N\)-hypercube. We must instead sample the \(N\)-simplex in order to obey the site fraction balance constraint. The most straightforward approach is to normalize each generated point so that it sums to unity, but this tends to oversample the middle of the simplex {>> TODO: ternary oversample figure versus exponential distribution <<}. Another approach is rejection sampling, which throws out all points that fall outside the \(N\)-simplex. The problem is that the fraction of rejected points quickly increases with \(N\), making it very inefficient for multi-component systems. A more robust solution takes advantage of two facts: (a) every \(N+1\) samples from the exponential distribution, when normalized to unity, will be uniformly distributed on the \(N\)-simplex, and (b) if \(X\) is uniformly distributed over \((0,1)\), then \(-\ln(X)\) is exponentially distributed. Because the Halton sequence is low-discrepancy, it can be substituted for the uniform distribution when generating exponentially distributed points. For phases with multiple sublattices, all degrees of freedom can be sampled independently, and then each sublattice's degrees of freedom can be normalized to unity.
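As an illustration of how a Halton sequence can be generated, the sketch below implements the standard radical-inverse construction with one prime base per dimension. It is a minimal example under simple assumptions, not pycalphad's implementation; the names `halton` and `halton_points` are placeholders, and the coefficient permutation needed for large \(N\) is omitted.

```python
import numpy as np

def halton(index, base):
    """Return the `index`-th element (1-indexed) of the Halton sequence
    for a given prime `base`, via the radical-inverse construction."""
    result = 0.0
    f = 1.0 / base
    i = index
    while i > 0:
        result += f * (i % base)  # next digit of `index` in the given base
        i //= base
        f /= base
    return result

def halton_points(n_points, primes=(2, 3, 5)):
    """Generate `n_points` quasi-random points in the unit hypercube,
    using one prime base per dimension."""
    return np.array([[halton(i, b) for b in primes]
                     for i in range(1, n_points + 1)])
```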
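The exponential-distribution trick can likewise be sketched in a few lines. The snippet below builds on the hypothetical `halton_points` helper from the previous sketch; `sample_simplex` and `sample_sublattices` are illustrative names, not pycalphad API.

```python
import numpy as np

def sample_simplex(n_points, n_dims, prime_bases=(2, 3, 5, 7, 11, 13)):
    """Quasi-random samples on the (n_dims - 1)-simplex.

    Relies on two facts: -ln(U) is Exp(1)-distributed for U ~ Uniform(0,1),
    and Exp(1) samples normalized to unit sum are uniform on the simplex.
    Uses the `halton_points` helper sketched above in place of Uniform(0,1).
    """
    cube = halton_points(n_points, prime_bases[:n_dims])  # points in (0,1)^n_dims
    expo = -np.log(cube)                                   # exponentially distributed
    return expo / expo.sum(axis=1, keepdims=True)          # each row sums to unity

def sample_sublattices(n_points, sublattice_sizes):
    """Site fractions for a multi-sublattice phase: each sublattice's
    degrees of freedom are sampled independently and normalized to unity."""
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]  # enough bases for this sketch
    blocks, offset = [], 0
    for size in sublattice_sizes:
        # a distinct slice of prime bases per sublattice keeps dimensions decorrelated
        blocks.append(sample_simplex(n_points, size, primes[offset:offset + size]))
        offset += size
    return np.concatenate(blocks, axis=1)
```

For example, `sample_sublattices(200, [3, 2])` would return a 200-by-5 array in which columns 0 to 2 and columns 3 to 4 each sum to one row-wise, mirroring the site fraction balance constraint applied per sublattice.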