Basin-volume notes

The end result of this project will be a series of volumes sampled from an unknown probability distribution. We want to determine what that distibution will be. We will have a functional form for that distribution \(P(V | \theta )\) where \(\theta\) are the unknown parameters of the distibution. In the past we’ve used a generalized Gaussian. \(P(V | \theta)\) is the biased distribution, so it will be the generalized gaussian times \(V\). This is mostly taken from http://en.wikipedia.org/wiki/Distribution_fitting

There are several ways to determine what the values of \(\theta\) should be. None of which involve histogramming.

http://en.wikipedia.org/wiki/Maximum_likelihood

The likelihood is the joint probability distribution of all the observations given the set of parameters \(\theta\) \[L(V_1, ..., V_N | \theta) = \prod_{i=1}^{N} P(V_i | \theta)\] If we maximize this (actually the log of this) we get the maximum likelihood estimate for \(\theta\).

This seems fairly ststraightforward. I suspect it reduces to maximum likelihood methods

We record a number of moments of the observed variables, e.g. \( \langle V \rangle\), \( \langle V^2 \rangle\), etc. We then fit analytical moments of \(P(V|\theta)\) to the observed moments. The advantage of this is that you don’t need to save every data point, just the running averages. The downside is obviously that you’re only fitting a few parameters. This might be good if you have too many observations to realistically record them all.

http://en.wikipedia.org/wiki/Bayesian_inference#Parametric_formulation

With this method you start with some prior distribution over the parameters \(P(\theta)\) and apply Bayes’ theorem iteratively to update the distribution given the additional knowledge (the fact of the observation). \[P(\theta | V_1) = \frac{ P(V_1 | \theta) P(\theta) }{ P(V_1)}\] The denomenator is simply the integral of the numerator over all possible values of \(\theta\) (kind of like the partition function). After applying all the updates we have \[P(\theta | V_1, \dots, V_N ) = \frac{ L(V_1, ..., V_N | \theta) P(\theta) }{ P(V_1, \dots, V_N)}\] where \(L\) is the likelihood as defined earlier. To find the best parameters \(\theta\) we then maximize the log probability. \[\log P(\theta | V_1, \dots, V_N ) = \log L(V_1, ..., V_N | \theta) + \log P(\theta) + \text{const}\]

The advantage of the Bayesian method is that it allows you to naturally incorporate prior information about the distribution of \(\theta\). At a minimum this allows you to place bounds, e.g. if some parameters must be positive. It also allows you to incorporate prior beliefs. If you think \(\theta_1\) should be a small parameter you can make \(P(\theta) \sim \exp(-\alpha \theta_1^2)\)

We can also use the principle of maximum entropy to obtain the prior probability distribution. This would be the prior that corresponds to the least assumptions on our part (the least informative prior). I suspect this would result in the flat distribution, but I’m not sure. Bishop §2.4.3 discusses non-informative priors. He says that a non-informative prior should reflect symmetries of the probability distribution. For example, the mean \(\mu\) of a Gaussian distribution is translationally invariant, so the prior on \(\mu\) should be as well, so the non-informative prior is the flat distribution. The lenth scale \(L\) of a Gaussian distribution is scale invariant, so the non-informative prior should be \(\sim 1/L\).

We might be able to formulate the requirement that \(P(V) \to 0\) for large \(V\) in a more flexible way using priors rather than the blunt way of specifing the class functional form of \(P(V | \theta)\).

Another advantage of the Bayesian method is that it gives you a full probability distribution over \(\theta\). Which is a lot more information than a simple maximum likelihood estimate. (I’m not sure if the likelihood function can be thought of as a probability distribution. I suspect that the Bayesian probability simply reduces to the Likelihood if the prior \(P(\theta)\) is flat. But I can’t say for sure.)

Bishop §3.4 talks about Bayesian model comparison. In our situation, different models would correspond to different functional forms for \(P(V|\theta_i)\) with \(\theta_i\) being parameters of model \(i\) (\(M_i\)). If we define the data \(D = \{ V_1, ..., V_N \}\) then \[p(M_i | D) \propto p(M_i) p(D|M_i)\] where \(p(M_i)\) is the prior over \(p(M_i)\) which we will assume is flat. \(p(D|M_i)\) is known as the model evidence. It is also sometime called the marginal likelihood because it can be viewed as the likelihiood function over the space of models, in which the parameters of the models have been marginalized out (integrated out). The ratio of model evidences for two models is known as the Bayes factor.

Choosing the model with the highest evidence is called model selection. Even better is to combine multiple models (mixture distribution) \[p(V| D) = \sum_i p(V|D, M_i) p(M_i | D)\]

If the prior (\(p(\theta)\)) is improper (not normalizable) then the evidence is not defined. It may however, still be possible to compare different models by taking the ratio.

If we want to approximate the probability density without specifying a functional form we can use density estimation http://en.wikipedia.org/wiki/Density_estimation (also see Bishop §2.5), especially kernal density estimation http://en.wikipedia.org/wiki/Kernel_density_estimation. This is a bit like histogramming with smoothing, but you end up with an analytic form for the density at the end. This can easily be done using http://scikit-learn.org/stable/modules/density.html

Once we have all of the above tools, we should be able to compute the entropy or total number of basins in several different ways. These correspond to different entropy definitons and analysis techniques.

Description: This is the granular entropy according to Asenjo (Asenjo 2014)

Definition: \[S_\text{APF}^* = \langle F \rangle_\text{biased} + \log(V_\text{acc})\] \[S_\text{APF} = S_\text{APF}^* - \log(N!)\]

Requirements: Needs the \(\log(V)\) values as obtained from the basinvolume computation.

Description: JackLogOmega gives the LogOmega entropy (log of number of basins), after unbiasing the distribution with fit to generalised gaussian CDF and numerical integration as in Asenjo14.

Definition: \[S^\star = \log \Omega = \log(V_\text{acc}) - \log(\text{unbiased mean volume})\] \[S = S^\star - \log(N!)\]

Requirements: Needs an analytical (or, at least numerically integrable) expression for the un-biased basin volume distribution. Currently this is obtained by fitting a generalised gaussian to the empirical CDF.

3.) MLLogOmega gives LogOmega from ML estimate for LogOmega, after maximum likelihood fit of generalised gaussian.

TODO: 4.) BayesianLogOmega uses Bayesian inference to get the parameters of the generalised gaussian.

note (jake): We should compute the entropy by averaging over all possible values of the parameters (\(\theta\) above) weighted by the bayesian posterior. In other words, we should integrate out the parameters. This is preferable to the maximum likelihood result. \[\text{estimate for the entropy} = \int \text{entropy}\, P(\theta | V_1,...,V_N) d \theta\]

5.) NonParametricLogOmega constructs a non-parametric description of the biased distribution and integrates that with the un-biasing factor to get LogOmega without the assumption of the genealised gaussian. This needs some kernel density estimation or smoothing or similar to get the PDF description. Bandwidth selection is done using Silverman’s rule of thumb as initial guess for integrated squared error cross-validation (Bowman 1984).

Early visual inspection of the square displacement, \((x-x_0)^2\), histograms showed a sudden, innatural, truncation of the histogram at some large displacements. The point at which the truncation occurrs grows as the number of iterations increases. This observation suggests that the system has not reached equilibration and, as a matter of fact, it is still diffusing within the basin. This is a serious concern because when measuring the mean square displacement we want to make sure that the number of iterations is sufficiently large to reach the boundaries of the basin of attraction when the walker is not coupled to the origin (\(k=0\)).

Asenjo et al. (Asenjo 2014) found that the weakly coupled replicas have the tendency to get trapped into far and tiny regions of the basin. Swapping these replicas with the more strongly coupled replicas (with larger \(k\)), that oscillate around the origin, then improves equilibration. Yet, the timescales over which we were performing the calculations (1e5-1e6) did not seem to be sufficient to reach the boundaries of the basin. See Fig. (\ref{fig:timeseries}) for an example.

Since we do not know how long it will take to reach the boundary of the basin and we cannot afford to run the calculations for an exceedingly large number of steps, we would like to develop an automated way of adapting the calculation length to reach some target relative statistical error. The statistical error \(\Delta_A\), the root-mean square deviation of the sample mean \(\overline{A}\) from the true expectation value \(\langle A \rangle\), is given by: \[\label{eq:delta1}
\Delta_A^2 = \frac{\text{Var}A}{M} + \frac{1}{M^2}\sum_{i \neq j = 1}^M (\langle A_i A_j\rangle - \langle A \rangle^2),\] where \(M\) is the sample size. Under the assumption of independence \[\langle A_i A_j\rangle = \langle A_i \rangle \langle A_j \rangle = \langle A \rangle^2,\] and the second term in Eq. \ref{eq:delta1} vanishes. Since we are exploring the basin using a MCMC our samples are far from being independent and a more accurate estimate for the statistical error can be derived under the assumption of fast decay as \(|i-j|\rightarrow \infty\). With some algebra (Ambegaokar 2010, Frenkel 2002) it can be shown that the second term on the RHS of Eq. \ref{eq:delta1} is equal to \[\label{eq:delta2}
\frac{1}{M^2}\sum_{i \neq j = 1}^M (\langle A_i A_j\rangle = \langle A \rangle^2) \approx \frac{2}{M}\sum_{t = 1}^{\infty}(\langle A_1 A_{1+t}\rangle- \langle A\rangle^2) \equiv \frac{2}{M} \text{Var}A \tau_A,\] where we have used the definition of the *integrated autocorrelation time* of A \[\tau_A = \frac{\sum_{t=1}^{\infty}(\langle A_1 A_{1+t}\rangle- \langle A\rangle^2)}{\langle A^2\rangle- \langle A\rangle^2}.\] Substituting Eq. (\ref{eq:delta2}) into Eq. (\ref{eq:delta1}) we obtain an error estimate that takes into account the correlation effects: \[\label{eq:delta3}
\Delta_A^2 = \frac{\text{Var}A}{M} (1 + 2 \tau_A)\] where \(\xi_A = 1 + 2 \tau_A\) is the *statistical inefficiency* and \(M/\xi_A < M\) is the effective number of uncorrelated samples. We can compute the integrated autocorrelation time and hence the statistical inefficiency using the `pymbar</`

## Share on Social Media