ROUGH DRAFT authorea.com/9594

# probability distribution

The end result of this project will be a series of volumes sampled from an unknown probability distribution. We want to determine what that distibution will be. We will have a functional form for that distribution $$P(V | \theta )$$ where $$\theta$$ are the unknown parameters of the distibution. In the past we’ve used a generalized Gaussian. $$P(V | \theta)$$ is the biased distribution, so it will be the generalized gaussian times $$V$$. This is mostly taken from http://en.wikipedia.org/wiki/Distribution_fitting

There are several ways to determine what the values of $$\theta$$ should be. None of which involve histogramming.

## maximum likelihood method

http://en.wikipedia.org/wiki/Maximum_likelihood

The likelihood is the joint probability distribution of all the observations given the set of parameters $$\theta$$ $L(V_1, ..., V_N | \theta) = \prod_{i=1}^{N} P(V_i | \theta)$ If we maximize this (actually the log of this) we get the maximum likelihood estimate for $$\theta$$.

## regression on the cumulative distribution

This seems fairly ststraightforward. I suspect it reduces to maximum likelihood methods

## moment fitting

We record a number of moments of the observed variables, e.g. $$\langle V \rangle$$, $$\langle V^2 \rangle$$, etc. We then fit analytical moments of $$P(V|\theta)$$ to the observed moments. The advantage of this is that you don’t need to save every data point, just the running averages. The downside is obviously that you’re only fitting a few parameters. This might be good if you have too many observations to realistically record them all.

http://en.wikipedia.org/wiki/Bayesian_inference#Parametric_formulation

With this method you start with some prior distribution over the parameters $$P(\theta)$$ and apply Bayes’ theorem iteratively to update the distribution given the additional knowledge (the fact of the observation). $P(\theta | V_1) = \frac{ P(V_1 | \theta) P(\theta) }{ P(V_1)}$ The denomenator is simply the integral of the numerator over all possible values of $$\theta$$ (kind of like the partition function). After applying all the updates we have $P(\theta | V_1, \dots, V_N ) = \frac{ L(V_1, ..., V_N | \theta) P(\theta) }{ P(V_1, \dots, V_N)}$ where $$L$$ is the likelihood as defined earlier. To find the best parameters $$\theta$$ we then maximize the log probability. $\log P(\theta | V_1, \dots, V_N ) = \log L(V_1, ..., V_N | \theta) + \log P(\theta) + \text{const}$

The advantage of the Bayesian method is that it allows you to naturally incorporate prior information about the distribution of $$\theta$$. At a minimum this allows you to place bounds, e.g. if some parameters must be positive. It also allows you to incorporate prior beliefs. If you think $$\theta_1$$ should be a small parameter you can make $$P(\theta) \sim \exp(-\alpha \theta_1^2)$$

We can also use the principle of maximum entropy to obtain the prior probability distribution. This would be the prior that corresponds to the least assumptions on our part (the least informative prior). I suspect this would result in the flat distribution, but I’m not sure. Bishop §2.4.3 discusses non-informative priors. He says that a non-informative prior should reflect symmetries of the probability distribution. For example, the mean $$\mu$$ of a Gaussian distribution is translationally invariant, so the prior on $$\mu$$ should be as well, so the non-informative prior is the flat distribution. The lenth scale $$L$$ of a Gaussian distribution is scale invariant, so the non-informative prior should be $$\sim 1/L$$.

We might be able to formulate the requirement that $$P(V) \to 0$$ for large $$V$$ in a more flexible way using priors rather than the blunt way of specifing the class functional form of $$P(V | \theta)$$.

Another advantage of the Bayesian method is that it gives you a full probability distribution over $$\theta$$. Which is a lot more information than a simple maximum likelihood estimate. (I’m not sure if the likelihood function can be thought of as a probability distribution. I suspect that the Bayesian probability simply reduces to the Likelihood if the prior $$P(\theta)$$ is flat. But I can’t say for sure.)

### bayesian model comparison

Bishop §3.4 talks about Bayesian model comparison. In our situation, different models would correspond to different functional forms for $$P(V|\theta_i)$$ with $$\theta_i$$ being parameters of model $$i$$ ($$M_i$$). If we define the data $$D = \{ V_1, ..., V_N \}$$ then $p(M_i | D) \propto p(M_i) p(D|M_i)$ where $$p(M_i)$$ is the prior over $$p(M_i)$$ which we will assume is flat. $$p(D|M_i)$$ is known as the model evidence. It is also sometime called the marginal likelihood because it can be viewed as the likelihiood function over the space of models, in which the parameters of the models have been marginalized out (integrated out). The ratio of model evidences for two models is known as the Bayes factor.

Choosing the model with the highest evidence is called model selection. Even better is to combine multiple models (mixture distribution) $p(V| D) = \sum_i p(V|D, M_i) p(M_i | D)$

If the prior ($$p(\theta)$$) is improper (not normalizable) then the evidence is not defined. It may however, still be possible to compare different models by taking the ratio.

## non-parametric formulation

If we want to approximate the probability density without specifying a functional form we can use density estimation http://en.wikipedia.org/wiki/Density_estimation (also see Bishop §2.5), especially kernal density estimation http://en.wikipedia.org/wiki/Kernel_density_estimation. This is a bit like histogramming with smoothing, but you end up with an analytic form for the density at the end. This can easily be done using http://scikit-learn.org/stable/modules/density.html

# Entropy computation

Once we have all of the above tools, we should be able to compute the entropy or total number of basins in several different ways. These correspond to different entropy definitons and analysis techniques.

## APF entropy

Description: This is the granular entropy according to Asenjo (Asenjo 2014)

Definition: $S_\text{APF}^* = \langle F \rangle_\text{biased} + \log(V_\text{acc})$ $S_\text{APF} = S_\text{APF}^* - \log(N!)$

Requirements: Needs the $$\log(V)$$ values as obtained from the basinvolume computation.

## Fit to CDF and Jackknife

Description: JackLogOmega gives the LogOmega entropy (log of number of basins), after unbiasing the distribution with fit to generalised gaussian CDF and numerical integration as in Asenjo14.

Definition: $S^\star = \log \Omega = \log(V_\text{acc}) - \log(\text{unbiased mean volume})$ $S = S^\star - \log(N!)$

Requirements: Needs an analytical (or, at least numerically integrable) expression for the un-biased basin volume distribution. Currently this is obtained by fitting a generalised gaussian to the empirical CDF.

## ML fit

3.) MLLogOmega gives LogOmega from ML estimate for LogOmega, after maximum likelihood fit of generalised gaussian.

## Bayesian

TODO: 4.) BayesianLogOmega uses Bayesian inference to get the parameters of the generalised gaussian.

note (jake): We should compute the entropy by averaging over all possible values of the parameters ($$\theta$$ above) weighted by the bayesian posterior. In other words, we should integrate out the parameters. This is preferable to the maximum likelihood result. $\text{estimate for the entropy} = \int \text{entropy}\, P(\theta | V_1,...,V_N) d \theta$

## Non-parametric

5.) NonParametricLogOmega constructs a non-parametric description of the biased distribution and integrates that with the un-biasing factor to get LogOmega without the assumption of the genealised gaussian. This needs some kernel density estimation or smoothing or similar to get the PDF description. Bandwidth selection is done using Silverman’s rule of thumb as initial guess for integrated squared error cross-validation (Bowman 1984).