probability distribution

The end result of this project will be a series of volumes sampled from an unknown probability distribution. We want to determine what that distibution will be. We will have a functional form for that distribution \(P(V | \theta )\) where \(\theta\) are the unknown parameters of the distibution. In the past we’ve used a generalized Gaussian. \(P(V | \theta)\) is the biased distribution, so it will be the generalized gaussian times \(V\). This is mostly taken from http://en.wikipedia.org/wiki/Distribution_fitting

There are several ways to determine what the values of \(\theta\) should be. None of which involve histogramming.

maximum likelihood method

http://en.wikipedia.org/wiki/Maximum_likelihood

The likelihood is the joint probability distribution of all the observations given the set of parameters \(\theta\) \[L(V_1, ..., V_N | \theta) = \prod_{i=1}^{N} P(V_i | \theta)\] If we maximize this (actually the log of this) we get the maximum likelihood estimate for \(\theta\).

regression on the cumulative distribution

This seems fairly ststraightforward. I suspect it reduces to maximum likelihood methods

moment fitting

We record a number of moments of the observed variables, e.g. \( \langle V \rangle\), \( \langle V^2 \rangle\), etc. We then fit analytical moments of \(P(V|\theta)\) to the observed moments. The advantage of this is that you don’t need to save every data point, just the running averages. The downside is obviously that you’re only fitting a few parameters. This might be good if you have too many observations to realistically record them all.

Bayesian updates

http://en.wikipedia.org/wiki/Bayesian_inference#Parametric_formulation

With this method you start with some prior distribution over the parameters \(P(\theta)\) and apply Bayes’ theorem iteratively to update the distribution given the additional knowledge (the fact of the observation). \[P(\theta | V_1) = \frac{ P(V_1 | \theta) P(\theta) }{ P(V_1)}\] The denomenator is simply the integral of the numerator over all possible values of \(\theta\) (kind of like the partition function). After applying all the updates we have \[P(\theta | V_1, \dots, V_N ) = \frac{ L(V_1, ..., V_N | \theta) P(\theta) }{ P(V_1, \dots, V_N)}\] where \(L\) is the likelihood as defined earlier. To find the best parameters \(\theta\) we then maximize the log probability. \[\log P(\theta | V_1, \dots, V_N ) = \log L(V_1, ..., V_N | \theta) + \log P(\theta) + \text{const}\]

The advantage of the Bayesian method is that it allows you to naturally incorporate prior information about the distribution of \(\theta\). At a minimum this allows you to place bounds, e.g. if some parameters must be positive. It also allows you to incorporate prior beliefs. If you think \(\theta_1\) should be a small parameter you can make \(P(\theta) \sim \exp(-\alpha \theta_1^2)\)

We can also use the principle of maximum entropy to obtain the prior probability distribution. This would be the prior that corresponds to the least assumptions on our part (the least informative prior). I suspect this would result in the flat distribution, but I’m not sure. Bishop §2.4.3 discusses non-informative priors. He says that a non-informative prior should reflect symmetries of the probability distribution. For example, the mean \(\mu\) of a Gaussian distribution is translationally invariant, so the prior on \(\mu\) should be as well, so the non-informative prior is the flat distribution. The lenth scale \(L\) of a Gaussian distribution is scale invariant, so the non-informative prior should be \(\sim 1/L\).

We might be able to formulate the requirement that \(P(V) \to 0\) for large \(V\) in a more flexible way using priors rather than the blunt way of specifing the class functional form of \(P(V | \theta)\).

Another advantage of the Bayesian method is that it gives you a full probability distribution over \(\theta\). Which is a lot more information than a simple maximum likelihood estimate. (I’m not sure if the likelihood function can be thought of as a probability distribution. I suspect that the Bayesian probability simply reduces to the Likelihood if the prior \(P(\theta)\) is flat. But I can’t say for sure.)

bayesian model comparison

Bishop §3.4 talks about Bayesian model comparison. In our situation, different models would correspond to different functional forms for \(P(V|\theta_i)\) with \(\theta_i\) being parameters of model \(i\) (\(M_i\)). If we define the data \(D = \{ V_1, ..., V_N \}\) then \[p(M_i | D) \propto p(M_i) p(D|M_i)\] where \(p(M_i)\) is the prior over \(p(M_i)\) which we will assume is flat. \(p(D|M_i)\) is known as the model evidence. It is also sometime called the marginal likelihood because it can be viewed as the likelihiood function over the space of models, in which the parameters of the models have been marginalized out (integrated out). The ratio of model evidences for two models is known as the Bayes factor.

Choosing the model with the highest evidence is called model selection. Even better is to combine multiple models (mixture distribution) \[p(V| D) = \sum_i p(V|D, M_i) p(M_i | D)\]

If the prior (\(p(\theta)\)) is improper (not normalizable) then the evidence is not defined. It may however, still be possible to compare different models by taking the ratio.

non-parametric formulation

If we want to approximate the probability density without specifying a functional form we can use density estimation http://en.wikipedia.org/wiki/Density_estimation (also see Bishop §2.5), especially kernal density estimation http://en.wikipedia.org/wiki/Kernel_density_estimation. This is a bit like histogramming with smoothing, but you end up with an analytic form for the density at the end. This can easily be done using http://scikit-learn.org/stable/modules/density.html