# Mixture of Rayleigh Distributions

## Mixture of Gaussians

For practice, we’ll first derive the EM algorithm for the Mixture of Gaussians (MoG) model.

### Natural Parameterization

The typical Gaussian distribution

$$p(x)=\frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}\\$$

can be rewritten in terms of its natural exponential family form

$$p(x)=\frac{1}{\sqrt{2\pi}}e^{\frac{\mu}{\sigma^{2}}x-\frac{1}{2\sigma^{2}}x^{2}-\log\sigma-\frac{\mu^{2}}{2\sigma^{2}}}=\frac{1}{\sqrt{2\pi}}e^{\alpha x+\beta x^{2}+\frac{1}{2}\log(-2\beta)+\frac{\alpha^{2}}{4\beta}}\\$$

where $$\alpha=\frac{\mu}{\sigma^{2}}$$ and $$\beta=-\frac{1}{2\sigma^{2}}$$ and inversely $$\mu=-\frac{\alpha}{2\beta}$$ and $$\sigma=\sqrt{-\frac{1}{2\beta}}$$.
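As a sanity check, the two parameterizations can be compared numerically. Here is a small sketch (the function names are mine, not from any library) verifying that the natural-parameter form matches the standard density:

```python
import numpy as np

def standard_gaussian(x, mu, sigma):
    """Standard (mu, sigma) form of the Gaussian density."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def natural_gaussian(x, alpha, beta):
    """Natural exponential family form, with alpha = mu/sigma^2, beta = -1/(2 sigma^2)."""
    return np.exp(alpha * x + beta * x ** 2
                  + 0.5 * np.log(-2 * beta)
                  + alpha ** 2 / (4 * beta)) / np.sqrt(2 * np.pi)

mu, sigma = 1.5, 0.8
alpha, beta = mu / sigma ** 2, -1 / (2 * sigma ** 2)

x = np.linspace(-3, 5, 9)
assert np.allclose(standard_gaussian(x, mu, sigma), natural_gaussian(x, alpha, beta))
```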

## Mixture models

One simple class of models that has more flexibility than a single Gaussian distribution, but is still fairly easy to reason about, is the Mixture of Gaussians model. These models posit that each data point is drawn from one of a set of Gaussian distributions with different parameters. Now, in addition to the parameters of the Gaussians, we also have $$\pi_{i}$$, the probability that a data point is drawn from Gaussian (or class) $$i$$. The joint model for a data point, $$x$$, and class, $$c$$, is

$$p(x,c)=\prod_{i}(\pi_{i}\mathcal{N}(x,\alpha_{i},\beta_{i}))^{\delta_{c,i}}\\$$

and the likelihood is

$$p(x)=\sum_{i}\pi_{i}\mathcal{N}(x,\alpha_{i},\beta_{i}).\\$$
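The likelihood above is just a $$\pi$$-weighted sum of component densities, which can be sketched directly. This uses the more readable $$(\mu,\sigma)$$ parameterization, and the component values below are made up for illustration:

```python
import numpy as np

def mog_likelihood(x, pis, mus, sigmas):
    """p(x) = sum_i pi_i * N(x; mu_i, sigma_i), vectorized over points x."""
    comps = np.exp(-(x[:, None] - mus) ** 2 / (2 * sigmas ** 2)) \
            / np.sqrt(2 * np.pi * sigmas ** 2)   # shape (n_points, n_components)
    return comps @ pis

# Made-up two-component mixture.
pis = np.array([0.3, 0.7])
mus = np.array([-1.0, 2.0])
sigmas = np.array([0.5, 1.0])

x = np.array([-1.0, 0.0, 2.0])
print(mog_likelihood(x, pis, mus, sigmas))
```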

## The EM algorithm for MoG

### Introduction to EM

Expectation maximization (EM) is a fairly straightforward idea for doing maximum likelihood learning, and it yields exact, closed-form updates for MoG. The basic idea is as follows. The log-likelihood for a model with hidden variables, $$h$$, and visible variables, $$x$$, will be of the form:

$$\log p(x)=\log\sum_{h}p(x,h)=\log\sum_{h}p(x,h)\frac{p(h|x)}{p(h|x)}.\\$$

We can turn this into an expectation and use Jensen’s inequality to write:

$$\log\sum_{h}p(x,h)\frac{p(h|x)}{p(h|x)}=\log(\mathbb{E}_{p(h|x)}[\frac{p(x,h)}{p(h|x)}])\geq\mathbb{E}_{p(h|x)}[\log(\frac{p(x,h)}{p(h|x)})].\\$$
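This bound can be checked numerically on a toy two-component mixture at a single point $$x$$ (all parameter values below are made up). Taking the expectation under the true posterior makes the bound tight, while an arbitrary distribution gives a strictly lower value:

```python
import numpy as np

# Made-up two-component mixture and a single observation x.
pis = np.array([0.4, 0.6])
mus = np.array([-1.0, 2.0])
sigmas = np.array([1.0, 0.5])
x = 0.3

# p(x, h) for h = 0, 1, then the exact log-likelihood log p(x).
joint = pis * np.exp(-(x - mus) ** 2 / (2 * sigmas ** 2)) \
        / np.sqrt(2 * np.pi * sigmas ** 2)
log_px = np.log(joint.sum())

posterior = joint / joint.sum()          # p(h | x)
arbitrary = np.array([0.5, 0.5])         # some other distribution over h

def elbo(q):
    """E_q[log(p(x, h) / q(h))], the Jensen lower bound."""
    return (q * np.log(joint / q)).sum()

assert np.isclose(elbo(posterior), log_px)   # tight at the posterior
assert elbo(arbitrary) <= log_px             # a lower bound for any q
```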

Now, if we consider $$p(h|x)$$ to be a fixed distribution for one parameter update, i.e. not dependent on the parameters being estimated, the denominator contributes only a constant, so we can drop it and the bound simplifies to

$$\mathbb{E}_{p(h|x)}[\log(p(x,h))]\\$$

which will be the objective for learning. It turns out that $$p(h|x)$$ is the optimal distribution over which to take this expectation (we could have multiplied and divided by any distribution), but I’m not sure that this detail aids intuitive understanding.
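Putting the pieces together for MoG gives the familiar alternation: an E-step that computes the posterior responsibilities $$p(c=i|x_n)$$, and an M-step that maximizes the expected complete-data log-likelihood in closed form. Here is a minimal sketch on synthetic 1-D data, again in the $$(\mu,\sigma)$$ parameterization; the data-generating values and initial guesses are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from two made-up components.
x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 1.0, 300)])

# Made-up initial guesses.
pis = np.array([0.5, 0.5])
mus = np.array([-1.0, 1.0])
sigmas = np.array([1.0, 1.0])

for _ in range(50):
    # E-step: responsibilities r[n, i] = p(c = i | x_n).
    dens = np.exp(-(x[:, None] - mus) ** 2 / (2 * sigmas ** 2)) \
           / np.sqrt(2 * np.pi * sigmas ** 2)
    r = pis * dens
    r /= r.sum(axis=1, keepdims=True)

    # M-step: closed-form maximizers of E_{p(h|x)}[log p(x, h)].
    nk = r.sum(axis=0)
    pis = nk / len(x)
    mus = (r * x[:, None]).sum(axis=0) / nk
    sigmas = np.sqrt((r * (x[:, None] - mus) ** 2).sum(axis=0) / nk)

print(pis, mus, sigmas)
```

With well-separated components like these, the estimates should land near the generating values ($$\pi\approx(0.4, 0.6)$$, $$\mu\approx(-2, 3)$$) after a handful of iterations.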