For practice, we’ll first derive the EM algorithm for the Mixture of Gaussians (MoG) model.

The typical Gaussian distribution

\begin{equation} p(x)=\frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}\\ \end{equation}can be rewritten in terms of its natural exponential family form

\begin{equation} p(x)=\frac{1}{\sqrt{2\pi}}e^{\frac{\mu}{\sigma^{2}}x-\frac{1}{2\sigma^{2}}x^{2}-\log\sigma-\frac{\mu^{2}}{2\sigma^{2}}}=\frac{1}{\sqrt{2\pi}}e^{\alpha x+\beta x^{2}+\frac{1}{2}\log(-2\beta)+\frac{\alpha^{2}}{4\beta}}\\ \end{equation}where \(\alpha=\frac{\mu}{\sigma^{2}}\) and \(\beta=-\frac{1}{2\sigma^{2}}\), and inversely \(\mu=-\frac{\alpha}{2\beta}\) and \(\sigma=\sqrt{-\frac{1}{2\beta}}\).
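As a sanity check on this reparameterization, here's a short Python sketch (the function names are my own) that converts between \((\mu,\sigma)\) and \((\alpha,\beta)\) and confirms the two density expressions agree:

```python
import math

def to_natural(mu, sigma):
    # alpha = mu / sigma^2, beta = -1 / (2 sigma^2)
    return mu / sigma**2, -1.0 / (2.0 * sigma**2)

def from_natural(alpha, beta):
    # mu = -alpha / (2 beta), sigma = sqrt(-1 / (2 beta))
    return -alpha / (2.0 * beta), math.sqrt(-1.0 / (2.0 * beta))

def gaussian_pdf(x, mu, sigma):
    # Standard (mu, sigma) form of the density.
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

def natural_pdf(x, alpha, beta):
    # Natural-parameter form: exp(alpha*x + beta*x^2 + 0.5*log(-2*beta) + alpha^2/(4*beta)) / sqrt(2*pi)
    return math.exp(alpha * x + beta * x**2
                    + 0.5 * math.log(-2.0 * beta)
                    + alpha**2 / (4.0 * beta)) / math.sqrt(2.0 * math.pi)
```

Running both forms at the same point should give identical values up to floating-point error.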

One simple model that has more flexibility than a single Gaussian distribution, but is still fairly easy to reason about, is the Mixture of Gaussians model. These models posit that each data point is drawn from one of a set of Gaussian distributions with different parameters. In addition to the parameters of the Gaussians, we now also have \(\pi_{i}\), the probability that a data point is drawn from Gaussian (or class) \(i\). The joint model for a data point, \(x\), and class, \(c\), is
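To make the generative story concrete, here's a minimal Python sketch (function name is hypothetical) of drawing one sample from a MoG: pick a class \(i\) with probability \(\pi_{i}\), then draw from that class's Gaussian, here in the usual \((\mu,\sigma)\) parameterization:

```python
import random

def sample_mog(pis, mus, sigmas, rng):
    # Pick class i with probability pi_i, then draw x ~ N(mu_i, sigma_i^2).
    i = rng.choices(range(len(pis)), weights=pis, k=1)[0]
    return rng.gauss(mus[i], sigmas[i]), i
```

Sampling many points and tallying the class labels recovers the mixing proportions \(\pi_{i}\).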

\begin{equation} p(x,c)=\prod_{i}(\pi_{i}\mathcal{N}(x,\alpha_{i},\beta_{i}))^{\delta_{c,i}}\\ \end{equation}and the likelihood is

\begin{equation} p(x)=\sum_{i}\pi_{i}\mathcal{N}(x,\alpha_{i},\beta_{i}).\\ \end{equation}Expectation maximization (EM) is a fairly straightforward approach to maximum-likelihood learning, and it gives exact updates for MoG. The basic idea is as follows. The log-likelihood for a model with hidden variables, \(h\), and visible variables, \(x\), will be of the form:
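The likelihood above translates directly into code. A minimal sketch in the usual \((\mu,\sigma)\) parameterization (function names are my own):

```python
import math

def gaussian_pdf(x, mu, sigma):
    # N(x; mu, sigma^2) density
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

def mog_likelihood(x, pis, mus, sigmas):
    # p(x) = sum_i pi_i * N(x; mu_i, sigma_i^2)
    return sum(p * gaussian_pdf(x, m, s) for p, m, s in zip(pis, mus, sigmas))
```

With a single component and \(\pi_{1}=1\) this reduces to the plain Gaussian density, as it should.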

\begin{equation} \log p(x)=\log\sum_{h}p(x,h)=\log\sum_{h}p(x,h)\frac{p(h|x)}{p(h|x)}.\\ \end{equation}We can turn this into an expectation and use Jensen’s inequality to write:

\begin{equation} \log\sum_{h}p(x,h)\frac{p(h|x)}{p(h|x)}=\log(\mathbb{E}_{p(h|x)}[\frac{p(x,h)}{p(h|x)}])\geq\mathbb{E}_{p(h|x)}[\log(\frac{p(x,h)}{p(h|x)})].\\ \end{equation}Now, if we treat \(p(h|x)\) as a fixed distribution for one parameter update, i.e. one that does not depend on the parameters being estimated, then the denominator contributes a constant term, \(-\mathbb{E}_{p(h|x)}[\log p(h|x)]\), which we can drop, and the bound simplifies to
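This inequality is easy to verify numerically for a small mixture. The sketch below (toy numbers of my own choosing) checks that the bound is tight when the averaging distribution is the true posterior \(p(h|x)\), and a strict lower bound for any other distribution:

```python
import math

# A toy two-component mixture (assumed parameters) and one observed point.
pis = [0.4, 0.6]
mus = [-1.0, 2.0]
sigmas = [1.0, 0.5]
x = 0.3

def normal(x, mu, sigma):
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

joint = [p * normal(x, m, s) for p, m, s in zip(pis, mus, sigmas)]  # p(x, h=i)
log_px = math.log(sum(joint))                                       # log p(x)

def bound(q):
    # Jensen lower bound: E_q[log(p(x,h) / q(h))]
    return sum(qi * math.log(j / qi) for qi, j in zip(q, joint) if qi > 0)

posterior = [j / sum(joint) for j in joint]     # p(h|x)
assert abs(bound(posterior) - log_px) < 1e-12   # tight at the true posterior
assert bound([0.5, 0.5]) < log_px               # any other q gives a looser bound
```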

\begin{equation} \mathbb{E}_{p(h|x)}[\log(p(x,h))]\\ \end{equation}which will be the objective for learning. It turns out that \(p(h|x)\) is the optimal distribution over which to take this expectation (we could have multiplied and divided by any distribution), but I’m not sure that this detail aids intuitive understanding.
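Putting the pieces together, one EM iteration alternates between computing \(p(h|x)\) (the E-step) and maximizing \(\mathbb{E}_{p(h|x)}[\log p(x,h)]\) over the parameters (the M-step). For MoG the M-step has the standard closed-form updates, which I state here without deriving them; this is a sketch in the usual \((\mu,\sigma)\) parameterization:

```python
import math

def em_step(data, pis, mus, sigmas):
    # One EM iteration for a 1-D Mixture of Gaussians.
    K = len(pis)
    # E-step: responsibilities resp[n][i] = p(h=i | x_n)
    resp = []
    for x in data:
        joint = [pis[i] * math.exp(-(x - mus[i])**2 / (2 * sigmas[i]**2))
                 / math.sqrt(2 * math.pi * sigmas[i]**2) for i in range(K)]
        z = sum(joint)
        resp.append([j / z for j in joint])
    # M-step: closed-form maximizers of E_{p(h|x)}[log p(x, h)]
    Nk = [sum(r[i] for r in resp) for i in range(K)]
    new_pis = [Nk[i] / len(data) for i in range(K)]
    new_mus = [sum(r[i] * x for r, x in zip(resp, data)) / Nk[i] for i in range(K)]
    new_sigmas = [math.sqrt(sum(r[i] * (x - new_mus[i])**2
                                for r, x in zip(resp, data)) / Nk[i])
                  for i in range(K)]
    return new_pis, new_mus, new_sigmas
```

Each such step is guaranteed not to decrease the log-likelihood, which is the defining property of EM.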