authorea.com/110223

Unsupervised learning: Clustering and density estimation

When it comes to determining and explaining information within a large, complicated, or multi-dimensional dataset, it can often be difficult to see patterns and relationships. In the case of supervised learning, where there is input data and output data, it is possible to artificially construct and optimize a model that can, with time and several iterations, predict and improve performance through its own experience by splitting the input data into training and testing sets. Naturally, supervised learning can yield a lot of information and results; however, in the case of unsupervised learning, or when output data is not available, many other methods exist to try and capture the natural structure of the data and make useful observations. This report will attempt to use unsupervised methods to try and infer further information about the cars dataset that was used in the previous exercise.

Despite having a K-means algorithm as one of several tools to perform unsupervised clustering, it was decided that a Gaussian Mixture Model was more appropriate for the cars dataset due to its ability to not only model the shape and size of the clusters, but also because of the potential for optimizing the number of components via cross-validation.

The method behind the GMM is essentially an Expectation Maximization algorithm. In other words, after performing a cross-validation to determine how many clusters there should be in order to minimize errors, the probability for every point belonging to each possible distribution is computed, followed by an estimation of the maximum likelihood parameters from every probability distribution. Over several repetitions, these parameters do not change and become constant, hopefully showing the correct clustering of the parameters within the dataset.

The expectation step can be summarized by the following equation,

\[p(z_n = k|x_n) = \frac{w_k\mathcal{N}(x|\mu_{(k)},\Sigma _{(k)})}{\Sigma^K_{K=1} w_k\mathcal{N}(x|\mu_{(k)},\Sigma _{(k)})}\] while the maximization step can be described by,

\[\Sigma_{(k)} = \frac{1}{N_k}\Sigma^N_{n=1}(x_n - \mu_{(k)})(x_n - \mu_{(k)})^T p(z_n = k|x_n)\] where,

\[\begin{aligned}
N_k &=& \Sigma^N_{n=1} p(z_n = k|x_n),\\
\mu_{(k)} &=& \frac{1}{N_k}\Sigma^N_{n=1} x_np(z_n = k|x_n),\\
w_k &=& \frac{N_k}{N}.\\\end{aligned}\]

Apart from cross-validation, the optimal number of clusters is sometimes derived by Bayesian Information Criteria (BIC) or by Akaike’s Information Criteria (AIC), which penalizes the model complexity through the relations,

\[\begin{aligned}
BIC &=& -2 \log L + p \log(N),\\
AIC &=& -2 \log L + 2p,\\\end{aligned}\] where \(p\) is the number of free parameters in the model. The lowest BIC and AIC values can both be used to find a trade off between modeling the data well and penalizing the complexity of the model. The BIC and AIC, as well as the cross validation results, can be seen in the following figure.

## Share on Social Media