Deep learning


In the last couple of years a new approach, called deep learning, has been used with great success. Deep learning usually relies on neural networks with at least three layers, which build hierarchical representations of the data. Some deep learning methods are unsupervised and are used to learn better features that are then used to solve a machine learning problem. Other deep learning methods are supervised and can be applied directly to classification problems. Deep learning systems have obtained state-of-the-art results on problems such as object recognition in images and speech recognition, evaluated on standard datasets such as MNIST, CIFAR-10, Pascal VOC and TIMIT.


Deep learning represents a category of machine learning algorithms that differ from previous ones in that they do not learn the desired function directly, but first learn how to process the input data. Until now, to recognize objects in images, for example, researchers developed various feature extractors, the more specific to an object the better, applied them to an image, and then used a classification algorithm such as an SVM or a Random Forest to determine what was in the image. For some objects there are very good feature extractors (for faces, for example), but for less common items there are none, and it would take humans far too long to manually develop them for every kind of object. Deep learning algorithms do not require this feature extraction step because they learn to do it themselves.

The "deep" part of the name comes from the fact that instead of a single layer that receives the input data and outputs the desired result, we have a series of layers, each processing the data received from the previous layer and extracting higher and higher level features. Only the last layer produces the result, after the data has been transformed and compressed.
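As a rough illustration of this layered processing, the following is a minimal sketch in Python (using NumPy) of a forward pass through a stack of fully connected layers; the sigmoid activation and the layer sizes are illustrative assumptions, not taken from any particular system discussed here.

\begin{verbatim}
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, layers):
    """Pass an input vector through a stack of (W, b) layers.

    Each layer re-represents the output of the previous one;
    only the activation of the final layer is used as the result.
    """
    activation = x
    for W, b in layers:
        activation = sigmoid(W @ activation + b)
    return activation

# Illustrative shapes: 784 inputs (e.g. a 28x28 image), two hidden layers, 10 outputs.
rng = np.random.default_rng(0)
sizes = [784, 256, 64, 10]
layers = [(rng.normal(scale=0.01, size=(n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

output = forward(rng.random(784), layers)
print(output.shape)  # (10,)
\end{verbatim}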

Deep neural networks have been successfully used by various research groups. Dahl et al. from Microsoft presented a paper about using deep neural networks for speech recognition (Dahl). The following year, Le et al. from Google developed a system that, from 10 million YouTube thumbnails, learned by itself to recognize human faces (and 22,000 other categories of objects) (Le).

The aim of this paper is to present some of the main aspects of deep learning that set it apart from traditional machine learning algorithms and that allow it to perform much better in some cases.

The rest of the paper is structured as follows: section \ref{sec:unsupervised} presents the two main unsupervised deep learning approaches, while section \ref{sec:supervised} presents some of the improvements used in the supervised setting.

Unsupervised pretraining


Restricted Boltzmann Machines

Deep learning had its first major success in 2006, when Geoffrey Hinton and Ruslan Salakhutdinov published a paper introducing the first efficient and fast training algorithm for Restricted Boltzmann Machines (RBMs) (Hinton 2006).

As the name suggests, RBMs are a type of Boltzmann machine with some constraints. Boltzmann machines were proposed by Geoffrey Hinton and Terry Sejnowski (Ackley 1985) in 1985 and were the first neural networks that could learn internal representations (models) of the input data and then use these representations to solve different problems (such as completing images with missing parts). They were not used for a long time because, without any constraints, the learning algorithm for the internal representation was very inefficient.

According to the definition, Boltzmann machines are generative stochastic recurrent neural networks. The stochastic part means that they have a probabilistic element: the neurons that make up the network do not fire deterministically, but with a certain probability determined by their inputs. The fact that they are generative means that they learn the joint probability distribution of the input data, which can then be used to generate new data similar to the original.
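As a small illustration of this stochastic behaviour, the sketch below (an assumption-laden toy, not any particular implementation) computes the firing probability of a single binary unit from its weighted inputs with a logistic function and then samples its state from that probability.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def fire(inputs, weights, bias):
    """Stochastically activate one binary unit.

    The weighted input determines a firing probability;
    the actual state (0 or 1) is then sampled from it.
    """
    p = 1.0 / (1.0 + np.exp(-(weights @ inputs + bias)))
    return int(rng.random() < p)

state = fire(inputs=np.array([1, 0, 1]),
             weights=np.array([0.5, -0.2, 0.8]),
             bias=-0.3)
\end{verbatim}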

There is an alternative way to interpret Boltzmann machines: as energy-based graphical models. This means that with each possible configuration we associate a number, called the energy of the model; for the combinations that occur in our data we want this energy to be as low as possible, while for other, unlikely data it should be high.

[Figure: Graphical model of a Restricted Boltzmann Machine]

The constraint imposed by RBMs is that the neurons must form a bipartite graph, which in practice means organizing them into two separate layers, a visible one and a hidden one; the neurons in each layer are connected to the neurons in the other layer, but not to any neuron in the same layer. In the above figure you can see that there are no connections between any of the h's or between any of the v's, only between every v and every h.

The hidden layer of an RBM can be thought of as made up of latent factors that determine the input layer. If, for example, we analyze the ratings users give to movies, the input data will be the ratings given by a certain user and the hidden layer will correspond to categories of movies. These categories are not predefined; the RBM determines them while building its internal model, grouping the movies in such a way that the total energy is minimized. If the input data are pixels, then the hidden layer can be seen as features of the objects that could generate those pixels (such as edges, corners, straight lines and other distinguishing traits).
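A sketch of this interpretation for the movie example, under simplifying assumptions (the weights and biases are assumed to be already trained, and the ratings are reduced to binary liked/not-liked indicators): the probability that each hidden "category" unit turns on for a given user is the standard RBM conditional, a logistic function of the weighted visible input.

\begin{verbatim}
import numpy as np

def hidden_probabilities(v, W, b):
    """P(h_j = 1 | v) for every hidden unit of an RBM.

    v : binary vector of visible units (e.g. movies the user liked)
    W : weight matrix, one row per hidden unit
    b : biases of the hidden units
    """
    return 1.0 / (1.0 + np.exp(-(W @ v + b)))

# Toy example: 6 movies, 2 latent "categories" (weights are made up).
W = np.array([[ 2.0,  1.5,  1.0, -0.5, -1.0, -1.5],   # e.g. "science fiction"
              [-1.0, -0.5, -1.5,  1.0,  1.5,  2.0]])  # e.g. "drama"
b = np.array([-1.0, -1.0])

user = np.array([1, 1, 1, 0, 0, 0])   # liked the first three movies
print(hidden_probabilities(user, W, b))
\end{verbatim}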

If we regard RBMs as energy-based models, we can use the mathematical apparatus of statistical physics to estimate the probability distributions and then make predictions. In fact, these neural networks take their name from the Boltzmann distribution used to model the atoms in a gas.

The energy of such a model, given the vector v (the input layer), the vector h (the hidden layer), the matrix W (the weights associated with the connections between each neuron in the input layer and each neuron in the hidden layer) and the vectors a and b (the biases, or activation thresholds, of the neurons in the input layer and in the hidden layer, respectively), can be computed using the following formula:

\[E(v,h) = -\sum_i a_i v_i - \sum_j b_j h_j -\sum_i \sum_j h_j w_{i,j} v_i\]
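The formula translates directly into code. The sketch below computes the energy for given binary vectors v and h, with W[i, j] denoting the weight between visible unit i and hidden unit j as in the formula; the toy parameter values are made up for illustration.

\begin{verbatim}
import numpy as np

def energy(v, h, a, b, W):
    """E(v,h) = -sum_i a_i v_i - sum_j b_j h_j - sum_{i,j} h_j w_{ij} v_i.

    W[i, j] is the weight between visible unit i and hidden unit j.
    """
    return -(a @ v) - (b @ h) - (v @ W @ h)

v = np.array([1, 0, 1])          # visible layer (3 units)
h = np.array([1, 0])             # hidden layer (2 units)
a = np.array([0.1, 0.2, -0.1])   # visible biases
b = np.array([-0.3, 0.4])        # hidden biases
W = np.array([[ 0.5, -0.2],
              [ 0.1,  0.3],
              [-0.4,  0.6]])     # shape (3 visible, 2 hidden)
print(energy(v, h, a, b, W))
\end{verbatim}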

Once we have the energy for a state, its probability is given by:

\[P(v,h) = \frac{1}{Z} e^{-E(v,h)}\]

where Z is a normalization factor (the partition function), obtained by summing the numerator over all possible configurations of v and h.
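For a toy model, Z can be computed exactly by brute force, summing the unnormalized probabilities over every possible configuration of v and h, as in the sketch below (with made-up parameters); this enumeration grows exponentially with the number of units, so it is only feasible for very small models.

\begin{verbatim}
from itertools import product
import numpy as np

def energy(v, h, a, b, W):
    # Same energy function as above: E(v,h) = -a.v - b.h - v.W.h
    return -(a @ v) - (b @ h) - (v @ W @ h)

def probability(v, h, a, b, W):
    """P(v, h) = exp(-E(v, h)) / Z, with Z summed over all binary configurations."""
    n_visible, n_hidden = len(a), len(b)
    Z = sum(np.exp(-energy(np.array(vv), np.array(hh), a, b, W))
            for vv in product([0, 1], repeat=n_visible)
            for hh in product([0, 1], repeat=n_hidden))
    return np.exp(-energy(v, h, a, b, W)) / Z

# Toy parameters (illustrative): 3 visible units, 2 hidden units.
a = np.array([0.1, 0.2, -0.1])
b = np.array([-0.3, 0.4])
W = np.array([[ 0.5, -0.2],
              [ 0.1,  0.3],
              [-0.4,  0.6]])
p = probability(np.array([1, 0, 1]), np.array([1, 0]), a, b, W)
\end{verbatim}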
