Deep learning

Abstract

In the last couple of years, a new approach called deep learning has been used with great success. Deep learning usually consists of neural networks with at least 3 layers, which are used to build hierarchical representations of data. Some deep learning methods are unsupervised and are used to learn better features, which are then used to solve a machine learning problem. Other deep learning methods are supervised and can be used directly to solve classification problems. Deep learning systems have achieved state-of-the-art results on different problems, such as object recognition in images and speech recognition, tested on standard datasets such as MNIST, CIFAR-10, Pascal VOC and TIMIT.


Deep learning is a category of machine learning algorithms that differ from earlier ones in that they do not learn the desired function directly; they first learn how to process the input data. Until now, to recognize objects in images, for example, researchers developed various feature extractors, the more specific to an object the better. After applying them to an image, they used a classification algorithm such as an SVM or a Random Forest to determine what was in the image. For some objects there are very good feature extractors (for faces, for example), but for other, less common items there are none, and it would take too long for humans to develop them manually for every kind of item. Deep learning algorithms do not require this feature extraction step because they learn to do it themselves.

The deep part of the name comes from the fact that instead of having a single layer that receives the input data and outputs the desired result, we have a series of layers that process data received from the previous layer, extracting higher and higher levels of features. Only the last layer is used to obtain the result, after the data has been transformed and compressed.
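The layer-by-layer processing described above can be illustrated with a minimal NumPy sketch. The layer widths, the tanh nonlinearity and the random, untrained weights are assumptions chosen only for illustration; they are not taken from any particular system discussed in this paper.

```python
import numpy as np

def forward(x, layers):
    """Pass x through each (W, b) layer in turn and keep every intermediate output."""
    activations = [x]
    for W, b in layers:
        # Each layer only sees the output of the layer before it.
        x = np.tanh(x @ W + b)
        activations.append(x)
    return activations

rng = np.random.default_rng(0)
sizes = [784, 256, 64, 10]  # hypothetical widths: input, two hidden layers, output
layers = [
    (rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

batch = rng.random((5, 784))  # five made-up input vectors
activations = forward(batch, layers)
shapes = [a.shape for a in activations]
result = activations[-1]  # only the last layer's output is used as the result
```

Each entry of `activations` is a progressively smaller representation of the same five inputs, which corresponds to the "transformed and compressed" data mentioned above.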

Deep neural networks have been successfully used by various research groups. Dahl et al. from Microsoft presented a paper about using deep neural nets for speech recognition (Dahl). The following year, Le et al. from Google developed a system that, from 10 million YouTube thumbnails, learned by itself to recognize human faces (and 22,000 other categories of objects) (Le).

The aim of this paper is to present some of the main aspects of deep learning that set it apart from traditional machine learning algorithms and that allow it to perform much better in some cases.

The rest of the paper is structured as follows: section \ref{sec:unsupervised} presents the two main unsupervised deep learning approaches, while section \ref{sec:supervised} presents some of the improvements used in the supervised setting.

Unsupervised pretraining


Restricted Boltzmann Machines

Deep learning had its first major success in 2006, when Geoffrey Hinton and Ruslan Salakhutdinov published a paper introducing the first efficient and fast training algorithm for Restricted Boltzmann Machines (RBMs) (Hinton 2006).

As the name suggests, RBMs are a type of Boltzmann machine with some constraints. Boltzmann machines were proposed by Geoffrey Hinton and Terry Sejnowski (Ackley 1985) in 1985, and they were the first neural networks that could learn internal representations (models) of the input data and then use these representations to solve different problems (such as completing images with missing parts). They were not used for a long time because, without any constraints, the learning algorithm for the internal representation was very inefficient.

By definition, Boltzmann machines are generative stochastic recurrent neural networks. The stochastic part means that they have a probabilistic element: the neurons that make up the network do not fire deterministically, but with a certain probability that is determined by their inputs. The generative part means that they learn the joint probability distribution of the input data, which can then be used to generate new data similar to the original.
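The stochastic firing described above can be sketched as follows. The sketch assumes binary units whose firing probability is the logistic sigmoid of their weighted input, which is the usual choice for Boltzmann machines but is not spelled out in the text; the weights and inputs are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_units(inputs, W, b, rng):
    """Return each unit's firing probability and one random binary state per unit."""
    p_on = sigmoid(inputs @ W + b)
    # A unit fires (state 1) with probability p_on, otherwise it stays off (state 0).
    states = (rng.random(p_on.shape) < p_on).astype(float)
    return p_on, states

rng = np.random.default_rng(0)
inputs = np.array([1.0, 0.0, 1.0])      # hypothetical states of three input units
W = rng.normal(0.0, 1.0, size=(3, 4))   # hypothetical weights to four other units
b = np.zeros(4)

p_on, first = sample_units(inputs, W, b, rng)
_, second = sample_units(inputs, W, b, rng)
```

Calling `sample_units` twice with the same inputs can produce different states, because the inputs fix only the probability of firing and not the outcome.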

There is also an alternative way to interpret Boltzmann machines: as energy-based graphical models. This means that we associate a number, called the energy of the model, with each possible input. For the combinations that appear in our data we want this energy to be as low as possible, while for other, unlikely data it should be high.
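To make the energy view concrete, the sketch below enumerates every state of a three-unit network and converts energies into probabilities. It assumes the standard Boltzmann machine energy, E(s) = -sum_{i<j} w_ij s_i s_j - sum_i b_i s_i, with probabilities proportional to exp(-E); this formula is not given in the text above, and the weights and biases are arbitrary illustrative values.

```python
import itertools
import numpy as np

def energy(s, W, b):
    """Energy of binary state s; W is symmetric with a zero diagonal."""
    # The factor 0.5 compensates for s @ W @ s counting every pair of units twice.
    return -0.5 * s @ W @ s - b @ s

# Hypothetical symmetric weights and biases for three binary units.
W = np.array([[0.0, 2.0, -1.0],
              [2.0, 0.0, 0.5],
              [-1.0, 0.5, 0.0]])
b = np.array([0.1, -0.2, 0.0])

# All 2^3 joint states, from (0, 0, 0) to (1, 1, 1).
states = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=3)]
energies = np.array([energy(s, W, b) for s in states])

# Low energy corresponds to high probability.
probs = np.exp(-energies)
probs /= probs.sum()

most_likely = states[int(np.argmax(probs))]
```

With these values the lowest-energy state is also the most probable one. Training adjusts the weights and biases so that the states seen in the data end up with low energy.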