Roland Szabo added rbm2.tex  almost 10 years ago

Commit id: b686c661870205cb7085bf7f493e6f227857fc77

deletions | additions      

         

The constraint imposed by RBMs is that neurons must form a bipartite graph, which in practice is done by organizing them into two separate layers, a visible one and a hidden one, and the neurons from each layer have connections to the neurons in the other layer and not to any neuron in the same layer. In the above figure, you can see that there are no connections between any of the h’s, nor any of the v’s, only between every v with every h.   The hidden layer of the RBM can be thought to be made of latent factors that determine the input layer. If, for example, we analyze the grades users give to some movies, the input data will be the grades given by a certain user to the movies, and the hidden layer will correspond to the categories of movies. These categories are not predefined, but the RBM determines them while building its internal model, grouping the movies in such a way that the total energy is minimized. If the input data are pixels, then the hidden layer can be seen as features of objects that could generate those pixels (such as edges, corners, straight lines and other differentiating traits).   If we regard the RBMs as energy based models, we can use the mathematical apparatus used by statistical physics to estimate the probability distributions and then to make predictions. Actually, the Boltzmann distribution from modeling the atoms in a gas gave the name to these neural networks.   The energy of such a model, given the vector v (the input layer), the vector h (the hidden layer), the matrix W (the weights associated with the connections between each neuron from the input layer and the hidden one) and the vectors a and b (which represent the activations thresholds for each neuron, from the input layer and from the hidden layer) can be computed using the following formula:  $ E(v,h) = -\sum_i a_i v_i - \sum_j b_j h_j -\sum_i \sum_j h_j w_{i,j} v_i $  Once we have the energy for a state, its probability is given by:   $ P(v,h) = \frac{1}{Z} e^{-E(v,h)} $  where Z is a normalization factor.   And this is where the constraints from the RBM help us. Because the neurons from the visible layer are not connected to each other, it means that for a given value of the hidden layer neuron, the visible ones are conditionally independent of each other. Using this we can easily get the probability for some input data, given the hidden layer:  $ P(v|h) = \prod_{i=1}^m P(v_i|h) $  where $ P(v|h) $ is the activation probability for a single neuron:  $ P(h_j=1|v) = \sigma \left(b_j + \sum_{i=1}^m w_{i,j} v_i \right) $  $ \sigma = \frac{1}{1+e^{-1}} $ is the logistic function.   In a similar way we can define the probability for the hidden layer, having the visible layer fixed.  $ P(h|v) = \prod_{j=1}^n P(h_j|v) $  $ P(v_i=1|h) = \sigma \left(a_i + \sum_{j=1}^n w_{i,j} h_j \right) $  How does it help us if we know these probabilities?  Let’s presume that we know the correct values for the weights and the thresholds of an RBM and that we want to determine what items are in an image. We set the pixels of the image as the input of the RBM and we calculate the activation probabilities of the hidden layer. We can interpret these probabilities as filters learned by the RBM about the possible objects in the images.   We take the values of those probabilities and we enter them into another RBM as input data. This RBM will also give out some other probabilities for its hidden layer, and these probabilities are also filters for its own inputs. These filters will be of a higher level and more complex. We repeat this a couple of times, we stack the resulting RBMs and, on top of the last one, we add a classification layer (such as logistic regression) and we get ourselves a Deep Belief Network\cite{Hinton_Teh_2006}.   The idea that started the deep learning revolution was this: you can learn layer by layer filters that get more and more complex and at the end you don’t work directly with pixels, but with high level features, that are much better indicators of what objects are there in an image.   The learning of the parameters of a RBM is done using an algorithm called “contrastive divergence”. This starts with an example from the input data, calculates the values for the hidden layer and then these values are used to simulate what input data they would produce. The weights are then adjusted with the difference between the original input data and the “dreamed” input data (with some inner products around there). This process is repeated for each example of the input data, several times, until either the error is small enough or a predetermined number of iterations has passed.