The first function will do the encoding of the input:
$$ h = f(x) = s_f(Wx + b_h) $$
where $s_f$ is the nonlinear activation function used by the hidden layer, $W$ represents the weights of the connections between the visible layer and the hidden one, and $b_h$ is the bias of the hidden layer. The second function will decode the data:
$$ y = g(h) = s_g(W'h + b_y) $$
where the symbols have analogous meanings, but this time refer to the connections between the hidden layer and the output layer.
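To make the two functions concrete, here is a minimal NumPy sketch. It assumes a sigmoid for both $s_f$ and $s_g$ and tied weights ($W' = W^T$), a common but not mandatory choice; the layer sizes and initialisation are illustrative only:

\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W   = rng.normal(scale=0.1, size=(16, 64))  # visible -> hidden weights
b_h = np.zeros(16)                          # hidden layer bias
b_y = np.zeros(64)                          # output layer bias

def encode(x):
    # h = f(x) = s_f(W x + b_h)
    return sigmoid(W @ x + b_h)

def decode(h):
    # y = g(h) = s_g(W' h + b_y), here with W' = W.T
    return sigmoid(W.T @ h + b_y)
\end{verbatim}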

To quantify the error we make with regard to the identity function, we can use the squared L2 norm of the difference:
$$ L(x, y) = \|x - y\|^2 $$
The parameters of the autoencoder are chosen so as to minimise this value over the training data.
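On the reconstruction from the sketch above, this loss is just the squared Euclidean distance between an input and its reconstruction:

\begin{verbatim}
def l2_loss(x, y):
    # L(x, y) = ||x - y||^2
    return np.sum((x - y) ** 2)

x = rng.random(64)                    # a toy input vector
loss = l2_loss(x, decode(encode(x)))
\end{verbatim}

During training this quantity is averaged over the dataset and minimised with a gradient-based optimiser.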

Sparse autoencoders impose the constraint that each neuron should be activated as rarely as possible. The hidden layer neurons are activated when the features they represent are present in the input data, so if each neuron is activated rarely, each one learns a distinct feature. In practice this is done by adding a penalty term based on the average activation of each neuron over the training data:
$$ \beta \sum_j \left( \rho \log \frac{\rho}{\rho_j} + (1-\rho) \log \frac{1-\rho}{1-\rho_j} \right) $$
where $\beta$ controls the strength of the penalty, $\rho_j$ is the average activation of neuron $j$, and $\rho$ is the sparsity parameter, which represents how often we want each neuron to be activated. Usually it has a value below 0.1.
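As an illustration, the penalty can be computed from a matrix of hidden activations over the training set. This is a minimal sketch that reuses the encoder from above; the values of $\rho$ and $\beta$ are arbitrary choices for demonstration, not values prescribed here:

\begin{verbatim}
def sparsity_penalty(H, rho=0.05, beta=3.0):
    # H holds one row of hidden activations per training example;
    # rho_j is the average activation of hidden neuron j.
    rho_j = H.mean(axis=0)
    kl = (rho * np.log(rho / rho_j)
          + (1 - rho) * np.log((1 - rho) / (1 - rho_j)))
    return beta * np.sum(kl)

X = rng.random((100, 64))       # toy dataset, one example per row
H = sigmoid(X @ W.T + b_h)      # hidden activations for every example
penalty = sparsity_penalty(H)
\end{verbatim}

The sigmoid keeps every $\rho_j$ strictly between 0 and 1, so both logarithms are well defined; the resulting term is simply added to the reconstruction loss.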

Another variant of autoencoders are contractive autoencoders \cite{rifai2011contractive}. These try to learn efficient features by penalizing the sensitivity of the network to its inputs. Sensitivity measures how much the output changes when the input changes a little: the lower the sensitivity, the more likely similar inputs are to produce similar features. For example, let's imagine the task of recognizing handwritten digits. Some people draw the 0 in an elongated way, others round it out, but the differences are small, only a couple of pixels. We would want our network to learn the same features for all 0s. The sensitivity is penalized with the squared Frobenius norm of the Jacobian of the function between the input layer and the hidden one:
$$ \|J_f(x)\|_F^2 = \sum_{i,j} \left( \frac{\partial h_j(x)}{\partial x_i} \right)^2 $$
For a sigmoid hidden layer this penalty has a simple closed form, sketched below. All these models can be combined, of course: we can impose sparsity constraints and corrupt the input data at the same time. Which of these techniques is better depends a lot on the nature of the data, so you must experiment with various kinds of autoencoders to see which gives you the best results.
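For the sigmoid encoder sketched earlier, the Jacobian factorises as $\partial h_j / \partial x_i = h_j (1 - h_j) W_{ji}$, which makes the penalty cheap to evaluate. The following is a sketch under that assumption, not a general implementation:

\begin{verbatim}
def contractive_penalty(x):
    # ||J_f(x)||_F^2 for a sigmoid hidden layer:
    #   sum_j (h_j * (1 - h_j))^2 * sum_i W_ji^2
    h = encode(x)
    return np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1))
\end{verbatim}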