Abstract

Smoothness regularization is a popular method for decreasing generalization error. We propose a novel regularization technique that rewards local distributional smoothness (LDS), a KL-divergence-based measure of the model’s robustness against perturbation. The LDS is defined in terms of the direction to which the model distribution is most sensitive. Our technique closely resembles adversarial training (Goodfellow 2015), but differs in that it determines the adversarial direction from the model distribution alone and does not use information from the labels. The technique is therefore applicable to semi-supervised training. When we applied our technique to the classification problem on permutation invariant MNIST, it not only outperformed all the models that do not depend on generative models and pre-training, but also performed well in comparison to the state-of-the-art method (Rasmus 2015), which uses a highly advanced generative model.

Keywords: Classification, Neural networks, Regularization, Deep learning, Supervised learning, Semi-supervised learning, Adversarial examples

Overfitting is an unavoidable challenge in supervised and semi-supervised training of classification and regression functions. When the training dataset is finite, the training error, defined as the average of the log-likelihood computed from the training set, is bound to differ from the test error, defined as the expectation of the log-likelihood with respect to the true underlying probability measure. Asymptotic analysis (Akaike 1998, Watanabe 2009) is useful in determining how far the training error deviates from the test error. Unfortunately, asymptotic analysis of this kind only aims to describe the asymptotic behavior of the expectation. It cannot completely resolve the indeterminacy of the true distribution.

The most common countermeasure against overfitting is to add a regularization function to the cost function, which originally consists of the log-likelihood function alone. Because the optimization process based on the regularized cost function aims to optimize both terms, regularization essentially reduces the effect of the log-likelihood term on model selection, and hence the effect of the size of the training dataset. \(L_{2}\) and \(L_{1}\) regularizations are popular methods (Friedman 2001). However, except for simple models like linear regression, the effects of the \(L_{2}\) and \(L_{1}\) regularization terms on the model distribution can be complex. This is especially true for complex models like deep neural networks. Moreover, reparametrization alters the effect of the \(L_{2}\) and \(L_{1}\) regularization on the model distribution, and hence the identity of the model distribution that optimizes the loss function. For the ultimate goal of decreasing the generalization error, a regularization function that reflects our belief about the true distribution is preferable. In many familiar subjects like image and time-series analysis, smooth model distributions tend to perform better than nonsmooth distributions (Wahba 1990) in terms of the generalization error. Throughout this article, we will therefore adopt this belief about the 'good model'. We also seek a loss function \(T(\theta)\) that is parametrization invariant. That is, if \(\theta^{*}\) is the parameter that minimizes the loss function and if \(\eta=f(\theta)\), the reparametrized loss function \(\tilde{T}(\eta)\) should take its minimum value at \(f(\theta^{*})\).
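As a small worked illustration of why an \(L_{2}\) penalty fails this invariance requirement, consider a one-parameter toy problem. The quadratic fit term, the weight \(\lambda>0\), and the map \(f(\theta)=2\theta\) are assumptions chosen only for this example. In the original parametrization,

\[
T(\theta)=(\theta-1)^{2}+\lambda\theta^{2}
\quad\Longrightarrow\quad
\theta^{*}=\frac{1}{1+\lambda}.
\]

Writing the same model in terms of \(\eta=2\theta\) and applying the \(L_{2}\) penalty to \(\eta\) instead gives

\[
\tilde{T}(\eta)=\left(\frac{\eta}{2}-1\right)^{2}+\lambda\eta^{2}
\quad\Longrightarrow\quad
\eta^{*}=\frac{2}{1+4\lambda}\neq\frac{2}{1+\lambda}=f(\theta^{*}).
\]

The two minimizers correspond to different model distributions whenever \(\lambda>0\), so in this toy case the \(L_{2}\)-regularized solution depends on the choice of parametrization.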

Adversarial training (Goodfellow 2015) is a recent form of parametrization-invariant smoothness regularization. It is a method that aims to improve the local smoothness of the model in the neighborhood of every observed datapoint. At each step of the training, Goodfellow et al. identified, for each pair of observed input and output, the direction of the input perturbation to which the classifier’s assignment of the training label is most sensitive. Goodfellow et al. then penalized the model’s sensitivity to perturbations in these adversarial directions.
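The idea of a label-dependent adversarial direction can be sketched on a toy model. The following is a minimal illustration, not the implementation of Goodfellow et al.: it assumes a hypothetical linear-softmax classifier (`W`, `b`), an \(L_{2}\) bound `eps` on the perturbation, and the helper names `predict`, `adversarial_perturbation`, and `adversarial_loss`, all of which are introduced here for illustration only. With an \(L_{\infty}\) bound, the sign of the gradient would be used instead of the normalized gradient.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(W, b, x):
    """Class probabilities of a toy linear-softmax model (hypothetical)."""
    logits = [sum(w * xj for w, xj in zip(row, x)) + bk for row, bk in zip(W, b)]
    return softmax(logits)

def adversarial_perturbation(W, b, x, y, eps):
    """Perturbation of L2 norm eps that most increases -log p(y|x) to first order."""
    p = predict(W, b, x)
    # For a linear-softmax model, grad_x of -log p(y|x) is W^T (p - onehot(y)).
    err = [pk - (1.0 if k == y else 0.0) for k, pk in enumerate(p)]
    g = [sum(W[k][j] * err[k] for k in range(len(W))) for j in range(len(x))]
    norm = math.sqrt(sum(c * c for c in g))
    if norm == 0.0:
        return [0.0] * len(x)
    return [eps * c / norm for c in g]

def adversarial_loss(W, b, x, y, eps):
    """Negative log-likelihood of the true label at the adversarially perturbed input."""
    r = adversarial_perturbation(W, b, x, y, eps)
    p_adv = predict(W, b, [xj + rj for xj, rj in zip(x, r)])
    return -math.log(p_adv[y])

# Example: three classes, two input features (arbitrary illustrative numbers).
W = [[1.0, -2.0], [0.5, 0.3], [-1.0, 1.0]]
b = [0.0, 0.1, -0.1]
x, y = [0.2, -0.4], 0
print(adversarial_loss(W, b, x, y, eps=0.5))
```

Note that the perturbation requires the label `y`, which is why this form of regularization can only be computed on labeled examples.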

We propose a novel parametrization-invariant smoothness regularization that builds on the philosophy of adversarial training. At each step in the training, we identify for each observed input the direction of the input perturbation to which the model distribution itself is most sensitive in the sense of Kullback-Leibler divergence (KL divergence). Our adversarial perturbation is virtual in that it is determined without the training label. At each input, we can therefore define the local robustness of the model distribution against perturbation in the virtual adversarial direction. The local robustness defined this way serves as a measure of the local smoothness of the distribution, or local distributional smoothness (LDS). We propose Virtual Adversarial Training (VAT), a simple regularization technique that rewards the average of the LDS over all training inputs.
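The label-free construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the reference implementation: the model is a hypothetical linear-softmax classifier (`W`, `b`), the perturbation is bounded in \(L_{2}\) norm by `eps`, and the virtual adversarial direction is approximated by a power-iteration-style step from a random start. The names `xi`, `n_power`, `virtual_adversarial_perturbation`, and `lds` are introduced here for illustration.

```python
import math
import random

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(W, b, x):
    """Class probabilities of a toy linear-softmax model (hypothetical)."""
    logits = [sum(w * xj for w, xj in zip(row, x)) + bk for row, bk in zip(W, b)]
    return softmax(logits)

def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def unit(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v] if n > 0 else list(v)

def virtual_adversarial_perturbation(W, b, x, eps, xi=1e-3, n_power=1, seed=0):
    """Approximate the L2-bounded perturbation that most changes p(.|x) in KL."""
    rng = random.Random(seed)
    p = predict(W, b, x)  # current model distribution; no label is used
    d = unit([rng.gauss(0.0, 1.0) for _ in x])
    for _ in range(n_power):
        q = predict(W, b, [xj + xi * dj for xj, dj in zip(x, d)])
        # For a linear-softmax model, grad_r KL(p || q(x + r)) = W^T (q - p).
        g = [sum(W[k][j] * (q[k] - p[k]) for k in range(len(W)))
             for j in range(len(x))]
        d = unit(g)
    return [eps * dj for dj in d]

def lds(W, b, x, eps):
    """Negative KL between the predictions at x and at x + r_vadv (always <= 0)."""
    r = virtual_adversarial_perturbation(W, b, x, eps)
    p = predict(W, b, x)
    q = predict(W, b, [xj + rj for xj, rj in zip(x, r)])
    return -kl(p, q)

# Example: the same quantity is computable for labeled and unlabeled inputs.
W = [[1.0, -2.0], [0.5, 0.3], [-1.0, 1.0]]
b = [0.0, 0.1, -0.1]
print(lds(W, b, [0.2, -0.4], eps=0.5))
```

Because `lds` takes only the input `x`, averaging it over a batch that mixes labeled and unlabeled inputs is straightforward, which is what makes a regularizer of this kind usable in the semi-supervised setting.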

Like adversarial training, VAT is a 'parametrization invariant' regularization technique. We applied VAT to the classification problem on the permutation invariant MNIST dataset. Our method not only outperformed all the models that do not depend on generative models and pre-training, but it also pe