Model Compression

Deep neural networks have recently achieved great progress on a wide range of applications, including image classification \cite{krizhevsky2012imagenet}, object detection \cite{girshick2015fast} and speech recognition \cite{hinton2012deep}. Increasingly, these networks are trained on large-scale compute clusters or high-performance graphics processing units. However, not only is training these state-of-the-art networks computationally expensive, the resulting models can also grow so large that they alone exceed the available memory on mobile devices, which prevents deploying them on smartphones. One way around this problem is to perform the prediction remotely on a server; however, this approach only works when network bandwidth is available and it incurs delays through network traffic. Alternatively, simply training smaller models for deployment on mobile devices \cite{chun2009augmented} is problematic, as it significantly reduces accuracy.

This dilemma motivates a stream of research that focuses on reducing the size of neural networks without significantly reducing their accuracy. This direction is supported by recent work \cite{denil2013predicting} demonstrating that the weights of neural networks contain a surprisingly high amount of redundancy. In early work, \cite{lecun1989optimal} introduced a way to drop unimportant weights; however, this approach requires additional parameters to store the sparse weights and additional time to fine-tune the resulting architecture. In a different direction, \cite{gupta2015deep} train neural networks with reduced bit precision in order to shrink the model size. \cite{bucilua2006model}, \cite{ba2014deep} and \cite{hinton2015distilling} take yet another approach to compressing neural networks by learning a ``distilled'' model: a compact network is trained on a weighted combination of the original labels and the softmax outputs of a larger model. Lastly, and closest to our work, is the HashedNets approach proposed by \cite{hashnets}, which compresses neural networks by reducing the number of unique parameters: each entry of a weight matrix is drawn from a small set of shared values by hashing its index. The authors show that this approach largely preserves the accuracy of the network while significantly reducing its model size. However, during matrix-matrix products a shared parameter has to be looked up for every entry of the weight matrix, which results in slow classification. In this work we design and develop a fast implementation for the product of a hashed weight matrix with an input matrix (a batch of input vectors) in order to significantly improve the classification speed of HashedNets.
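
To make the cost of this per-entry lookup concrete, the following sketch illustrates HashedNets-style weight sharing in a simplified form: each entry $(i, j)$ of the virtual weight matrix is fetched from the small array of real parameters via a hash of its index, so a straightforward matrix product performs one hash computation and one indirect memory access per virtual entry. The code is a minimal illustration in Python/NumPy; the function names are ours, Python's built-in \texttt{hash} merely stands in for the fast hash function used in practice, and the sketch is not the fast implementation developed in this work.

\begin{verbatim}
import numpy as np

def hashed_weight(real_weights, i, j):
    # Map position (i, j) of the virtual weight matrix to one of the
    # few shared real parameters.  The built-in hash of the index
    # tuple stands in for a dedicated fast hash function.
    return real_weights[hash((i, j)) % real_weights.size]

def naive_hashed_matmul(real_weights, n_rows, n_cols, x):
    # Naive product of the virtual (hashed) weight matrix with a
    # batch of inputs x of shape (n_cols, batch_size).  Every virtual
    # entry triggers one hash computation and one indirect memory
    # access, which is what makes this formulation slow.
    out = np.zeros((n_rows, x.shape[1]))
    for i in range(n_rows):
        for j in range(n_cols):
            out[i] += hashed_weight(real_weights, i, j) * x[j]
    return out
\end{verbatim}

The nested loop makes explicit that the number of hash computations and indirect loads grows with the full size of the virtual weight matrix rather than with the small number of real parameters, which is the overhead our fast implementation targets.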