authorea.com/77614

CudaHashedNet Midterm Report

As available datasets increase in size, machine learning models can successfully use more and more parameters. In applications such as computer vision, models with up to 144 million (Simonyan 2014) parameters are not uncommon and reach state-of-the-art performance. Experts can train and deploy such models on large machines, but effective use of lower-resource hardware such as commodity laptops or even mobile phones remains a challenge.

One way to address the challenge of large models is through model compression using hashing (Chen 2015). In general, this amounts to reducing a parameter set \(S = \{s_0, s_1, ... s_D\}\) to a greatly reduced set \(R = \{r_0, r_1, ..., r_d\}\) with \(d \ll D\) by randomly tying parameters to hash buckets (\(s_i = r_{h(i)}\)). This turned out to perform very well for neural networks, leading to the so-called HashedNets.

Many machine learning models involve several linear projections representable by matrix-vector products \(W \cdot x\) where x is input data and \(W\) consists of model parameters. In most such models, this linear algebra operation is the performance bottleneck; neural networks, in particular, chain a large number of matrix-vector products, intertwined with non-linearities. In terms of dimensionality, modern systems deal with millions training samples \(x_i\) lying in possibly high-dimensional spaces. The shape of \(W\), (\(d_{out}\), \(d_{in}\)), depends on how deep a layer is in a network: at the first layer, \(d_{in}\) depends on the data being processed, while \(d_{out}\) at the final layer depends on the desired system output (i.e., \(d_{out}=1\) for binary classification, and \(d_{out}=p\) if the output can fall in \(p\) classes). In middle layers, dimensionality is up to the model designer, and increasing it can make the model more powerful but bigger and slower. Notably, middle layers often have square \(W_h\). When \(W\) is stored in a reduced hashed format \(W_h\), many common trade-offs may change.

The goal of our project is to explore the performance bottlenecks of the \(W_h \cdot x\) operation where \(W_h\) is a hashed representation of an array that stays constant for many inputs \(x_i\). Since neural networks are typically trained with batches of input vectors \(x\) concatenated into an input matrix \(X\), we will look at the general case of matrix-matrix multiplication, where the left matrix is in a reduced hashed format \(W_h \cdot X\).

Taking advantage of massively parallel GPU architecture can be important even when dealing with smaller models. In March 2015, Nvidia announced a SoC for mobile devices with a GPU performance of 1 teraflop, the Tegra X1 (NVIDIA 2015); we foresee future mobile devices to have stronger and stronger GPUs.

The objectives of our project are to:

Investigate fast applications of \(W_h\cdot X\) when \(W_h\) is small enough to be fully loaded into memory. In this case, is it faster to first materialize the hashed array and use existing fast linear algebra routines? Can the product be computed faster on a GPU with minimal memory overhead? This can lead to highly efficient deployment of powerful models on commodity hardware or phones.

Analyze performance when even after hashing \(W_h\) is too big. In the seminal work that popularized the usage large scale deep convolutional neural networks and training using the GPU (Krizhevsky 2012), Krizhevsky predicts that GPUs with more memory can lead to bigger networks with better performance. Hashing-based compression can help practitioners prototype very large models on their laptops before deciding which configuration to spend cloud computing resources on. Can we make HashedNets training on GPUs efficient? This may involve forcing a predictable split on the hash function to allow for independent division of work.

Graphics processing units (GPUs) have evolved, driven by the video game market, to perform more specific kinds of computation faster than the general purpose CPU. They are particularly well suited for *data-parallel computing* in a single-instruction multiple-device (SIMD) fashion (Flynn 1966). The advantages of such a pipeline architecture for linear algebra operations have been noted early (Heller 1978). Today, with the commoditization of powerful GPUs, the scientific computing community has developed multiple libraries and toolkits for GPU-enabled linear algebra, including cuBLAS (NVIDIA 2008), CULA (Humphrey 2010), Magma (Agullo 2009), ViennaCL (Rupp 2010). Paralleling the basic linear algebra subsystem (BLAS) (Lawson 1979), such libraries implement general matrix-vector and matrix-matrix multiplication operators (gemv, gemm). CUDA documentation (NVIDIA 2015) describes a simple implementation of gemm that takes advantage of the device memory structure; there are research efforts dedicated to exploring the bottlenecks and optimizing matrix multiplication in general (Volkov 2008) as well as for specific cases such as sparse matrices (Baskaran 2008, Bell 2008), but we are not aware of research on implicitly-defined matrices with repeated values. In fact, (Fatahalian 2004) argue that matrix-matrix multiplication can be suboptimal on GPUs because the higher cost of data fetching; our work will investigate whether this problem can be significantly alleviated through hash-based data compression.

Deep neural networks have recently achieved great progress on a wide set of applications, including image classification (Krizhevsky 2012), object detection (Girshick 2015) and speech recognition (Hinton 2012). More and more, these networks are trained on large-scale compute clusters or high-performance graphics processing units. However, not only is the training procedure of these state-of-the-art networks computationally expensive, but the models get so large that the model alone can exceed the available memory on mobile devices. This prohibits running such models on smartphones. One solution to this problem could be to perform the prediction remotely on a server. However, this approach only works when network bandwidth is available and it incurs delays through network traffic. Further, simply training smaller models to be deployed on mobile devices (Chun 2009) is problematic as it significantly reduces accuracy.

This dilemma motivates a stream of research that focuses on reducing the size of neural networks without significantly reducing accuracy. This direction has been supported by recent work (Denil 2013) that demonstrates that the weights in neural networks have a surprisingly high amount of redundancy. In early work (LeCun 1989) introduced a way to drop unimportant weights. However, this approach requires additional parameters to store the sparse weights and additional time to fine-tune the resulting architecture. In a different direction, (Gupta 2015) train neural networks with reduced bit precision in order to compress the model size. (Buciluǎ 2006), (Ba 2014) and (Hinton 2015) show a different approach to compress neural networks, by learning a “distilled” model. In particular a compact neural network is trained on a weighted combination of the original labels and softmax output of a larger model. Lastly, and closest to our work is the HashedNet approach proposed by (Chen 2015) that compresses neural networks by reducing the set of unique parameters. Every parameter is drawn, by computing its hash, from a small set of unique values. The authors show that this approach largely sustains the accuracy of the network while significantly reducing its model size. However, for this approach new parameters need to be loaded for each weight in the weight matrix during matrix-matrix products. This results in slow classification. In this work we design and develop a fast implementation for the product of a hashed weight matrix with an input matrix (a batch of input vectors) that would significantly improve the classification speed of HashedNets.

## Share on Social Media