CudaHashedNet Midterm Report

Andreas Veit (av443) and Vlad Niculae (vn66)

Introduction

As available datasets increase in size, machine learning models can successfully use more and more parameters. In applications such as computer vision, models with up to 144 million parameters (Simonyan 2014) are not uncommon and reach state-of-the-art performance. Experts can train and deploy such models on large machines, but effective use of lower-resource hardware such as commodity laptops or even mobile phones remains a challenge.

One way to address the challenge of large models is through model compression using hashing (Chen 2015). In general, this amounts to reducing a parameter set \(S = \{s_0, s_1, \ldots, s_D\}\) to a much smaller set \(R = \{r_0, r_1, \ldots, r_d\}\) with \(d \ll D\) by randomly tying parameters to hash buckets (\(s_i = r_{h(i)}\)). This approach turns out to perform very well for neural networks, leading to the so-called HashedNets.
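To make the parameter tying concrete, the sketch below shows one way the lookup \(s_i = r_{h(i)}\) can be implemented: the virtual parameter array \(S\) is never stored, and each access is redirected into the small bucket array \(r\) through a cheap hash of the index. The integer-mixing hash here is a placeholder assumption, not the specific hash function used by Chen et al. (2015).

```cuda
#include <cstdint>

// Map a virtual parameter index i into one of d hash buckets.
// Placeholder multiplicative integer hash; any well-mixing hash works here.
__host__ __device__ inline uint32_t hash_index(uint32_t i, uint32_t d) {
    uint32_t h = i * 2654435761u;
    h ^= h >> 16;
    return h % d;                       // bucket in [0, d)
}

// Read the i-th virtual parameter from the reduced parameter array r of size d.
__host__ __device__ inline float virtual_param(const float* r, uint32_t d, uint32_t i) {
    return r[hash_index(i, d)];         // s_i = r_{h(i)}
}
```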

Many machine learning models involve several linear projections representable by matrix-vector products \(W \cdot x\), where \(x\) is input data and \(W\) consists of model parameters. In most such models, this linear algebra operation is the performance bottleneck; neural networks, in particular, chain a large number of matrix-vector products, intertwined with non-linearities. In terms of dimensionality, modern systems deal with millions of training samples \(x_i\) lying in possibly high-dimensional spaces. The shape of \(W\), \((d_{out}, d_{in})\), depends on where a layer sits in the network: at the first layer, \(d_{in}\) depends on the data being processed, while \(d_{out}\) at the final layer depends on the desired system output (e.g., \(d_{out}=1\) for binary classification, and \(d_{out}=p\) if the output can fall in \(p\) classes). In middle layers, dimensionality is up to the model designer, and increasing it can make the model more powerful but bigger and slower. Notably, middle layers often have square \(W\). When \(W\) is stored in a reduced hashed format \(W_h\), many common trade-offs may change.
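As a point of reference, the dense per-layer operation \(y = W \cdot x\) looks roughly like the naive CUDA kernel below, with one thread per output coordinate. Production libraries such as cuBLAS implement the same operation with far more care about tiling and memory access, so this is only an illustrative baseline.

```cuda
// Naive dense baseline: y = W x for a layer with weight matrix W of shape
// (d_out, d_in), stored row-major. One thread computes one output element.
__global__ void gemv_naive(const float* W, const float* x, float* y,
                           int d_out, int d_in) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < d_out) {
        float acc = 0.0f;
        for (int col = 0; col < d_in; ++col)
            acc += W[row * d_in + col] * x[col];
        y[row] = acc;
    }
}
```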

The goal of our project is to explore the performance bottlenecks of the \(W_h \cdot x\) operation, where \(W_h\) is a hashed representation of an array that stays constant across many inputs \(x_i\). Since neural networks are typically trained on batches of input vectors \(x\) concatenated into an input matrix \(X\), we will look at the general case of matrix-matrix multiplication \(W_h \cdot X\), where the left matrix is in a reduced hashed format.
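A first, untuned sketch of this operation is shown below: the entries of \(W_h\) are never materialized but are looked up through the hash on the fly, reusing the hash_index helper from the earlier sketch. The layout and hash are assumptions for illustration, not the final design; an efficient kernel would tile into shared memory and coalesce accesses.

```cuda
// C = W_h * X, where the virtual entry W[i][k] is fetched as r[hash_index(i*d_in + k, d)]
// from the small bucket array r of size d. X is (d_in, n) and C is (d_out, n),
// both row-major. One thread computes one element of C.
__global__ void hashed_gemm_naive(const float* r, int d,        // hashed weights
                                  const float* X,               // (d_in, n)
                                  float* C,                     // (d_out, n)
                                  int d_out, int d_in, int n) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // output row
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // output column (input sample)
    if (i < d_out && j < n) {
        float acc = 0.0f;
        for (int k = 0; k < d_in; ++k) {
            float w_ik = r[hash_index(i * d_in + k, d)];   // implicit W_h entry
            acc += w_ik * X[k * n + j];
        }
        C[i * n + j] = acc;
    }
}
```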

Taking advantage of massively parallel GPU architectures can be important even when dealing with smaller models. In March 2015, Nvidia announced the Tegra X1, an SoC for mobile devices with a GPU performance of 1 teraflop (NVIDIA 2015); we expect future mobile devices to ship with increasingly powerful GPUs.

The objectives of our project are to:

  1. Investigate fast application of \(W_h\cdot X\) when \(W_h\) is small enough to be fully loaded into memory. In this case, is it faster to first materialize the hashed array and use existing fast linear algebra routines (a baseline sketched after this list)? Can the product be computed faster on a GPU with minimal memory overhead? This can lead to highly efficient deployment of powerful models on commodity hardware or phones.

  2. Analyze performance when \(W_h\) is too large to fit in memory even after hashing. In the seminal work that popularized the use of large-scale deep convolutional neural networks trained on GPUs (Krizhevsky 2012), Krizhevsky predicts that GPUs with more memory will lead to bigger networks with better performance. Hashing-based compression can help practitioners prototype very large models on their laptops before deciding which configuration to spend cloud computing resources on. Can we make HashedNets training on GPUs efficient? This may involve forcing a predictable split on the hash function to allow for an independent division of work.
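For objective 1, the natural baseline is to expand \(W_h\) into a dense matrix and hand the product to cuBLAS. Below is a rough sketch of that path, assuming the hash_index helper above and a preallocated scratch buffer for the expanded matrix; the names and launch configuration are illustrative, not the final design. Note that cuBLAS assumes column-major storage, so X and C are stored column-major here, unlike the row-major sketch above.

```cuda
#include <cublas_v2.h>
// Assumes hash_index(...) from the first sketch is visible here.

// Expand the implicit W_h into a dense d_out x d_in matrix, stored column-major
// as cuBLAS expects. One thread fills one entry of W.
__global__ void materialize_W(const float* r, int d, float* W, int d_out, int d_in) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // linear index into W
    if (idx < d_out * d_in) {
        int i = idx % d_out;          // row (column-major storage)
        int k = idx / d_out;          // column
        W[idx] = r[hash_index(i * d_in + k, d)];
    }
}

// Baseline: materialize W, then let cuBLAS compute C = W * X.
void hashed_gemm_via_cublas(cublasHandle_t handle, const float* r, int d,
                            float* W_scratch,                 // d_out*d_in scratch buffer
                            const float* X, float* C,         // column-major
                            int d_out, int d_in, int n) {
    int total = d_out * d_in;
    materialize_W<<<(total + 255) / 256, 256>>>(r, d, W_scratch, d_out, d_in);
    const float alpha = 1.0f, beta = 0.0f;
    // C (d_out x n) = W (d_out x d_in) * X (d_in x n), all column-major.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                d_out, n, d_in,
                &alpha, W_scratch, d_out, X, d_in, &beta, C, d_out);
}
```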

Related work

Linear algebra on GPUs

Graphics processing units (GPUs) have evolved, driven by the video game market, to perform more specific kinds of computation faster than the general-purpose CPU. They are particularly well suited for data-parallel computing in a single-instruction multiple-data (SIMD) fashion (Flynn 1966). The advantages of such a pipelined architecture for linear algebra operations were noted early on (Heller 1978). Today, with the commoditization of powerful GPUs, the scientific computing community has developed multiple libraries and toolkits for GPU-enabled linear algebra, including cuBLAS (NVIDIA 2008), CULA (Humphrey 2010), Magma (Agullo 2009), and ViennaCL (Rupp 2010). Paralleling the Basic Linear Algebra Subprograms (BLAS) (Lawson 1979), such libraries implement general matrix-vector and matrix-matrix multiplication operators (gemv, gemm). The CUDA documentation (NVIDIA 2015a) describes a simple implementation of gemm that takes advantage of the device memory hierarchy; there are research efforts dedicated to exploring the bottlenecks and optimizing matrix multiplication in general (Volkov 2008) as well as for specific cases such as sparse matrices (Baskaran 2008, Bell 2008), but we are not aware of research on implicitly defined matrices with repeated values. In fact, Fatahalian et al. (2004) argue that matrix-matrix multiplication can be suboptimal on GPUs because of the higher cost of data fetching; our work will investigate whether this problem can be significantly alleviated through hash-based data compression.