In the deep learning literature there are many methods for producing images that correspond to particular classes or to specific neurons in a CNN [Zeiler]. Two main families exist. Deconvolution methods rely on an input image: they highlight the pixels in that image that activate a neuron of interest, and therefore cannot work without one. Other methods instead maximize the class score, or the activation of a chosen neuron, with respect to the pixel intensities [Simonyan]. However, these methods only work well for lower-level features, i.e. shallower neurons. At higher layers, neurons represent more abstract concepts such as ``dog''. An image that is optimal for such a neuron may therefore contain ten dogs all over the image in different orientations, along with tails and dog ears that are not attached to any dog. We propose several potential methods that do not rely on an input image and can generate realistic depictions of the abstract concepts encoded by neurons ``deep'' in the CNN.
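For concreteness, the score-maximization approach of [Simonyan] can be sketched as follows. This is only an illustration: the linear score `w @ x` is a made-up stand-in for a CNN class score, and the sizes and step-size choices are arbitrary.

```python
import numpy as np

# Sketch of Simonyan-style class-score maximization: starting from a blank
# "image" x, take gradient ascent steps on S_c(x) - lam * ||x||^2.
# S_c here is a hypothetical linear score w . x; in the real method it is
# the CNN's unnormalized class score and the gradient comes from backprop.
rng = np.random.default_rng(0)
w = rng.normal(size=64)        # made-up class-score weights
x = np.zeros(64)               # start from a blank image
lam, lr = 0.1, 0.5             # L2 penalty and step size (illustrative)

for _ in range(100):
    grad = w - 2 * lam * x     # d/dx [ w.x - lam * ||x||^2 ]
    x += lr * grad             # ascent step on the regularized score
```

With a real network the loop is identical, but the gradient is obtained by backpropagating from the class score to the pixels.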

The key reason abstract concepts such as ``dog'' cannot be generated by the above method is that there are multiple features in multiple locations that can fire the ``dog'' neuron. In real images, however, dogs do not appear all over the sky, nor do gigantic ears exist by themselves without an attached body. Since, intuitively, shallower neurons correspond to smaller features and higher-level neurons correspond to combinations of shallower features, a natural way to avoid generating unrealistic images is to gather statistics of the joint distributions of shallower features. These statistics could be used in a variety of ways. For example, we could run the optimization method mentioned in class and then inspect the activations it produces. If the activations of the shallow features appear to be an outlier of the joint distribution, we can decide that the activations of certain neurons need to be reduced. Once those neurons have been identified, we can backpropagate from any one of them and take gradient steps that decrease its activation rather than increase it. This can be seen as a method combining Deconv with the method introduced by Simonyan.
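The outlier check could look like the following sketch, where the activation data, the Gaussian model, and the flagged neuron are all hypothetical:

```python
import numpy as np

# Hypothetical illustration of the outlier check: fit a Gaussian to
# shallow-layer activation vectors gathered from real images, then score
# a generated activation vector by its Mahalanobis distance.  Neurons
# that dominate the distance are candidates for a gradient *descent*
# step on their activation rather than ascent.
rng = np.random.default_rng(1)
real_acts = rng.normal(loc=1.0, scale=0.5, size=(500, 8))  # made-up data

mu = real_acts.mean(axis=0)
cov = np.cov(real_acts, rowvar=False)
cov_inv = np.linalg.inv(cov)

def mahalanobis(a):
    d = a - mu
    return float(np.sqrt(d @ cov_inv @ d))

generated = mu.copy()
generated[3] += 10.0                      # one neuron fires far too strongly

# Per-neuron contribution singles out the neuron to suppress.
contrib = (generated - mu) ** 2 / np.diag(cov)
worst = int(np.argmax(contrib))           # -> neuron 3
```

A real implementation would need a richer density model than a single Gaussian, but the decision rule (flag by distance, suppress the dominant contributors) is the same.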

One could also conceptually build joint distributions over layers $k$ and $k+1$ for all $k$ less than the number of layers. Suppose we want to generate the abstract concept that a neuron $N$ represents. Initially, we find which activations of neurons in the previous layer are associated with $N$ firing; these most likely follow some distribution. We can therefore sample activations from the joint distribution in which the activation of $N$ is held fixed. Repeating this procedure, we proceed back toward the image: at each step we fix the layer-$(k+1)$ activations in the joint distribution and sample the layer-$k$ activations from the resulting conditional.
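Under a simplifying Gaussian assumption on the joint activations (almost certainly wrong in detail, but enough to show the conditioning machinery), the fix-and-sample step has a closed form. The layer sizes and data below are invented for illustration:

```python
import numpy as np

# Conditional sampling sketch under a Gaussian assumption on the joint
# activations of layer k (4 neurons) and layer k+1 (2 neurons).  All data
# here is synthetic; real activation distributions need not be Gaussian.
rng = np.random.default_rng(2)
a_k = rng.normal(size=(1000, 4))                         # layer-k samples
a_k1 = a_k[:, :2] + a_k[:, 2:] + 0.1 * rng.normal(size=(1000, 2))
joint = np.hstack([a_k, a_k1])                           # joint samples

mu = joint.mean(axis=0)
cov = np.cov(joint, rowvar=False)
Sxx, Sxy = cov[:4, :4], cov[:4, 4:]                      # x = layer k
Syx, Syy = cov[4:, :4], cov[4:, 4:]                      # y = layer k+1

def sample_layer_k(a_fixed, n=1):
    """Sample layer-k activations given fixed layer-(k+1) activations."""
    gain = Sxy @ np.linalg.inv(Syy)
    cond_mu = mu[:4] + gain @ (a_fixed - mu[4:])
    cond_cov = Sxx - gain @ Syx
    cond_cov = (cond_cov + cond_cov.T) / 2               # symmetrize
    return rng.multivariate_normal(cond_mu, cond_cov, size=n)

samples = sample_layer_k(np.array([2.0, -1.0]), n=500)
```

Chaining this step layer by layer is exactly the proposed procedure: fix the sampled layer-$(k+1)$ activations, draw layer $k$, and repeat toward the image.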

As one can see, many potential ideas become plausible given the extra information from statistics of many images passing through the convnet. We aim to try a few methods, improve our understanding, and then iterate toward improved methods that generate better images.

In our problem we aim to use a pretrained CNN to generate random images corresponding to abstract concepts. We will use the pretrained 16-layer VGGNet model from Oxford University. We will pass many images of a specific class (obtained from ImageNet) through the network to capture statistics of neuron activations. We will then use our methods to generate random images corresponding to abstract concepts. We expect to generate more realistic images than those of Simonyan, which we can test by directly comparing our generated images to those created by Simonyan.
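Schematically, the statistics-gathering pass might look like the following, with `forward_to_layer` standing in for the real pretrained VGGNet forward pass and all sizes made up:

```python
import numpy as np

# Schematic of the statistics-gathering pass.  forward_to_layer is a
# placeholder for the pretrained VGGNet truncated at the chosen layer;
# image sizes and the number of activations are invented.
rng = np.random.default_rng(6)

def forward_to_layer(img):
    """Stand-in for a forward pass up to the chosen layer."""
    return np.tanh(img.ravel()[:32])       # 32 hypothetical activations

class_images = [rng.normal(size=(8, 8)) for _ in range(100)]  # one class
acts = np.stack([forward_to_layer(im) for im in class_images])

mu, sigma = acts.mean(axis=0), acts.std(axis=0)   # per-neuron statistics
```

In the actual pipeline, `class_images` would be ImageNet images of the target class, and richer statistics than the per-neuron mean and standard deviation (e.g. joint or PCA summaries) would be saved.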

In this section, we describe our approach to the problem. We start by discussing the tools we used (Section 3.1), then our technical approach (Section 3.2), and finally the preliminary results obtained (Section 3.3).

We explored several deep learning packages: Caffe, TensorFlow (Keras), and Theano (Keras). We initially decided to use Caffe given the extensive availability of popular pretrained deep-learning models on that platform. However, we had difficulty setting up Caffe on both local and cloud machines. We were finally able to set up the VGG model on Keras[] with Theano (Bastien 2012) as the backend. We successfully converted the Caffe VGGNet model to work with Keras and validated it with various random test images.

Our approach to the Stochastic Deconvolution problem (described in Section 1), which recreates realistic images backwards from a class label, can be broken down into three main steps:

As of now, we have been able to compute gradients of a downstream neuron's activations with respect to the input image, to visualize the output sensitivity of that neuron. We are exploring the Keras and Theano packages to modify the gradient, using approaches similar to guided backpropagation, in order to better represent the effects of different image areas on the activations of neurons in a given layer.
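As a toy analogue of what these gradients look like (not our actual Keras/Theano code), consider a single ReLU neuron applied to one image patch; its gradient with respect to the input is the filter itself, masked by the receptive field and the ReLU:

```python
import numpy as np

# Toy analogue of a neuron's gradient w.r.t. the input: for a single
# ReLU neuron a = relu(sum(W * patch)), the saliency d a / d image is W
# written into the receptive field when the neuron is active, and zero
# elsewhere.  Image size, filter, and location are all made up.
rng = np.random.default_rng(3)
img = rng.normal(size=(8, 8))
W = rng.normal(size=(3, 3))
r, c = 2, 4                               # receptive-field corner

pre = float((W * img[r:r+3, c:c+3]).sum())
act = max(pre, 0.0)                       # ReLU activation

grad = np.zeros_like(img)
if pre > 0:
    grad[r:r+3, c:c+3] = W                # d act / d img inside the field
```

Guided backprop modifies exactly this picture: it additionally zeroes gradient entries that are negative, rather than propagating them.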

Given a target input class and a set of images corresponding to it, we want to summarize the statistics of the activations at each layer using techniques such as PCA. We plan to save the statistics of the neural activations at a chosen layer for the target class to disk. We will then sample a set of activations for that layer and work backwards from it to reconstruct an image (Section 3.2.3). Since we sample a different random set of activations each time we draw from the distribution, we expect to get a different image for the target class on each run.
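A possible shape for this step, assuming PCA is the summarizer and using made-up activation data:

```python
import numpy as np

# Possible shape of step 2, assuming PCA summarizes the activations: fit
# principal components to per-class activation vectors, then draw new
# activation vectors by sampling component scores.  Data is synthetic.
rng = np.random.default_rng(4)
acts = rng.normal(size=(200, 32)) @ rng.normal(size=(32, 32))

mu = acts.mean(axis=0)
U, S, Vt = np.linalg.svd(acts - mu, full_matrices=False)
k = 8                                     # keep the top-8 components
std = S[:k] / np.sqrt(len(acts) - 1)      # per-component std deviation

def sample_activations(n=1):
    """Draw activation vectors from the fitted low-rank Gaussian."""
    scores = rng.normal(size=(n, k)) * std
    return mu + scores @ Vt[:k]

new_acts = sample_activations(5)
```

Saving `mu`, `std`, and `Vt[:k]` to disk is all that is needed to resample activations later, which keeps the stored statistics far smaller than the raw activation matrix.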

Given a set of activations of neurons at the chosen layer, we then aim to reconstruct a representative image of the class. We will work backwards from the activations sampled in step 2 (Section 3.2.2), using the gradients calculated in step 1 (Section 3.2.1), to optimize an objective function over the image. We will also add other components to the objective function to ensure that the resulting images are realistic, e.g. sharpness of edges, colour composition, etc. We plan to try different methods and objective functions for reconstructing the image in order to obtain realistic images for a target class.
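One candidate objective, written out as a sketch: a hypothetical design in which a linear map stands in for the network's feature extractor and a total-variation penalty plays the role of the realism term. Neither choice is final.

```python
import numpy as np

# Hypothetical reconstruction objective: match the sampled target
# activations through a stand-in linear feature map A, plus a
# total-variation penalty encouraging locally smooth, realistic images.
rng = np.random.default_rng(5)
A = rng.normal(size=(16, 64))        # stand-in for the network's features
target = rng.normal(size=16)         # activations sampled in step 2
lam = 0.01                           # weight of the realism term

def loss_and_grad(x):
    img = x.reshape(8, 8)
    resid = A @ x - target
    dx = img[:, 1:] - img[:, :-1]    # horizontal differences
    dy = img[1:, :] - img[:-1, :]    # vertical differences
    loss = 0.5 * resid @ resid + lam * ((dx ** 2).sum() + (dy ** 2).sum())
    g = np.zeros_like(img)
    g[:, 1:] += 2 * lam * dx
    g[:, :-1] -= 2 * lam * dx
    g[1:, :] += 2 * lam * dy
    g[:-1, :] -= 2 * lam * dy
    return loss, A.T @ resid + g.ravel()

x, losses = np.zeros(64), []
for _ in range(200):
    loss, grad = loss_and_grad(x)
    losses.append(loss)
    x -= 5e-3 * grad                 # gradient descent on the objective
```

In the real pipeline the activation-matching term is nonlinear (the CNN replaces `A`) and its gradient comes from step 1's backpropagation; the structure of the loop is unchanged.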

We have been able to compute simple gradients of a neuron's activation with respect to the input image. Gradients of neurons at different layers of VGGNet are shown below in Figures \ref{fig:g1}, \ref{fig:g2}, \ref{fig:g3}, and \ref{fig:g4}.

From these results it appears that neurons in the early layers do not contain information rich enough for us to sample and work backwards from. Neurons further down the network represent more abstract, macro-level features of the image, e.g. a neuron that fires when the input image contains an ear. We think it would be ideal to reconstruct the image using sampled activations from some layer in the middle, as we hypothesize that the usable information is maximal somewhere between the first and last layers. We aim to run experiments to determine which layer works best with our reconstruction algorithm.