is the KL-divergence. KL-divergence is used because the outputs of $g_k$ can be interpreted as a distribution over categories.
The KL-divergence is differentiable, so the loss can be optimized with backpropagation and stochastic gradient descent.
Knowledge is transferred from both a scene network and an object network (K = 2).
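The transfer loss above can be sketched in a few lines. This is a minimal numpy re-implementation (the paper's code used Torch7, so everything here is an assumption for illustration): for each of the K teachers, the student's softmax output is pushed toward the teacher's softmax output via KL-divergence.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class dimension.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def kl_transfer_loss(student_logits_list, teacher_logits_list):
    # Sum over the K teachers of the mean per-example KL(teacher || student).
    total = 0.0
    for s, t in zip(student_logits_list, teacher_logits_list):
        q = softmax(s)  # student output distribution
        p = softmax(t)  # teacher output distribution (the target)
        kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=1)
        total += kl.mean()
    return total
```

Because the loss only depends on the logits through softmax, it is zero whenever the student reproduces the teachers' category distributions exactly, and differentiable everywhere else.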

Sound Classification

- The categories to be classified from sound may not appear in the visual models. In that case, the output layer is ignored and the internal representation of a hidden layer is used as input features to train a linear SVM.

Implementation

- Torch7
- Adam optimizer
- learning rate: 0.001
- momentum term: 0.9
- batch size: 64
- weights initialized with zero-mean Gaussian noise, std 0.01
- batch normalization after each convolution
- 100,000 iterations
- Optimization took 1 day on a GPU
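The pieces of this recipe that are easy to get wrong can be sketched directly. Below is an assumed numpy re-implementation (not the original Torch7 code) of the Gaussian weight initialization and a single Adam update with the hyperparameters listed above (lr = 0.001, momentum term beta1 = 0.9):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(shape, std=0.01):
    # Zero-mean Gaussian initialization with std 0.01, as listed above.
    return rng.normal(0.0, std, size=shape)

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update; beta1 = 0.9 is the "momentum term" quoted above.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)   # bias correction for the second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

In practice one would of course use a framework optimizer; the sketch only pins down the exact update the reported hyperparameters imply.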

Experiments

Two training stages: one with unlabeled videos, one with sound only.
1st stage:
- 2M videos for training
- 140,000 videos for validation
2nd stage:
- Use the hidden representation of the trained network as a feature extractor for learning on smaller labeled sound-only datasets.
- Train an SVM on these features.
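The second stage can be sketched as follows, assuming scikit-learn is available; the features here are random placeholders standing in for actual SoundNet hidden activations (the extraction step itself is not shown):

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_sound_classifier(features, labels):
    # Linear SVM on frozen SoundNet features (one-vs-rest by default).
    clf = LinearSVC(C=1.0)
    clf.fit(features, labels)
    return clf

# Usage with placeholder data standing in for extracted features:
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 256))   # e.g. hidden-layer activations
y = rng.integers(0, 2, size=40)  # e.g. binary class labels
clf = train_sound_classifier(X, y)
```

Because the network itself is frozen, only the light-weight linear classifier is fit on the small labeled dataset, which is what makes the transfer practical.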

Acoustic Scene Classification

- The DCASE, ESC-50, and ESC-10 datasets are described.

Ablation Analysis

Comparison of Loss and Teacher Net
- Performance improves with visual supervision.
- Using both the ImageNet and Places networks as supervision is better than using either one alone.
Comparison of Network Depth
- The eight-layer architecture is 8% better than the five-layer network.
- The five-layer network still outperforms the previous state of the art.
Comparison of Supervision
- Train the network without video, using only the target sound training set.
- The output of the network is class probabilities.
- The five-layer network performs slightly better than a convolutional network trained on the same data.
- The eight-layer network performs worse, possibly because of overfitting.
Comparison of Layer and Teacher Network
- Features from the pool5 layer give the best performance.
- Different teacher networks were tried: one did better on one database, the other on another, so the comparison is inconclusive.

Multi-Modal Recognition

Vision vs Sound Embeddings
- 2-dimensional t-SNE is used to visualize features from the visual networks and from SoundNet.
- Sound features alone also contain a considerable amount of semantic information.
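The visualization step is straightforward with scikit-learn's t-SNE (an assumed tool choice; the features below are random placeholders standing in for SoundNet or visual-network activations):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 256))  # placeholder hidden activations

# Project to 2-D for plotting; perplexity must stay below the sample count.
emb = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(features)
```

Points that land near each other in the 2-D embedding correspond to inputs with similar feature vectors, which is how semantic clusters become visible.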
Object and Scene Classification
- Trained an SVM over both sound and visual features.
- Although sound is not as informative as vision, it still contains a considerable amount of discriminative information.

Visualizations

- Visualize what the network has learned.
- Learned filters are diverse: low and high frequencies, wavelet-like patterns, increasing and decreasing amplitude filters.

Conclusion

- Trains deep sound networks (SoundNet) by transferring knowledge from established vision networks using large amounts of unlabelled video.
- The transfer yields semantically rich audio representations for natural sounds; a powerful paradigm.