SoundNet: Learning Sound Representations from Unlabeled Video \cite{aytar2016soundnet}

project website: https://projects.csail.mit.edu/soundnet/
pdf link: https://arxiv.org/pdf/1610.09001.pdf
github link: https://github.com/cvondrick/soundnet

Main points:

- Task: acoustic scene/event classification from natural sound
- System: teacher-student training procedure.
    Teacher: pretrained visual recognition models, with video frames as input
    Student: a fully convolutional DNN, with the raw waveform as input
- Dataset: 2M unlabelled videos
- Results: state of the art for acoustic scene/object classification

Introduction

- Progress in natural sound understanding has lagged behind vision; one reason is the lack of large labeled sound datasets.
- The paper proposes to transfer progress in computer vision into sound, using unlabelled video databases as a bridge.
- A deep fully convolutional network learns directly from raw waveforms. It is trained with visual supervision but has no dependence on vision during inference.
- Primary contribution: the development of a large-scale and semantically rich representation for natural sound.

Related Work

Sound Recognition
- Focus on natural sounds, not music or speech.
- Prior methods are limited by the amount of available labeled data.
- The paper uses a large-scale dataset of 2M videos, which makes deeper networks feasible.
- Sound recognition is supervised by rich visual models rather than manual labels.
Transfer learning
- Teacher-student: compress knowledge from a complex model into a simpler model without losing considerable accuracy.
- Usually the teacher and student share the same modality; in this paper the teacher's modality is video and the student's is audio.
Cross-modal learning and unlabelled video
- Builds on work that learns by relating modalities and on methods that exploit large amounts of unlabelled video; here the natural synchronization between vision and sound in video provides the supervisory signal.

Large Unlabelled Video Dataset

- 2M videos from Flickr (chosen because Flickr videos are not professionally edited, so the audio remains natural)
- Post-processing: convert to MP3, downsample to a 22 kHz sampling rate, convert to mono, and scale the waveform to the range [-256, 256]
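
A minimal sketch of this preprocessing in Python; librosa is an assumption for illustration (the paper does not specify tooling):

```python
import numpy as np
import librosa  # assumption: any loader that yields float samples in [-1, 1] works

def preprocess(path, sr=22050):
    """Load an audio track as mono 22 kHz and scale it to [-256, 256]."""
    wav, _ = librosa.load(path, sr=sr, mono=True)  # librosa returns floats in [-1, 1]
    wav = wav * 256.0                              # rescale to the paper's [-256, 256] range
    return np.clip(wav, -256.0, 256.0).astype(np.float32)
```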

Learning Sound Representations

Deep Convolutional Sound Network

Convolutional Network

- One-dimensional convolutions followed by nonlinearities.
- Convolutional networks are well suited to audio signals because: 1. they are invariant to translations, which reduces the number of parameters; 2. layers can be stacked, which enables detecting higher-level concepts on top of lower-level detectors.
Variable Length Input/Output
- A fully convolutional network handles inputs and outputs of variable length.
Network Depth
- Experiments with two networks: 1. five layers. 2. eight layers. A simplified sketch of the fully convolutional design follows.
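
A simplified PyTorch sketch of such a 1D fully convolutional network; the layer widths, kernel sizes, and strides here are illustrative assumptions, not the paper's exact 5- or 8-layer configuration:

```python
import torch
import torch.nn as nn

class TinySoundNet(nn.Module):
    """Illustrative 1D fully convolutional network over raw waveforms."""
    def __init__(self, num_classes=401):  # num_classes is illustrative
        super().__init__()
        self.features = nn.Sequential(
            # each block: strided 1D convolution -> batch norm -> nonlinearity
            nn.Conv1d(1, 16, kernel_size=64, stride=2, padding=32),
            nn.BatchNorm1d(16), nn.ReLU(), nn.MaxPool1d(8),
            nn.Conv1d(16, 32, kernel_size=32, stride=2, padding=16),
            nn.BatchNorm1d(32), nn.ReLU(), nn.MaxPool1d(8),
            nn.Conv1d(32, 64, kernel_size=16, stride=2, padding=8),
            nn.BatchNorm1d(64), nn.ReLU(),
        )
        # a convolutional (not fully connected) head keeps the network
        # fully convolutional, so any input length is accepted
        self.head = nn.Conv1d(64, num_classes, kernel_size=8, stride=2)

    def forward(self, x):  # x: (batch, 1, num_samples)
        return self.head(self.features(x))

# variable-length input yields variable-length output:
net = TinySoundNet()
print(net(torch.randn(1, 1, 22050 * 5)).shape)   # 5 s of 22 kHz audio
print(net(torch.randn(1, 1, 22050 * 20)).shape)  # 20 s -> longer time axis
```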

Visual Transfer into Sound

- State-of-the-art vision networks teach the paper's sound network to recognize scenes and objects: the sound network is trained to match the class posteriors that the vision networks produce on the paired video frames.
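
The paper trains the student to minimize the KL divergence between its output distribution and the teacher's posterior. A minimal PyTorch sketch of that objective; function and variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def transfer_loss(student_logits, teacher_logits):
    """KL(teacher || student) between class distributions, averaged over the batch."""
    teacher_probs = F.softmax(teacher_logits, dim=-1).detach()  # frozen vision teacher: no gradient
    student_logp = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean")

# teacher_logits: frozen vision CNN applied to sampled video frames
# student_logits: sound network applied to the same video's waveform
```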