Future Research

Time Dependent Structure

We observe a number of failure cases in our present model which suggest interesting areas for future research. Our first observation is that the model fails to capture certain time-dependent structures and long-range dependencies. These include long-range frequency changes, such as those typically present in power-up sounds, where the pitch increases or decreases as a function of time, as well as audio signals that contain repeated structure (Fig. \ref{667351}), which require the generator to place multiple copies of the same audio signal at different locations in the final output.

In future work, we would like to extend our current model, or a variant of it, to a recurrent architecture able to capture these long-range dependencies and better model time-dependent structure. This would also allow us to generate audio of varying length, rather than being limited to fixed-length output. It has previously been observed that the common LSTM and GRU recurrent networks have difficulty reliably repeating patterns in an output signal over long time periods \cite{graves2014neural}. As such, a model with explicit memory, such as the Neural Turing Machine (NTM) \cite{graves2014neural} or the more recent Differentiable Neural Computer (DNC) \cite{graves2016hybrid}, could be an interesting avenue of future research for generating the repeated structures that are often found in audio data. We theorize that a typical GAN, such as the one presented here, could generate the base waveforms required for the audio signal; these waveforms could then be stored in the explicit memory of an NTM or DNC and written out by the controller network at the appropriate locations in the output signal.

We prefer a GAN approach to training over explicitly modelling the loss as the error against the training signal, because GANs are able to generate individual samples from a plausible set, while explicit loss models are indecisive and tend to generate the average over the set of plausible samples. For images, this averaging produces the typical blurry output seen in many generative models. Applications of the GAN framework to recurrent architectures remain relatively unexplored, and we believe this would be an interesting area for future research.
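To make the first failure mode concrete, a prototypical power-up sound can be approximated by a chirp whose instantaneous frequency rises linearly over time; generating such a signal requires a coherent frequency trajectory across the entire output, exactly the long-range structure our model struggles with. The following minimal sketch (assuming NumPy and SciPy are available; all parameter values are illustrative rather than drawn from our dataset) synthesizes such a waveform:

\begin{verbatim}
import numpy as np
from scipy.io import wavfile

# Illustrative parameters (not drawn from our dataset).
sample_rate = 16000             # samples per second
duration = 1.0                  # seconds
f_start, f_end = 220.0, 880.0   # rising pitch, as in a "power-up" sound

t = np.linspace(0.0, duration, int(sample_rate * duration), endpoint=False)
# The instantaneous frequency rises linearly from f_start to f_end;
# the phase is the integral of the frequency over time.
phase = 2.0 * np.pi * (f_start * t
                       + (f_end - f_start) * t**2 / (2.0 * duration))
waveform = np.sin(phase).astype(np.float32)

wavfile.write("power_up.wav", sample_rate, waveform)
\end{verbatim}

An equivalent result can be obtained with scipy.signal.chirp; we show the explicit phase integral here to emphasize that the frequency trajectory spans the whole signal rather than any local window.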

Layered Sound Effects

Our model also has trouble generating layered sound effects. The presence of layered signals in audio data presents an interesting distinction from the image data that GANs are typically trained on. Although images can contain layered signals, for example where a scene contains reflections or transparency, there are typically only two or three signal layers in these cases, whereas audio data typically contains many overlapping signals. Sound effects in particular are typically built up by layering different sounds on top of one another to produce rich and expressive audio. We find that our generative model has a tendency to produce somewhat muddy audio, without clearly distinguishable layers. An interesting area of future research would be a network that produces a varying number of outputs, each corresponding to a separate audio source signal; these source signals could then be mixed to produce the final output. Again, an NTM or DNC could be used to handle the varying number of output signals. To train such a network, it would first be necessary to perform blind signal separation on the training data, separating individual source signals from the original mixed signals (see the sketch below). The network could then be trained to generate these sets of source signals instead of the original mixed signal.
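As a rough sketch of this preprocessing step, the example below applies FastICA, a standard blind signal separation method, to a toy layered signal. We stress that this is only an illustration under strong assumptions: the variable names are ours, and classical ICA requires roughly as many mixture channels as sources, whereas a recorded sound effect provides a single mixed channel, for which more sophisticated single-channel separation techniques would be needed.

\begin{verbatim}
import numpy as np
from sklearn.decomposition import FastICA

# Toy layered signal: three independent sources mixed into three channels.
# (Illustrative only; a real sound effect yields a single mixed channel,
# which would require single-channel separation methods instead.)
rng = np.random.default_rng(0)
n = 16000
t = np.arange(n) / 16000.0
sources = np.column_stack([
    np.sin(2 * np.pi * 440 * t),           # tonal layer
    np.sign(np.sin(2 * np.pi * 3 * t)),    # slow square-wave layer
    rng.laplace(size=n),                   # noise layer
])
mixing = rng.normal(size=(3, 3))
mixtures = sources @ mixing.T              # observed layered signals

ica = FastICA(n_components=3, random_state=0)
estimated_sources = ica.fit_transform(mixtures)   # shape (n, 3)
# estimated_sources could then serve as per-layer training targets.
\end{verbatim}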

Automatic Metrics

In this research we relied on human judges to determine the quality of our model. However, during development it is difficult, time-consuming and expensive to use human judges to evaluate the performance of different models against one another. It is therefore desirable to have an automatic evaluation method that correlates well with human judgment. Probably the most common automatic metric for generative models is the Inception Score, which has been shown to correlate with human judgment \cite{barratt2018note}. However, this metric has recently been superseded by the Fréchet Inception Distance \cite{heusel2017gans}, which compares the statistics of the generated signals against those of the real signals, instead of evaluating generated signals in isolation. Both metrics use coding layers from publicly available pre-trained Inception classifier networks, ensuring consistent evaluation across different research papers. The Inception networks used by these metrics are typically trained to classify images in the ImageNet dataset; unfortunately, no corresponding network trained to classify a broad variety of audio signals is available. Creating such a network is an important area of future research, as it would allow the Fréchet Inception Distance to be computed on audio data, which could then be used to quickly evaluate the quality of a given generative model.
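For completeness, the Fréchet Inception Distance fits a Gaussian \((\mu, \Sigma)\) to the coding-layer activations of the real and generated samples and computes

\[ \mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right). \]

The metric itself is simple to compute once a suitable embedding network exists. The sketch below (assuming NumPy and SciPy, with a hypothetical pre-trained audio classifier playing the role of the Inception network and supplying the embeddings) shows the calculation:

\begin{verbatim}
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats, gen_feats):
    """Frechet distance between Gaussians fitted to feature embeddings.

    real_feats, gen_feats: arrays of shape (n_samples, feature_dim),
    e.g. coding-layer activations from a (hypothetical) pre-trained
    audio classifier playing the role of the Inception network.
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)

    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):       # discard tiny imaginary parts
        covmean = covmean.real

    diff = mu_r - mu_g
    return diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean)
\end{verbatim}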