Model
Our model builds heavily on the WaveGAN architecture
\cite{donahue2018wavegan}. The original WaveGAN implementation ported the DCGAN
architecture \cite{radford2015unsupervised} to 1D to work with audio data. We replace this 1D DCGAN model with our own, heavily inspired by Progressive GAN \cite{karras2017progressive}.
Conditioning
We also extend the model to support text conditioning, taking heavy inspiration from the text conditioning implementation of StackGAN \cite{dimitris2017}. StackGAN uses text embeddings that are downsampled to 128 dimensions; however, the embedding model used is not specified. We choose ELMo \cite{peters2018deep} for our text embeddings, for the reasons stated in Section \ref{151756}.
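To make this concrete, the sketch below shows one way the conditioning pathway could be wired up in PyTorch. Only the 128-dimensional downsampling and the use of ELMo embeddings (1024-dimensional by default) come from the description above; the linear projection, the concatenation with the latent code, and the latent dimensions are our illustrative assumptions.
\begin{verbatim}
import torch
import torch.nn as nn

class TextConditioner(nn.Module):
    """Project a pre-computed sentence embedding down to a small
    conditioning vector and append it to the latent code.
    The 1024-dim input matches ELMo's default output size; the
    128-dim output follows the downsampling described above."""

    def __init__(self, embed_dim=1024, cond_dim=128):
        super().__init__()
        # Hypothetical projection layer; the exact downsampling
        # operation is not specified above.
        self.project = nn.Sequential(
            nn.Linear(embed_dim, cond_dim),
            nn.LeakyReLU(0.2),
        )

    def forward(self, z, text_embedding):
        cond = self.project(text_embedding)
        return torch.cat([z, cond], dim=1)

# Usage: condition a batch of 4 latent codes on 4 sentence embeddings.
z = torch.randn(4, 128)
elmo = torch.randn(4, 1024)          # stand-in for real ELMo output
latent = TextConditioner()(z, elmo)  # shape: (4, 256)
\end{verbatim}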
Progressive Growing
We extend the original Progressive GAN model \cite{karras2017progressive} by replacing each pair of convolutions with a residual block, containing two convolutions and one skip connection. Our residual blocks are similar to those used in the Improved Training of Wasserstein GANs paper \cite{gulrajani2017improved}, and we retain that paper's pre-activation scheme, which has previously been shown to give better results than other activation orderings \cite{he2016identity}. Our residual block design is shown in Figure \ref{801834}.
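A minimal sketch of the block just described, assuming a PyTorch implementation: normalization and non-linearity precede each convolution (the pre-activation ordering), and a 1x1 convolution adapts the skip path when the channel count changes. The kernel size and choice of normalization are our assumptions, not details taken from the figure.
\begin{verbatim}
import torch
import torch.nn as nn

class PreActResBlock1d(nn.Module):
    """Pre-activation residual block over 1D audio feature maps:
    two convolutions on the main path plus one skip connection."""

    def __init__(self, in_ch, out_ch, kernel=9):
        super().__init__()
        pad = kernel // 2
        # Pre-activation ordering: norm -> nonlinearity -> conv.
        self.body = nn.Sequential(
            nn.BatchNorm1d(in_ch), nn.LeakyReLU(0.2),
            nn.Conv1d(in_ch, out_ch, kernel, padding=pad),
            nn.BatchNorm1d(out_ch), nn.LeakyReLU(0.2),
            nn.Conv1d(out_ch, out_ch, kernel, padding=pad),
        )
        # 1x1 conv on the skip path only when channels change
        # (an assumption; the figure may handle this differently).
        self.skip = (nn.Conv1d(in_ch, out_ch, 1)
                     if in_ch != out_ch else nn.Identity())

    def forward(self, x):
        return self.body(x) + self.skip(x)
\end{verbatim}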
Progressive growing allows our model to first learn to generate simpler, lower-frequency versions of the sound effects in the dataset, then progressively add higher-frequency detail as the model grows. We implement progressive growing by assigning each up and down block an amount, α, by which it is turned on. This amount is determined by the current level of detail (audio LOD) at which the model is being trained (see Fig. \ref{471330}). The output of the 'to audio' layer, after the first residual block in the generator, constitutes the first audio LOD, and consists of only sixteen samples (Fig. \ref{471330}).

Each up and down block can be in one of three modes. When fully off, we treat the layer as a skip connection, implemented as a simple nearest-neighbor upsample or an average downsample for up and down blocks respectively. When fully on, the skip connection is ignored and the layer acts as a residual block. When partially on, which occurs while transitioning between LODs, we linearly interpolate between the output of the skip connection and the output of the residual block: $y = (1 - \alpha)\,\mathrm{skip}(x) + \alpha\,\mathrm{res}(x)$.

Since the generator and discriminator mirror each other, once a signal has passed the current LOD layer in the generator, it essentially skips the remaining generator layers and is passed through to the layer in the discriminator that matches the current LOD. This skipping works because the generator's up-sampling and the discriminator's down-sampling are inverse operations. See Figures \ref{471330} and \ref{801834} for a detailed overview of our architecture.
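The sketch below, again assuming PyTorch, shows how a generator up block could realize the three modes: α = 0 reduces it to the nearest-neighbor skip connection, α = 1 to the residual block alone, and intermediate values blend the two during an LOD transition. It reuses the PreActResBlock1d sketched above; the 1x1 channel-matching convolution and the placement of the upsample before both paths are our assumptions (Fig. \ref{471330} gives the authoritative layout).
\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class FadeInUpBlock(nn.Module):
    """Generator up block with progressive-growing fade-in,
    blending a plain upsample with a residual block by alpha."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.res_block = PreActResBlock1d(in_ch, out_ch)
        # 1x1 conv so the skip path matches the residual path's
        # channel count (an illustrative assumption).
        self.match = nn.Conv1d(in_ch, out_ch, 1)

    def forward(self, x, alpha):
        # Nearest-neighbor upsample feeds both paths.
        x = F.interpolate(x, scale_factor=2, mode="nearest")
        if alpha <= 0.0:    # fully off: skip connection only
            return self.match(x)
        if alpha >= 1.0:    # fully on: residual block only
            return self.res_block(x)
        # Partially on: linear interpolation between the two paths.
        return (1.0 - alpha) * self.match(x) + alpha * self.res_block(x)

# Usage: a 16-sample feature map mid-way through an LOD transition.
x = torch.randn(2, 64, 16)
y = FadeInUpBlock(64, 32)(x, alpha=0.5)  # shape: (2, 32, 32)
\end{verbatim}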