Text to Sound Effect Synthesis using Generative Adversarial Networks

Lance Chaney,
fawad.zaidi

Abstract

Currently available sound effect generators for games produce only 8-bit, retro-style sound effects and are not suitable for other game genres. Given the importance of good sound effects to the overall experience of a game, and the lack of highly skilled sound engineers in the indie development space, we explore new avenues for sound effect generation that can produce sound effects for a wider range of genres without requiring any technical expertise. We present a novel neural network model that produces short snippets of audio similar to those in whatever dataset it is trained on. In addition, we add the ability to condition the model on text, so that a user can enter conditioning text and retrieve a corresponding generated sound effect. Our results show that users are able to generate sound effects that match their entered conditioning text with a probability above 50%. We also observe high quality and interpretability in the generated sound effects: an average quality rating of 3.78 out of 5, as rated by a panel of human judges, and an interpretability score of 62%, only 10 percentage points below our baseline interpretability of 72% on the real data. However, increasing the diversity of the generated sound effects for any given conditioning text remains an important open area of research.