Evaluation Methodology

We evaluate our model using a panel of five human judges. The evaluation covers five criteria: sound effect interpretability, sound effect realism, sound effect diversity, condition matching, and overall sound effect quality. We use a model trained on the Magic Elements dataset for our evaluation. This dataset contains many distinct modalities while also providing a constrained context for sound effect generation. This allows judges to quickly understand what kinds of sound effects can be generated, so that conditioning text queries stay in line with the generator's knowledge. It also helps judges interpret sound effects in the absence of visual feedback by narrowing the scope of sound effects to the category of Magic.
Our interpretability test is designed to evaluate whether the generator is able to produce sounds that are immediately recognizable. This is an important quality for sound effects, as they are used, along with visual feedback, to sell a particular action in game. If a sound effect is not immediately recognizable, it may fail to adequately sell an effect and may instead create a disconnect between what the player is hearing and what is happening in game. To evaluate interpretability, we first establish a baseline using the real sound effects. For each judge, we draw ten sound effects from the dataset at random. Each judge then labels each sound effect with the spell, or the general magic-related action, they think it represents. We then ask each judge to label ten generated sound effects in the same manner. Conditioning texts for the generated samples are selected in the same manner as described for the training process in Section~\ref{151756}. Afterwards, judges are shown the true conditioning texts for both the real and generated examples, and each judge decides whether they labeled each sample correctly. We leave this decision up to the judges, as perceived correctness matters more than exact matches when it comes to interpretability.
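Concretely, writing $c_j^{\mathrm{real}}$ and $c_j^{\mathrm{gen}}$ for the number of the ten real and ten generated samples that judge $j$ reports as correctly labeled (notation introduced here only for exposition), one natural summary of the panel's results is the pair of interpretability rates
\[
I_{\mathrm{real}} = \frac{1}{|J|}\sum_{j \in J} \frac{c_j^{\mathrm{real}}}{10},
\qquad
I_{\mathrm{gen}} = \frac{1}{|J|}\sum_{j \in J} \frac{c_j^{\mathrm{gen}}}{10},
\]
where $J$ is the set of judges. Comparing $I_{\mathrm{gen}}$ against the baseline $I_{\mathrm{real}}$ indicates how much interpretability is lost when moving from real to generated sound effects.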
The realism test is a straightforward test of whether judges can distinguish the real sound effects from the generated ones. Each judge is presented with ten sound effects, each of which is either real or generated; the judges are not told up front which are which, and new sound effects are generated for each judge. The judges then mark which sound effects they believe are real and which they believe are generated. Afterwards, we compare the judges' answers with the correct answers to see how well the judges were able to pick out the generated sound effects. Again, conditioning texts for the generated samples are selected in the same manner as described in Section~\ref{151756}.
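One way to report this test, again with notation introduced only for exposition, is the per-judge detection accuracy
\[
A_j = \frac{1}{10}\sum_{i=1}^{10} \mathbf{1}\!\left[\hat{y}_{j,i} = y_{j,i}\right],
\]
where $y_{j,i} \in \{\text{real}, \text{generated}\}$ is the true label of the $i$-th sound effect shown to judge $j$ and $\hat{y}_{j,i}$ is that judge's guess. An average accuracy near chance level (e.g.\ $0.5$ if real and generated samples are presented in equal proportion) would suggest that the generated sound effects are difficult to distinguish from the real ones.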
The conditioning, quality, and diversity tests are combined. This is done to ensure we evaluate quality and diversity on test data rather than training data, where the model may have overfit. Each judge enters ten different conditioning texts to be used to generate sound effects, and five sound effects are generated for each conditioning text. Judges are asked to evaluate whether any of the generated samples for a given conditioning text match it; the possible answers are Yes, No, and Maybe/Not Quite. Judges are also asked to rate the audio quality and diversity of each set of five samples for a given conditioning text. Audio quality is rated on a scale from one to five, where one represents unstructured noise and five represents a sound effect that is usable in game. Diversity is also rated on a scale from one to five, where one represents no diversity between samples and five represents very high diversity between samples.
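Under the same expository notation, with $q_{j,k}$ and $d_{j,k}$ denoting judge $j$'s quality and diversity ratings for their $k$-th conditioning text, the panel's results can be summarized by the mean scores
\[
\bar{q} = \frac{1}{10\,|J|}\sum_{j \in J}\sum_{k=1}^{10} q_{j,k},
\qquad
\bar{d} = \frac{1}{10\,|J|}\sum_{j \in J}\sum_{k=1}^{10} d_{j,k},
\]
together with the proportion of conditioning texts answered Yes, No, and Maybe/Not Quite for condition matching.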