Interpretability
Our baseline interpretability on the real dataset comes out at 72% (Fig. \ref{626565}), while interpretability on the generated sound effects comes out at 62% (Fig. \ref{576840}). We note that these values are somewhat artificially depressed, because our conditioning text dataset contains several single keywords that do not, on their own, describe the sounds they are paired with. Keywords such as 'hard', 'low' and 'fast' appeared frequently during the interpretability tests; when they occur, judges are forced to mark their guessed keywords as incorrect, even when it may be clear what the sound effect actually represents. However, since these keywords appear as true labels in both the real and generated interpretability tests, the comparison between the two should still be fair.

Our interpretability score for generated sound effects is fairly high; however, this may be due to the small number of modalities covered by our evaluation generator. It is fairly easy for judges to guess one of the six main elements from the magic dataset (ice, fire, earth, air, black and generic) and be correct most of the time, although judges were asked to be more specific in their labelling (e.g. 'ice hitting a hard surface and shattering'). Fire was the most easily identifiable sound effect for our judges. Our model seems to have a fairly easy time generating various fire effects, perhaps because fire tends to be very chaotic in nature, so any noise from the generator is covered up. Many of our judges labelled our generated impact sounds (especially snow impacts) as footsteps, indicating that we are missing some of the detail needed to convey what kinds of objects are present in the collisions. Friction sound effects such as ice sliding over ice and falling debris were also difficult for our judges to identify due to their low quality, as discussed below in our falling debris case study (\ref{786773}).
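For concreteness, the interpretability percentages above amount to a simple tally of judge responses, computed separately for real and generated clips. The following is a minimal sketch under that assumption; the data layout and field names are illustrative only and are not taken from our actual evaluation scripts.

\begin{verbatim}
# Minimal sketch of the interpretability tally. Each judge response is
# assumed to be recorded as (clip id, source, judged_correct); these
# names are illustrative, not our actual evaluation code.
from dataclasses import dataclass

@dataclass
class JudgeResponse:
    clip_id: str
    source: str           # "real" or "generated"
    judged_correct: bool  # judge's guessed keyword matched the true label

def interpretability(responses, source):
    """Fraction of correctly identified clips for one source."""
    relevant = [r for r in responses if r.source == source]
    if not relevant:
        return 0.0
    return sum(r.judged_correct for r in relevant) / len(relevant)

# Example usage: the two sources are scored separately.
responses = [
    JudgeResponse("fire_01", "real", True),
    JudgeResponse("ice_03", "generated", False),
    JudgeResponse("snow_07", "generated", True),
]
print(f"real:      {interpretability(responses, 'real'):.0%}")
print(f"generated: {interpretability(responses, 'generated'):.0%}")
\end{verbatim}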