Conditioning Text
Conditioning text is added to the model via ELMo text embeddings\cite{peters2018deep}. ELMo has recently shown state-of-the-art results on many natural language processing tasks\cite{peters2018deep} and can be dropped in as a replacement for the more standard Word2Vec\cite{peters2018deep} or GloVe\cite{pennington2014glove} text embedding models. Since ELMo builds its word representations from characters, it can embed misspelled or made-up words, which makes it a good fit for game sound effect descriptions, where the text may contain made-up, misspelled, or conjoined words. We collect conditioning text for each sound effect by analyzing its original file path, its filename, and any additional metadata included in the WAV file itself. We then pass this text through a series of dataset-specific and non-dataset-specific filters to remove non-descriptive words, such as the name of the collection a sound effect came from, and numeric values, which are not useful for training. Finally, we use spaCy to extract nouns, actions, and descriptive phrases from the filtered text. At training time, we randomly select one of the texts associated with each example audio clip and embed it with ELMo\cite{peters2018deep} before passing it to the model.
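The collection, filtering, and sampling steps described above can be illustrated with a minimal sketch. It is not the actual pipeline: the function names, the example path, and the contents of \texttt{COLLECTION\_WORDS} are hypothetical, the filters are reduced to a single stop-word set plus removal of non-letter characters, and the spaCy phrase extraction and ELMo embedding steps are omitted so that the example depends only on the Python standard library.

```python
import random
import re
from pathlib import PurePosixPath

# Hypothetical dataset-specific stop words, e.g. the name of a collection.
COLLECTION_WORDS = {"gamesfx", "library"}


def candidate_texts(file_path, metadata=()):
    """Return cleaned description strings for one sound effect.

    Assumed sources: each directory name in the path, the file name
    without its extension, and any metadata strings read from the file.
    """
    path = PurePosixPath(file_path)
    sources = list(path.parts[:-1]) + [path.stem] + list(metadata)
    texts = []
    for source in sources:
        # Splitting on runs of non-letters also discards numeric values.
        tokens = re.split(r"[^A-Za-z]+", source)
        kept = [t.lower() for t in tokens
                if t and t.lower() not in COLLECTION_WORDS]
        text = " ".join(kept)
        # Skip sources that were filtered away entirely, and duplicates.
        if text and text not in texts:
            texts.append(text)
    return texts


def sample_text(texts, rng=random):
    """Pick one associated text at random, as done per training example."""
    return rng.choice(texts)


texts = candidate_texts(
    "GameSFX_Library/Weapons/laser_blast_01.wav",
    metadata=["Sci-fi laser shot"],
)
print(texts)
print(sample_text(texts, random.Random(0)))
```

In this sketch the collection name and the trailing index are removed, while the directory name, the file name, and the metadata string each yield one candidate text. Passing an explicit \texttt{random.Random} instance makes the per-example selection reproducible.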