Introduction
Sound engineering for games is an important and challenging task for game developers. Good sound effects can make a game a unique and memorable experience. However, in small indie teams sound effects are often neglected and left to the last minute, primarily because a dedicated sound engineer is out of reach for teams with limited budgets. For non-experts, the task of creating, finding, and modifying sound effects to fit a particular game scenario is difficult and frequently produces unsatisfactory results. For this reason, sound effect generation tools such as Bfxr\cite{games} are invaluable to indie developers. Unfortunately, these tools only work for a small subset of game genres, chiefly the retro 8-bit style, as they are only able to generate very simplistic sound effects. This limitation stems from their reliance on older procedural generation methods: basic waveforms (sine, sawtooth, square, etc.) are used as a base, and combinations of effects and modifiers such as attack time, sustain time, compression, frequency slide, and flanger sweep are applied to produce the final sound effect. Variations of these sound effects can then be created by randomly perturbing the modifier parameters.

In recent years machine learning has come a long way in both image and voice synthesis. In particular, several generative models have recently succeeded in producing output that is nearly indistinguishable from the real thing\cite{van2016wavenet}\cite{karras2017progressive}. In this work we apply these recent advances in machine learning to the task of sound effect generation for games. Our approach is based on generative adversarial networks, which have recently had success in generating photo-realistic images at resolutions up to 1 megapixel\cite{karras2017progressive}. These models can only produce fixed-size output; however, this is a good fit for sound effect generation, where many sound effects are less than 1 second in duration. Generation of audio samples of arbitrary length is left as an open area for future research.
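To make the contrast with learned models concrete, the sketch below illustrates the kind of procedural pipeline used by tools such as Bfxr: a primitive oscillator shaped by an amplitude envelope, with variations created by randomly perturbing the modifier parameters. The function names and parameter ranges are illustrative only and are not taken from Bfxr's actual implementation.

\begin{verbatim}
import numpy as np

SAMPLE_RATE = 44100

def square_wave(freq_hz, duration_s):
    """Generate a basic square wave, one of the primitive oscillators."""
    t = np.arange(int(duration_s * SAMPLE_RATE)) / SAMPLE_RATE
    return np.sign(np.sin(2 * np.pi * freq_hz * t))

def apply_envelope(samples, attack_s, sustain_s):
    """Shape amplitude with a simple attack/sustain/decay envelope."""
    n_attack = int(attack_s * SAMPLE_RATE)
    n_sustain = int(sustain_s * SAMPLE_RATE)
    n_decay = max(len(samples) - n_attack - n_sustain, 0)
    envelope = np.concatenate([
        np.linspace(0.0, 1.0, n_attack),   # attack ramp
        np.ones(n_sustain),                # sustain plateau
        np.linspace(1.0, 0.0, n_decay),    # decay to silence
    ])
    return samples[: len(envelope)] * envelope

def perturbed_effect(base_freq=440.0, rng=np.random.default_rng()):
    """Create a variation by randomly perturbing the modifier parameters."""
    freq = base_freq * rng.uniform(0.9, 1.1)   # jittered pitch
    attack = rng.uniform(0.01, 0.05)           # jittered attack time
    sustain = rng.uniform(0.05, 0.15)          # jittered sustain time
    return apply_envelope(square_wave(freq, duration_s=0.4), attack, sustain)
\end{verbatim}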
Existing Research
Generative Adversarial Networks (GANs) are best known for being able to produce highly realistic images. In comparison to other models such as autoencoders, where the output is evaluated directly against the training data, GANs produce much better fine detail and do not result in blurry images\cite{radford2015unsupervised}. Inspired by the WaveGAN architecture\cite{donahue2018wavegan}, which uses a GAN to generate short audio clips instead of images, we build a model based on the latest research in image synthesis, reapplying techniques from those models to the area of sound effect synthesis.
Unfortunately, GANs are notoriously difficult to train and often fail to converge. Vanishing gradients, mode collapse, and catastrophic forgetting are all common problems plaguing current GAN architectures\cite{Thanh-Tung2018-ev}. The current state-of-the-art method for addressing these issues is the improved Wasserstein GAN training metric (WGAN-GP)\cite{Gulrajani2017-xz}, which works very well to ensure training stability. However, when trained on multi-modal datasets, WGAN-GP can still fail to converge because the training metric pushes generated samples towards random real samples, causing the generator to oscillate, pushing generated samples back and forth between different modalities\cite{Thanh-Tung2018-ev}. Nevertheless, in our experiments we find that alternative training metrics fail to converge, and WGAN-GP is required in order to ensure stable training.
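For reference, the following is a minimal PyTorch-style sketch of the WGAN-GP gradient penalty term, which pushes the critic's gradient norm towards 1 on random interpolations between real and generated samples. The penalty weight of 10 follows the common convention from \cite{Gulrajani2017-xz}; the variable names are ours and this is not our exact training code.

\begin{verbatim}
import torch

def gradient_penalty(discriminator, real, fake, lambda_gp=10.0):
    """WGAN-GP penalty on interpolations between real and fake samples."""
    batch_size = real.size(0)
    # Random interpolation coefficients, broadcast over sample dimensions.
    eps = torch.rand(batch_size, *([1] * (real.dim() - 1)), device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = discriminator(interp)
    grads = torch.autograd.grad(
        outputs=scores.sum(), inputs=interp, create_graph=True
    )[0]
    grad_norm = grads.view(batch_size, -1).norm(2, dim=1)
    # Penalise deviation of the critic's gradient norm from 1.
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
\end{verbatim}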
Several recent papers, such as Progressive GAN\cite{karras2017progressive}, have found it beneficial to split the generative problem into smaller, easier-to-solve generative problems that build up to solve a harder one. Progressive GAN\cite{karras2017progressive} starts by training a very shallow network to generate low-resolution images, then gradually adds stages to the existing model to generate higher- and higher-resolution images. The key is that different parts of the network are trained to solve different problems. Each stage of Progressive GAN\cite{karras2017progressive} is explicitly trained to generate images at a higher resolution from images at a slightly lower resolution, which is a much simpler problem than going directly from a low-dimensional latent vector to the high-dimensional output space of an image. By training a network to solve small subproblems that can be combined to solve a larger problem, the network is able to learn to generate extremely convincing results at high resolutions.
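A highly simplified sketch of the progressive-growing idea is given below, written for 1-D audio rather than images. The block sizes are placeholders rather than the hyperparameters of \cite{karras2017progressive}, and the smooth fade-in that Progressive GAN uses when blending in a new stage is omitted for brevity.

\begin{verbatim}
import torch
import torch.nn as nn

class GrowingGenerator(nn.Module):
    """Generator that starts shallow and gains upsampling stages over time."""

    def __init__(self, latent_dim=128, channels=64):
        super().__init__()
        self.channels = channels
        self.base = nn.Sequential(            # low-resolution starting stage
            nn.Linear(latent_dim, channels * 16),
            nn.LeakyReLU(0.2),
        )
        self.stages = nn.ModuleList()         # higher-resolution stages, added later
        self.to_audio = nn.Conv1d(channels, 1, kernel_size=1)

    def add_stage(self):
        """Append one upsampling stage once the current resolution has converged."""
        self.stages.append(nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv1d(self.channels, self.channels, kernel_size=9, padding=4),
            nn.LeakyReLU(0.2),
        ))

    def forward(self, z):
        x = self.base(z).view(z.size(0), self.channels, 16)
        for stage in self.stages:             # each stage doubles the resolution
            x = stage(x)
        return torch.tanh(self.to_audio(x))
\end{verbatim}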
Splitting the generative problem into multiple easier-to-solve problems has also been used to address the mode collapse problem\cite{hoang2017multi}\cite{Park2018-ez}\cite{ghosh2017multi}. These networks all use very similar architectures. The basic idea is to use multiple generators instead of a single generator, with each generator trained to produce a specific modality from the dataset. This is done by having the discriminator classify both whether an image is real and which generator produced it. In this way, each generator is trained to diversify itself from the other generators, such that the discriminator can easily tell them apart. To reduce complexity, these networks often share weights between the different generators (usually all weights except those in the final layers), although the authors of MADGAN note that forgoing weight sharing allows generation from more diverse datasets\cite{ghosh2017multi}. While we did not evaluate the use of multiple generators in our model, we believe these models will be an important aspect of future research, allowing generative networks to cover a much broader range of modalities (distinct groupings or categories of data points) than they currently can.
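The shared idea behind these multi-generator setups can be sketched as a discriminator with two heads: one scoring real versus fake, the other guessing which of the generators produced a fake sample. The layer sizes below are illustrative and do not correspond to any one of the cited architectures.

\begin{verbatim}
import torch
import torch.nn as nn

class MultiGeneratorDiscriminator(nn.Module):
    """Two heads: a real/fake score and a generator-identity classification."""

    def __init__(self, num_generators, in_features=1024, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.LeakyReLU(0.2),
        )
        self.real_fake_head = nn.Linear(hidden, 1)               # is the sample real?
        self.which_gen_head = nn.Linear(hidden, num_generators)  # which generator made it?

    def forward(self, x):
        h = self.trunk(x)
        return self.real_fake_head(h), self.which_gen_head(h)

# Each generator is then trained both to fool the real/fake head and to remain
# identifiable to the identity head, which keeps the generators distinct.
\end{verbatim}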
StackGAN++\cite{dimitris2017} also splits the generative problem into multiple pipeline stages, similar to Progressive GAN\cite{karras2017progressive}. However, StackGAN++\cite{dimitris2017} trains the entire pipeline at once, rather than progressively adding stages as the previous stages converge. StackGAN++ is of particular interest to us because it solves a text-to-image synthesis problem, which is very similar to our own text-to-sound-effect synthesis problem. StackGAN++ splits the generative problem into two parts: generating realistic-looking images, and generating images that match a specific condition. The discriminator in this network splits in two at the final layer, with one path outputting a score for condition matching and the other outputting a score for image realism. In this way the discriminator is explicitly trained to tell whether a generated image belongs to the category it was conditioned on\cite{dimitris2017}.
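The sketch below shows this conditional/unconditional split in a minimal form. How the condition embedding is fused with the shared features (a simple concatenation here) is an assumption on our part, not StackGAN++'s exact layer layout.

\begin{verbatim}
import torch
import torch.nn as nn

class ConditionalDiscriminator(nn.Module):
    """Shared trunk with two heads: one scores realism alone,
    the other scores realism given the conditioning vector."""

    def __init__(self, in_features=1024, cond_dim=128, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.LeakyReLU(0.2),
        )
        self.uncond_head = nn.Linear(hidden, 1)           # realism only
        self.cond_head = nn.Linear(hidden + cond_dim, 1)  # realism + condition match

    def forward(self, x, condition):
        h = self.trunk(x)
        uncond_score = self.uncond_head(h)
        cond_score = self.cond_head(torch.cat([h, condition], dim=1))
        return uncond_score, cond_score
\end{verbatim}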