architecture to 1D to work with audio data. We replace the existing 1D DCGAN model with our own model, heavily inspired by Progressive GAN [paper reference].
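As a minimal sketch of what moving the architecture to 1D looks like in practice (PyTorch is assumed here purely for illustration; channel counts and kernel sizes are placeholders, not our actual configuration), the 2D convolutional layers of the image-domain DCGAN are simply swapped for their 1D counterparts:

\begin{verbatim}
import torch.nn as nn

# Hypothetical 1D up-sampling block: the 2D DCGAN layers
# (ConvTranspose2d / BatchNorm2d) become their 1D counterparts
# so the network operates on raw audio samples.
def up_block_1d(in_channels, out_channels):
    return nn.Sequential(
        nn.ConvTranspose1d(in_channels, out_channels,
                           kernel_size=4, stride=2, padding=1),
        nn.BatchNorm1d(out_channels),
        nn.ReLU(inplace=True),
    )
\end{verbatim}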
Conditioning
We also extend the model to support conditioning on text, taking heavy inspiration from StackGAN's [paper reference] text conditioning implementation. StackGAN's model uses text embeddings that are downsampled to 128 dimensions; however, the embedding model used is not specified. We choose ELMo [Reference] for our text embeddings, for the reasons stated in Section \ref{151756}.
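The sketch below shows one way this conditioning step can look (again assuming PyTorch; the 1024-dimensional input size, the linear projection, and the concatenation with the noise vector are illustrative assumptions rather than our exact implementation, and StackGAN's conditioning augmentation is omitted for brevity):

\begin{verbatim}
import torch
import torch.nn as nn

class TextConditioning(nn.Module):
    """Project a sentence embedding down to 128 dimensions and
    concatenate it with the latent noise vector."""
    def __init__(self, embed_dim=1024, cond_dim=128):
        super().__init__()
        self.project = nn.Linear(embed_dim, cond_dim)

    def forward(self, z, text_embedding):
        c = torch.relu(self.project(text_embedding))  # (batch, cond_dim)
        return torch.cat([z, c], dim=1)               # conditioned latent
\end{verbatim}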
Progressive Growing
Progressive growing allows our model to first learn to generate simpler, lower-frequency versions of the sound effects in the dataset, then progressively add higher-frequency detail as the model grows. We implement progressive growing by giving each up and down block an amount, α, by which it is turned on. This amount is determined by the current level of detail (audio LOD) at which the model is being trained (see fig. \ref{471330}). The output of the 'to audio' layer, after the first residual block in the generator, constitutes the first audio LOD and consists of only sixteen samples (fig. \ref{471330}). Each up and down block can be in one of three modes: fully off, in which case we treat the layer as a skip connection, implemented as a simple nearest-neighbor upsample or average downsample for up and down blocks respectively; fully on, in which case the skip connection is ignored and the layer acts as a residual block; or partially on, which occurs during LOD transitions, in which case we linearly interpolate between the output of the skip connection and the output of the residual block. Since the generator and discriminator mirror each other, once a signal has passed the current LOD layer in the generator it essentially skips the remaining generator layers and is passed through to the layer in the discriminator that matches the current LOD. Skipping to the correct discriminator layer works because the generator's up-sampling and the discriminator's down-sampling are inverse operations of one another. See figures \ref{471330} and \ref{801834} for a detailed overview of our architecture.
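A minimal sketch of the blending step for a single generator up block is given below (PyTorch is assumed; the names \texttt{up\_block\_forward}, \texttt{residual\_block}, and \texttt{alpha} are ours, and the residual block is assumed to upsample by a factor of two internally so that both paths produce outputs of the same length):

\begin{verbatim}
import torch.nn.functional as F

def up_block_forward(x, residual_block, alpha):
    """Blend the skip path (nearest-neighbor upsample) with the residual
    path according to how far the block is turned on (0 <= alpha <= 1)."""
    skip = F.interpolate(x, scale_factor=2.0, mode="nearest")  # fully off
    if alpha <= 0.0:
        return skip
    out = residual_block(x)                                    # fully on
    if alpha >= 1.0:
        return out
    # Partially on: linear interpolation during an LOD transition.
    return (1.0 - alpha) * skip + alpha * out
\end{verbatim}

The discriminator's down blocks follow the same pattern, with the nearest-neighbor upsample replaced by an average downsample.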