These networks both use very similar architectures. The basic idea is to use multiple generators instead of a single one, with each generator trained to produce a specific modality from the dataset. This is achieved by having the discriminator classify both whether an image is real and, if it is not, which generator produced it. Each generator is therefore pushed to diversify itself from the other generators, so that the discriminator can easily tell them apart. To reduce complexity, these networks often share weights between the different generators (usually all weights are shared except those in the final layers), although the authors of MADGAN note that forgoing weight sharing allows generation from more diverse datasets.
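To make this concrete, the following is a minimal PyTorch sketch of this multi-generator setup. The class names, layer sizes, image resolution (64x64 RGB), and the choice of NUM_GENERATORS are illustrative assumptions, not the architecture used in either paper.

```python
import torch
import torch.nn as nn

NUM_GENERATORS = 4  # hypothetical number of generators k


class MultiGeneratorDiscriminator(nn.Module):
    """Classifies a 64x64 image as real or as the output of one of
    k generators, i.e. a (k + 1)-way classification problem."""

    def __init__(self, num_generators: int = NUM_GENERATORS):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1),    # 64x64 -> 32x32
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),  # 32x32 -> 16x16
            nn.LeakyReLU(0.2),
            nn.Flatten(),
        )
        # Logits 0..k-1 mean "fake, produced by generator i";
        # logit k means "real".
        self.classifier = nn.Linear(128 * 16 * 16, num_generators + 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))


class SharedGenerators(nn.Module):
    """k generators that share every layer except their final one."""

    def __init__(self, latent_dim: int = 100,
                 num_generators: int = NUM_GENERATORS):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(latent_dim, 128 * 16 * 16),
            nn.ReLU(),
            nn.Unflatten(1, (128, 16, 16)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),  # -> 32x32
            nn.ReLU(),
        )
        # One unshared output layer per generator.
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),  # -> 64x64
                nn.Tanh(),
            )
            for _ in range(num_generators)
        ])

    def forward(self, z: torch.Tensor, generator_idx: int) -> torch.Tensor:
        return self.heads[generator_idx](self.shared(z))
```

Under this formulation, training the discriminator reduces to an ordinary cross-entropy loss over the k + 1 classes (label i for images from generator i, label k for real images), while each generator is updated to push the discriminator's prediction towards the "real" class.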
StackGAN++ also splits the generative problem into multiple pipeline stages, similar to ProgressiveGAN. However, StackGAN++ trains the entire pipeline at once, rather than progressively adding stages as each stage converges. StackGAN++ is of particular interest because it solves a text-to-image problem, which is similar to our text-to-sfx problem. StackGAN++ decomposes the generative task into two sub-problems: generating realistic-looking images, and generating images that match a specific condition. The discriminator splits in two at its final layer: one path outputs a score for whether the image looks real, while the other mixes in the condition and outputs a score for whether the image is real and matches the condition. In this way the discriminator is explicitly trained to tell whether a generated image belongs to the category it was conditioned on. However, unlike MADGAN, it cannot tell which specific modality a generated image belongs to. If multiple generators were used, as in MADGAN, this would likely cause all the generators to learn the same output distribution that maximally fools the discriminator, rather than diversifying into multiple modalities.
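The sketch below illustrates this two-headed discriminator in the same style as the previous example. It assumes 64x64 RGB images and a pre-computed condition embedding of size cond_dim; the class name and layer sizes are again illustrative assumptions rather than StackGAN++'s actual architecture.

```python
import torch
import torch.nn as nn


class TwoHeadConditionalDiscriminator(nn.Module):
    """Shared feature extractor whose final layer splits into an
    unconditional real/fake head and a conditional head that mixes
    in the condition embedding."""

    def __init__(self, cond_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1),    # 64x64 -> 32x32
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),  # 32x32 -> 16x16
            nn.LeakyReLU(0.2),
        )
        # Path 1: does the image look real at all?
        self.uncond_head = nn.Conv2d(128, 1, kernel_size=16)
        # Path 2: is the image real AND does it match the condition?
        self.cond_head = nn.Sequential(
            nn.Conv2d(128 + cond_dim, 128, kernel_size=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, kernel_size=16),
        )

    def forward(self, image: torch.Tensor,
                condition: torch.Tensor) -> tuple:
        h = self.features(image)  # (B, 128, 16, 16)
        # Tile the condition embedding over the spatial grid and
        # concatenate it with the image features.
        c = condition.view(condition.size(0), -1, 1, 1)
        c = c.expand(-1, -1, h.size(2), h.size(3))
        real_logit = self.uncond_head(h).view(-1)
        match_logit = self.cond_head(torch.cat([h, c], dim=1)).view(-1)
        return real_logit, match_logit
```

In a setup like this, real images paired with their true conditions receive positive labels on both heads, while generated images receive negative labels, so the generator is rewarded separately for realism and for matching the condition. Note that neither head identifies a specific generator, which is exactly why this discriminator cannot enforce the kind of inter-generator diversity that MADGAN's (k + 1)-way classifier does.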