Architecture overview of the discriminator and generator networks. The α terms in the up and down blocks gradually enable each audio LOD: when α is zero, a block acts as a simple skip connection, and when α is one, it becomes a residual layer. The number on each node gives the output dimensions of that layer. z is a 128-dimensional random noise vector sampled from the standard normal distribution. c is the conditional text embedding described in section \ref{940853}.
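
The α gating described above can be illustrated with a minimal sketch. It assumes the blend takes the form y = x + α·f(x), which matches the two endpoints stated in the caption (pure skip at α = 0, residual layer at α = 1) but is not confirmed by it as the exact formula. The names `gated_block`, `toy_branch`, and `sample_latent` are hypothetical, and the resampling that a real up or down block would apply on the skip path is omitted.

```python
import random


def gated_block(x, residual_fn, alpha):
    """Blend a skip path with a residual branch: y = x + alpha * f(x).

    Assumed form: alpha = 0 returns x unchanged (skip connection),
    alpha = 1 returns x + f(x) (residual layer).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    fx = residual_fn(x)
    return [xi + alpha * fi for xi, fi in zip(x, fx)]


def toy_branch(x):
    # Hypothetical stand-in for the learned layers inside a block.
    return [0.5 * xi for xi in x]


def sample_latent(dim=128, rng=random):
    # z: each component drawn independently from the standard normal.
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


if __name__ == "__main__":
    z = sample_latent()
    for alpha in (0.0, 0.5, 1.0):
        y = gated_block(z, toy_branch, alpha)
        print(alpha, len(y))
```

Raising α from 0 to 1 over the course of training would fade the new branch in smoothly under this assumed form, so the network output does not change abruptly when a new LOD is enabled.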