High-Dimensional Biomarker Panel Joint Distribution Simulations
We developed and evaluated a GAN for higher-dimensional distributions.
Dataset and Data Pre-Processing: For these experiments, the
joint distribution of 14 of the 16 diabetes-relevant biomarkers from the
univariate setting was assessed.
High-sensitivity C-reactive protein (hs-CRP) and ferritin were excluded
from the list of biomarkers: ferritin because of its limited sample
size, and hs-CRP because assay methodologies changed across the NHANES
data sets.
GAN Architecture: The architecture of the conditional GAN was
based on Xu et al. 12 for tabular data.
Two fully connected hidden layers of size 256 were used in both the
generator and the discriminator. In the generator, batch normalization and
a ReLU activation function were applied after each fully connected layer. A
variational Gaussian mixture model was used to identify the modality of
the data and to apply mode-specific normalization. After the two hidden
layers, the synthetic row representation was generated. The scalar values
of this representation were generated using a tanh activation, while the
mode indicator and discrete values were generated using Gumbel softmax.
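The two encoding steps described above can be illustrated with a minimal sketch in plain Python. It assumes the formulation used for tabular GANs by Xu et al., in which a continuous value is scaled by four standard deviations of its assigned mode so that it falls in the range of tanh. The mixture parameters, the choice of the most likely mode (rather than sampling a mode), and the temperature `tau=0.2` are illustrative assumptions, not values reported here.

```python
import math
import random


def normalize_by_mode(x, modes):
    """Mode-specific normalization of one continuous value.

    modes: list of (weight, mean, std) tuples for a single biomarker.
    The tuples stand in for a fitted Gaussian mixture (hypothetical values).
    Returns (alpha, beta): a scalar in [-1, 1] and a one-hot mode indicator.
    """
    # Weighted Gaussian density of x under each mode.
    dens = [
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, m, s in modes
    ]
    # Simplification: pick the most likely mode instead of sampling one.
    k = max(range(len(modes)), key=lambda i: dens[i])
    _, mean, std = modes[k]
    # Scale by 4 standard deviations and clip to the tanh output range.
    alpha = max(-1.0, min(1.0, (x - mean) / (4.0 * std)))
    beta = [1.0 if i == k else 0.0 for i in range(len(modes))]
    return alpha, beta


def gumbel_softmax(logits, tau=0.2, rng=random):
    """Draw a relaxed one-hot sample from categorical logits."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    top = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical two-mode mixture for a single biomarker.
example_modes = [(0.7, 5.4, 0.4), (0.3, 8.0, 1.2)]
alpha, beta = normalize_by_mode(5.5, example_modes)
mode_probs = gumbel_softmax([0.1, 2.0, -1.0], rng=random.Random(0))
```

In this sketch the scalar `alpha` corresponds to the tanh-activated output and `beta` to the mode indicator that the generator would emit through Gumbel softmax.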
In the discriminator, we used the leaky ReLU activation function and
dropout on each hidden layer. The PacGAN framework, with 10 samples in
each pack, was used to reduce mode collapse 13.
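As a rough illustration of packing, the sketch below concatenates every 10 consecutive rows of a batch into one longer input vector, so that the discriminator judges groups of samples jointly rather than one row at a time. The helper name `pack_rows` is hypothetical, and the row width of 14 is a placeholder: after mode-specific encoding the actual rows would be wider.

```python
def pack_rows(rows, pac=10):
    """Concatenate every `pac` consecutive rows into one discriminator input."""
    if len(rows) % pac != 0:
        raise ValueError("batch size must be divisible by the pack size")
    packed = []
    for start in range(0, len(rows), pac):
        flat = []
        for row in rows[start:start + pac]:
            flat.extend(row)
        packed.append(flat)
    return packed


# Placeholder batch: 300 rows of width 14 (real encoded rows would be wider).
batch = [[float(i)] * 14 for i in range(300)]
packed_batch = pack_rows(batch, pac=10)
```

With a batch of 300 rows and packs of 10, the discriminator would see 30 packed inputs per batch, each 10 times the width of a single row.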
The model was trained for 1000 epochs with a batch size of 300 and five
discriminator steps.
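To make the scale of this schedule concrete, the following sketch counts parameter updates under two assumptions that are not stated here: that "five discriminator steps" means five discriminator updates per generator update, and that each epoch consists of one generator update per full batch. The training-set size `n_rows=3000` is a hypothetical placeholder.

```python
def training_schedule(n_rows, epochs=1000, batch_size=300, discriminator_steps=5):
    """Count updates for a GAN schedule (assumed interpretation, see text)."""
    # Assumption: incomplete trailing batches are dropped, with at least one batch.
    batches_per_epoch = max(n_rows // batch_size, 1)
    generator_updates = epochs * batches_per_epoch
    # Assumption: several discriminator updates precede each generator update.
    discriminator_updates = generator_updates * discriminator_steps
    return {
        "batches_per_epoch": batches_per_epoch,
        "generator_updates": generator_updates,
        "discriminator_updates": discriminator_updates,
    }


# Hypothetical training-set size; the actual size is not given in this section.
schedule = training_schedule(n_rows=3000)
```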
Data Analysis: For visualization, t-distributed stochastic
neighbor embedding (t-SNE), uniform manifold approximation and
projection (UMAP) and principal components analysis (PCA) were used to
obtain two-dimensional projections of the 14-dimensional data. The Rtsne
and umap packages and the prcomp function in R were
used. The perplexity and theta hyperparameters were set to 50 and 0.5,
respectively, for t-SNE. The ggpairs function (GGally package) was used to generate
pairs panel plots containing univariate densities, bivariate scatter
plots and Spearman rank correlations for the test data and GAN-generated
distributions. Seven of the 14 biomarkers were assessed in pairs panel
plots to keep the number and size of the bivariate plots amenable to
visual interpretation.
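For readers unfamiliar with the correlation reported in the pairs panels, the sketch below computes a Spearman rank correlation as the Pearson correlation of average ranks. It is a plain-Python illustration of the statistic, not the R code used for the analysis, and the biomarker values are made up for the example.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up values for two biomarkers (illustrative only, not NHANES data).
glucose = [88.0, 95.0, 102.0, 130.0, 160.0]
hba1c = [5.1, 5.3, 5.6, 6.4, 7.9]
rho = spearman(glucose, hba1c)
```

Comparing such a coefficient between the test data and the GAN-generated data, pair by pair, gives a numeric counterpart to the visual comparison of the scatter plots.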