FIGURE LEGENDS
Figure 1. Schematic of the generative adversarial network (GAN)
method. A GAN consists of two neural networks: the generator and the
discriminator. The generator takes random variables from a latent space
as input and computes generated data via its neural network. The
discriminator takes the training data containing biomarkers and the
generated data from the generator as inputs. The neural network in the
discriminator is a binary classifier; its outputs are used to compute
the generator and discriminator loss functions, which in turn update the
generator and discriminator neural networks via backpropagation.
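The loss computation described in the Figure 1 legend can be sketched as
follows. This is a minimal illustration that assumes the standard binary
cross-entropy GAN objective with a non-saturating generator loss; the
legend does not state which loss functions were used, and the function
names here are hypothetical.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy for the discriminator (assumed loss form).

    d_real: discriminator probabilities for training samples (target 1).
    d_fake: discriminator probabilities for generated samples (target 0).
    All probabilities must lie strictly between 0 and 1.
    """
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating generator loss (assumed): reward fooling the classifier."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)
```

In a full training loop, the gradients of these two scalars with respect
to the network weights would drive the backpropagation updates.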
Figure 2. Comparison of the probability density histogram
of a representative generated data set from the generative adversarial
network (teal bars) with the probability density histogram of the test
data (salmon bars) from the univariate analyses of 8 diabetes-associated
biomarkers. The dark gray bars correspond to the regions of overlap
between the two probability density histograms. Eight biomarkers are
shown: urine albumin (Figure 2A), urine creatinine (Figure 2B), fasting
glucose (Figure 2C), insulin (Figure 2D), body mass index (Figure 2E),
glycohemoglobin (Figure 2F), triglyceride (Figure 2G), and total
cholesterol (Figure 2H). The x-axes on all graphs are biomarker
levels that are log-transformed and scaled to lie between -1 and 1. The
p-values from the Kolmogorov-Smirnov test are shown on the top
left.
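The two-sample Kolmogorov-Smirnov statistic behind the p-values in
Figure 2 can be sketched as follows. This is an illustrative
implementation of the statistic only, not the authors' code; in practice
a library routine such as scipy.stats.ks_2samp would return both the
statistic and the p-value. The function name is hypothetical.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    # Both empirical CDFs are step functions that change only at observed
    # values, so the maximum gap is attained at one of the data points.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )
```

A statistic near 0 indicates that the generated and test distributions
are close; a statistic of 1 means the two samples do not overlap at all.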
Figure 3. The t-distributed stochastic neighbor embedding (t-SNE,
Figure 3A), uniform manifold approximation and projection (UMAP, Figure
3B), and principal component analysis (PCA, Figure 3C) two-dimensional
projections of the
14-dimensional, diabetes-associated biomarkers data. The test data
results are shown in salmon circles and the GAN-generated results are in
teal circles. The x-axis (t-SNE X and UMAP X) and y-axis
(t-SNE Y and UMAP Y) correspond to the t-SNE and UMAP projections into two
dimensions of the input of 14-dimensional biomarker levels that are
log-transformed and scaled to lie between -1 and 1. The PC 1 and PC 2 on
the x-axis and y-axis of Figure 3C correspond to the first
and second principal components, respectively. Figure 3D is a pairs
panel that compares the univariate and bivariate GAN-generated
distributions (teal circles) to the test data (salmon circles). The
diagonal contains the univariate density for the GAN-generated and test
data distributions. The area of overlap is shaded dark gray. The upper
triangular region contains the Spearman bivariate correlation
coefficients for the test (salmon font) and GAN-generated distributions
(teal font). Only 7 of the 14 variables are shown. All variables were
log-transformed and scaled to lie in the range [-1, 1]. Abbreviations:
ALB, urine albumin; CRE, urine creatinine; GLU, fasting glucose; INS,
insulin; BMI, body mass index; GLHB, glycohemoglobin; TG, triglyceride.
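The Spearman coefficients reported in the upper triangle of Figure 3D
are Pearson correlations computed on ranks. A minimal sketch, assuming
tied values receive their average rank (the legend does not say how ties
were handled) and using hypothetical function names:

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

Because only ranks enter the calculation, the coefficient is unchanged
by the log transformation and rescaling applied to the biomarkers.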
Figure 4. The t-distributed stochastic neighbor embedding (t-SNE)
two-dimensional projections of the 14-dimensional, diabetes-associated
biomarkers data for the Black, Hispanic, Other and White race
categories. The test data results are shown in salmon circles and the
GAN-generated results are in teal circles. The x-axis (t-SNE X)
and y-axis (t-SNE Y) correspond to the t-SNE projections into two
dimensions of the input of 14-dimensional biomarker levels that are
log-transformed and scaled to lie between -1 and 1.
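The preprocessing named in these legends, a log transformation followed
by scaling to lie between -1 and 1, can be sketched as follows. The
natural logarithm and min-max scaling are assumptions, since the legends
do not specify the log base or the scaling method; the function name is
hypothetical.

```python
import math

def log_scale(values):
    """Log-transform positive values, then min-max scale them to [-1, 1]."""
    logs = [math.log(v) for v in values]  # requires strictly positive input
    lo, hi = min(logs), max(logs)
    if hi == lo:
        raise ValueError("cannot scale a constant variable")
    return [2.0 * (x - lo) / (hi - lo) - 1.0 for x in logs]
```

For example, the values 1, 10, and 100 map to approximately -1, 0, and
1, because they are evenly spaced on the log scale.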
Figure 5. Box plots of the univariate results from
14-dimensional, diabetes-associated biomarkers data for the Black,
Hispanic, Other and White race categories. The test data are shown in
salmon, and the GAN-generated results are in teal. The univariate
results for eight of 14 diabetes-associated biomarkers are shown: urine
albumin (Figure 5A), urine creatinine (Figure 5B), fasting glucose
(Figure 5C), insulin (Figure 5D), body mass index (Figure 5E),
glycohemoglobin (Figure 5F), triglyceride (Figure 5G), and total
cholesterol (Figure 5H). The y-axes on all graphs are biomarker
levels that are log-transformed and scaled to lie between -1 and 1. The
lines on each box correspond to the 25th percentile, median, and 75th
percentile; the error bars correspond to the median ± 1.5 interquartile
range; and the outliers are shown as black circles.
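The box-plot quantities in the Figure 5 legend can be computed as in the
sketch below. It follows the legend's wording for the error bars (median
± 1.5 interquartile range); many plotting libraries instead place the
limits 1.5 interquartile ranges beyond the quartiles, so the convention
is worth checking against the plotting code. The quartile interpolation
method and the function name are assumptions.

```python
import statistics

def box_stats(values):
    """Quartiles, error-bar limits, and outliers for one box."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Error-bar limits as worded in the legend: median +/- 1.5 * IQR.
    lower = median - 1.5 * iqr
    upper = median + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return {
        "q1": q1, "median": median, "q3": q3,
        "lower": lower, "upper": upper, "outliers": outliers,
    }
```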