Identifiable signals in genotypic and genetic descriptors,
inferences and machine learning
Our second objective was to test for the ability of genotypic and
genetic descriptors to estimate specific rates of clonality. These
descriptors were commonly used in previous studies to roughly assess the
importance of clonality in determining population reproductive modes,
but no theoretical development has demonstrated the existence of
identifiable signals allowing such descriptors to be used as key
parameters with which to estimate rates of clonality. To assess the
existence of identifiable signals in these descriptors and demonstrate
their potential usefulness in inferring rates of clonality for one
episode of genotyping, we used the results obtained from the simulations
as classifiers to train a Bayesian supervised learning algorithm. We
used the simulation results to compute the approximate nonparametric
probability distributions of the genotypic and genetic descriptors
(i.e., the seven features \(\varphi_{7}=\left[R,\beta_{p},\overline{r}_{d},Mean\left[F_{\text{IS}}\right],Var\left[F_{\text{IS}}\right],Skew\left[F_{\text{IS}}\right],Kurt\left[F_{\text{IS}}\right]\right]\))
with combinations of Gaussian kernels under known rates of clonality,
resulting in a classifier with 12 classes (one class for each
rate of clonality to be inferred: c=0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6,
0.7, 0.8, 0.9, 0.99 and 1), hereafter referred to as \(C_{12}\).
\begin{equation}
L\left(\varphi_{7}\middle|C_{12}\right)=L\left(R,\beta_{p},\overline{r}_{d},Mean\left[F_{\text{IS}}\right],Var\left[F_{\text{IS}}\right],Skew\left[F_{\text{IS}}\right],Kurt\left[F_{\text{IS}}\right]\middle|N,c,u\right)\nonumber
\end{equation}
Provided that dependencies between the seven genotypic and genetic
descriptors are evenly distributed or cancel each other out or that
their distributions sufficiently segregate over their means per class,
we can approximate the joint probability model using the conditional
independence between features (Hand & Yu, 2001; Webb, Boughton, &
Wang, 2005; Zhang, 2004).
The posterior probability of the \(i\)th class, given that the seven measured
features are known, can be expressed as the product of the seven
likelihoods of each feature weighted by the prior probability of the
class.
\begin{equation}
P\left(C_{i}\middle|\varphi_{7}\right)\propto p\left(C_{i}\right)\cdot\prod_{j=1}^{7}{L\left(\varphi_{j}\middle|C_{i}\right)}\nonumber
\end{equation}
From this joint posterior probability, we identified the maximum a
posteriori (MAP) to discern the class (“rate of clonality” and
“population size” pair) most likely to explain the measured features.
\begin{equation}
MAP=\underset{C_{i}}{\operatorname{arg\,max}}\left[p\left(C_{i}\right)\cdot\prod_{j=1}^{7}{L\left(\varphi_{j}\middle|C_{i}\right)}\right]\nonumber
\end{equation}
We assumed a uniform prior distribution, i.e., equiprobability
for each class \(p\left(C_{i}\right)=1/12\), to place the algorithm
in an initial state of complete ignorance of the likely values that the
two parameters might take.
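The classification scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names (`bandwidth`, `log_kde`, `train`, `posterior`, `map_class`) are hypothetical, the bandwidth follows a normal-reference rule of thumb because the text does not specify one, the toy training values are random draws rather than simulation output, and the product of the seven likelihoods is evaluated as a sum of logarithms to avoid numerical underflow.

```python
import math
import random

# The 12 rates of clonality used as classes in the text.
CLASSES = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1.0]
N_FEATURES = 7


def bandwidth(samples):
    """Normal-reference rule of thumb (assumption: no bandwidth is given in the text)."""
    n = len(samples)
    mean = sum(samples) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in samples) / (n - 1))
    return max(1.06 * sd * n ** -0.2, 1e-9)


def log_kde(x, samples, h):
    """Log density at x under a Gaussian kernel density estimate."""
    exponents = [-0.5 * ((x - s) / h) ** 2 for s in samples]
    top = max(exponents)  # log-sum-exp trick for numerical stability
    log_sum = top + math.log(sum(math.exp(e - top) for e in exponents))
    return log_sum - math.log(len(samples) * h * math.sqrt(2.0 * math.pi))


def train(training):
    """Store, for each class and each feature, the samples and their bandwidth."""
    model = {}
    for label, rows in training.items():
        columns = list(zip(*rows))  # one tuple of values per feature
        model[label] = [(col, bandwidth(col)) for col in columns]
    return model


def posterior(model, features):
    """Normalised posterior over classes, uniform prior, independent features."""
    log_prior = -math.log(len(model))
    scores = {}
    for label, kdes in model.items():
        log_lik = sum(log_kde(x, col, h) for x, (col, h) in zip(features, kdes))
        scores[label] = log_prior + log_lik
    top = max(scores.values())
    weights = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


def map_class(model, features):
    """Class with the maximum a posteriori probability."""
    post = posterior(model, features)
    return max(post, key=post.get)


# Toy training set (NOT simulation output): feature j of class c is drawn
# around c * (j + 1), with 100 replicates per class as in the text.
random.seed(1)
training = {
    c: [[random.gauss(c * (j + 1), 0.02) for j in range(N_FEATURES)]
        for _ in range(100)]
    for c in CLASSES
}
model = train(training)
observed = [0.5 * (j + 1) for j in range(N_FEATURES)]
print(map_class(model, observed))
```

Because the prior is uniform, it shifts every class score by the same constant and does not affect which class is selected; it is kept in the sketch only to mirror the formula.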
We built training and test databases of 100 and 30 replicates per rate of clonality and population size pair, respectively. We
explored by cross-validation whether there were enough identifiable
signals in the features of our classifier \(C_{12}\) to infer the true
rates of clonality with known values of only population genotypic
(\(R,\beta_{p}\)) and genetic
(\(F_{\text{IS}},\overline{r}_{d}\)) indices alone and in
combination. Posterior distributions of the thirty test pseudo-observed
datasets per rate of clonality and population size pair
were combined to plot the results.
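As a rough illustration of this evaluation step, the sketch below summarises the posteriors of several test replicates that share the same true class. The text does not say how the posteriors were combined, so class-wise averaging is an assumption here, as are the helper names `combine_posteriors` and `map_accuracy`; the three posterior vectors are made-up numbers for illustration only.

```python
def combine_posteriors(posteriors):
    """Average posterior dicts {class: probability} class by class (assumed rule)."""
    n = len(posteriors)
    return {c: sum(p[c] for p in posteriors) / n for c in posteriors[0]}


def map_accuracy(posteriors, true_class):
    """Fraction of replicates whose MAP class equals the true class."""
    hits = sum(1 for p in posteriors if max(p, key=p.get) == true_class)
    return hits / len(posteriors)


# Made-up posteriors for three test replicates simulated at c = 0.5,
# restricted to three classes to keep the example short.
test_posteriors = [
    {0.4: 0.2, 0.5: 0.7, 0.6: 0.1},
    {0.4: 0.1, 0.5: 0.8, 0.6: 0.1},
    {0.4: 0.6, 0.5: 0.3, 0.6: 0.1},
]
combined = combine_posteriors(test_posteriors)
accuracy = map_accuracy(test_posteriors, true_class=0.5)
print(combined, accuracy)
```

The averaged posterior shows how probability mass spreads over neighbouring rates of clonality, whereas the MAP accuracy only counts exact hits; the two summaries can disagree when errors concentrate on adjacent classes.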
Results
We first explored the results at
equilibrium to understand the influence of clonality on \(R\), Pareto \(\beta\), LD measured as \(\overline{r}_{d}\), and the mean, variance,
skewness and kurtosis of \(F_{\text{IS}}\) at three population sizes
(\(N=10^{5}\), Figure 1; \(N=10^{3}\), Figure S1a; and \(N=10^{4}\), Figure S1b) and then examined the
evolutionary dynamics of the parameters over generations to determine
the effect of clonality at different time steps and quantify the time
needed to converge towards stationary values (Figures 2 and S2). We
assessed which genotypic and genetic parameters produced the most
identifiable signal, allowing accurate inferences (Figures 3, S3 and
S4). Finally, we approached the issue of sampling strategy to determine
its effects on the accuracy of estimates for datasets obtained from
natural populations (Figures 4 and S2).