Identifiable signals in genotypic and genetic descriptors, inferences and machine learning
Our second objective was to test whether genotypic and genetic descriptors can be used to estimate specific rates of clonality. These descriptors were commonly used in previous studies to roughly assess the importance of clonality in population reproductive modes, but no theoretical development has demonstrated that they carry identifiable signals that would justify using them to estimate rates of clonality. To assess the existence of identifiable signals in these descriptors and demonstrate their potential usefulness in inferring rates of clonality from a single episode of genotyping, we used the simulation results to train a Bayesian supervised learning algorithm. Specifically, we computed from the simulations the approximate nonparametric probability distributions of the genotypic and genetic descriptors (i.e., the seven features \(\varphi_{7}=\left[R,\beta_{p},\overline{r}_{d},Mean\left[F_{\text{IS}}\right],Var\left[F_{\text{IS}}\right],Skew\left[F_{\text{IS}}\right],Kurt\left[F_{\text{IS}}\right]\right]\)) with combinations of Gaussian kernels under known rates of clonality, resulting in a classifier with 12 classes (one class for each rate of clonality to be inferred: \(c\) = 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99 and 1), hereafter referred to as \(C_{12}\).
\begin{equation} L\left(\varphi_{7}\middle|C_{12}\right)=L\left(R,\beta_{p},\overline{r}_{d},Mean\left[F_{\text{IS}}\right],Var\left[F_{\text{IS}}\right],Skew\left[F_{\text{IS}}\right],Kurt\left[F_{\text{IS}}\right]\middle|N,c,u\right)\nonumber \\ \end{equation}
Provided that dependencies between the seven genotypic and genetic descriptors are evenly distributed or cancel each other out, or that their distributions segregate sufficiently over their means per class, we can approximate the joint probability model by assuming conditional independence between features (Hand & Yu, 2001; Webb, Boughton, & Wang, 2005; Zhang, 2004). The posterior probability of the \(i\)th class, given the seven measured features, can then be expressed as the product of the seven likelihoods of each feature weighted by the prior probability of the class.
\begin{equation} P\left(C_{i}\middle|\varphi_{7}\right)=p\left(C_{i}\right)\cdot\prod_{j=1}^{7}{L\left(\varphi_{j}\middle|C_{i}\right)}\nonumber \\ \end{equation}
From this joint posterior probability, we identified the maximum a posteriori (MAP) to discern the class (“rate of clonality” and “population size” pair) most likely to explain the measured features.
\begin{equation} MAP=\underset{C_{i}}{\operatorname{arg\,max}}\left[p\left(C_{i}\right)\cdot\prod_{j=1}^{7}{L\left(\varphi_{j}\middle|C_{i}\right)}\right]\nonumber \\ \end{equation}
We assumed a uniform prior distribution, i.e., equiprobability for each class \(p\left(C_{i}\right)=1/12\), to place the algorithm in an initial state of complete ignorance of the likely values that the two parameters might take.
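The classification rule described above can be illustrated with a minimal sketch of a naive Bayes classifier whose per-feature likelihoods are Gaussian kernel density estimates, combined with a uniform prior and a MAP decision. This is not the implementation used in the study: the function names, the normal-reference bandwidth rule, and the toy data (two features and three classes instead of seven features and twelve classes) are all assumptions made for illustration.

```python
import math
import random
import statistics


def rule_of_thumb_bandwidth(samples):
    """Normal-reference bandwidth (an assumption; the text does not specify one)."""
    spread = statistics.stdev(samples)
    return max(1.06 * spread * len(samples) ** (-0.2), 1e-6)


def kde_log_density(x, samples, bandwidth):
    """Log of a Gaussian kernel density estimate evaluated at x."""
    exponents = [-0.5 * ((x - s) / bandwidth) ** 2 for s in samples]
    top = max(exponents)  # log-sum-exp trick for numerical stability
    log_kernel_sum = top + math.log(sum(math.exp(e - top) for e in exponents))
    return log_kernel_sum - math.log(len(samples) * bandwidth * math.sqrt(2 * math.pi))


def train(training):
    """training maps each class label to a list of feature vectors.

    Returns, per class, one (samples, bandwidth) pair per feature.
    """
    model = {}
    for label, rows in training.items():
        columns = list(zip(*rows))
        model[label] = [(list(col), rule_of_thumb_bandwidth(col)) for col in columns]
    return model


def posterior(model, features):
    """Normalised posterior over classes under conditional independence."""
    log_prior = -math.log(len(model))  # uniform prior: p(C_i) = 1 / number of classes
    log_post = {}
    for label, per_feature in model.items():
        log_likelihood = sum(
            kde_log_density(x, samples, bw)
            for x, (samples, bw) in zip(features, per_feature)
        )
        log_post[label] = log_prior + log_likelihood
    top = max(log_post.values())
    weights = {label: math.exp(v - top) for label, v in log_post.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


def map_class(model, features):
    """Class with the maximum a posteriori probability."""
    post = posterior(model, features)
    return max(post, key=post.get)


# Toy training data (synthetic, NOT simulation output): three hypothetical
# classes with well-separated feature means and 100 replicates each.
rng = random.Random(0)
toy_means = {0.0: (0.0, 0.0), 0.5: (5.0, 5.0), 1.0: (10.0, 10.0)}
toy_training = {
    c: [[rng.gauss(m, 1.0) for m in means] for _ in range(100)]
    for c, means in toy_means.items()
}
model = train(toy_training)
toy_posterior = posterior(model, [5.2, 4.7])
toy_map = map_class(model, [5.2, 4.7])
```

Working in log space avoids numerical underflow when the likelihoods of several features are multiplied, and normalising the weights at the end turns the product of prior and likelihoods into a proper probability distribution over classes.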
We built training and test databases of 100 and 30 replicates per rate of clonality and population size pair, respectively. We explored by cross-validation whether there were enough identifiable signals in the features of our classifier \(C_{12}\) to infer the true rates of clonality from known values of population genotypic (\(R,\beta_{p}\)) and genetic (\(F_{\text{IS}},\overline{r}_{d}\)) indices, alone and in combination. Posterior distributions of the thirty test pseudo-observed datasets per rate of clonality and population size pair were combined to plot the results.
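As a hedged illustration of this evaluation step, the sketch below combines the posterior distributions of several test datasets that share the same true class and reports how often the MAP class matches it. The text does not state how the posteriors were combined, so averaging is an assumption here, as are the function names and the toy posterior values.

```python
from collections import defaultdict


def combine_posteriors(posteriors):
    """Average posterior dicts {class: probability} (averaging is an assumption)."""
    combined = defaultdict(float)
    for post in posteriors:
        for label, probability in post.items():
            combined[label] += probability / len(posteriors)
    return dict(combined)


def map_accuracy(posteriors, true_label):
    """Fraction of test datasets whose MAP class equals the true class."""
    hits = sum(1 for post in posteriors if max(post, key=post.get) == true_label)
    return hits / len(posteriors)


# Toy posteriors (invented values) for three test datasets whose true class is 0.0.
test_posteriors = [
    {0.0: 0.7, 0.5: 0.2, 1.0: 0.1},
    {0.0: 0.6, 0.5: 0.3, 1.0: 0.1},
    {0.0: 0.2, 0.5: 0.5, 1.0: 0.3},
]
combined = combine_posteriors(test_posteriors)
accuracy = map_accuracy(test_posteriors, true_label=0.0)
```

In a setting like the one described, the list would hold thirty posteriors per class rather than three, and the combined distribution for each true class is what would be plotted.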
Results
We first explored the results at equilibrium to understand the influence of clonality on \(R\), Pareto \(\beta\), LD measured as \(\overline{r}_{d}\), and the mean, variance, skewness and kurtosis of \(F_{\text{IS}}\) at three population sizes (\(N=10^{5}\), Figure 1; \(N=10^{3}\), Figure S1a and \(N=10^{4}\), Figure S1b). We then examined the evolutionary dynamics of the parameters over generations to determine the effect of clonality at different time steps and to quantify the time needed to converge towards stationary values (Figures 2 and S2). We assessed which genotypic and genetic parameters produced the most identifiable signal, allowing accurate inferences (Figures 3, S3 and S4). Finally, we approached the issue of sampling strategy to determine its effects on the accuracy of estimates for datasets obtained from natural populations (Figures 4 and S2).