2.6 Demographic modeling and analysis of gene flow
To test for present day migration and historical gene flow between our populations, we used the R package delimitR (Smith & Carstens, 2020; https://github.com/meganlsmith/delimitR). This program uses a binned multidimensional folded site frequency spectrum (bSFS; Smith, Ruffley, Tank, Sullivan, & Carstens, 2017) and a random forest machine learning algorithm to compare speciation models such as no divergence, divergence with and without gene flow, and divergence with secondary contact (Smith & Carstens, 2020). A bSFS was used because it stores the observed frequencies of the minor alleles for multiple populations and bins them to avoid inference problems associated with sampling too few segregating sites (Smith et al., 2017; Terhost & Song, 2015). DelimitR was chosen over more traditional multi-species coalescent methods because of its ability to take historical and current gene flow into account (Leaché, Harris, Rannala, & Yang, 2014; Smith & Carstens, 2020). Demographic histories are simulated using the multi-species coalescent model implemented through fastsimcoal2 (Excoffier, Dupanloup, Huerta-Sánchez, Sousa, & Foll, 2013) under a user-specified guide tree and set of priors on divergence times, population sizes, and migration rates. The random forest classifier then creates a user-defined number of decision trees from a subset of the prior. Each decision tree compares the empirical bSFS to the SFS of each simulated speciation model and votes for the most similar model. The demographic model with the largest number of votes is chosen as the best model. Out-of-bag error rates are used to assess the power of the random forest classifier. The posterior probability of the selected model is then calculated by regressing against the out-of-the-bag error rates following Pudlo et al. (2015).
We created folded multi-dimensional site frequency spectrums for the twoT. blandingii clades and the two Central African T. pulverulenta clades using easySFS (https://github.com/isaacovercast/easySFS), a wrapper for ∂a∂i (Gutenkunst, Hernandez, Williamson, & Bustamante, 2009). The West African T. pulverulenta clade was not included because of the low sample size available for this lineage. We simulated 100,000 data sets under four models: no divergence (Model 1), divergence without gene flow (Model 2), divergence with secondary contact (Model 3), and divergence with gene flow (Model 4). Priors for both models were drawn from uniform distributions for population size: 10000 – 1000000 haploid individuals (twice the number of estimated diploid individuals), divergence time: 20000 –2000000 generations, migration rate: 0.000005–0.005 corresponding to 0.05–5 migrants per generation. We then coarsened our empirical site frequency spectra to 10 bins each. Our out-of-bag error rates were calculated, and 500 random forest classifiers were simulated using 100,000 pseudo-observed data sets for each model. A confusion matrix was calculated to determine how often the correct model was selected and posterior probability for the ”best” model was estimated for each species.