2.6 Demographic modeling and analysis of gene flow
To test for present day migration and historical gene flow between our
populations, we used the R package delimitR (Smith & Carstens, 2020;
https://github.com/meganlsmith/delimitR). This program uses a binned
multidimensional folded site frequency spectrum (bSFS; Smith, Ruffley,
Tank, Sullivan, & Carstens, 2017) and a random forest machine learning
algorithm to compare speciation models such as no divergence, divergence
with and without gene flow, and divergence with secondary contact (Smith
& Carstens, 2020). A bSFS was used because it stores the observed
frequencies of the minor alleles for multiple populations and bins them
to avoid inference problems associated with sampling too few segregating
sites (Smith et al., 2017; Terhost & Song, 2015). DelimitR was chosen
over more traditional multi-species coalescent methods because of its
ability to take historical and current gene flow into account (Leaché,
Harris, Rannala, & Yang, 2014; Smith & Carstens, 2020). Demographic
histories are simulated using the multi-species coalescent model
implemented through fastsimcoal2 (Excoffier, Dupanloup, Huerta-Sánchez,
Sousa, & Foll, 2013) under a user-specified guide tree and set of
priors on divergence times, population sizes, and migration rates. The
random forest classifier then creates a user-defined number of decision
trees from a subset of the prior. Each decision tree compares the
empirical bSFS to the SFS of each simulated speciation model and votes
for the most similar model. The demographic model with the largest
number of votes is chosen as the best model. Out-of-bag error rates are
used to assess the power of the random forest classifier. The posterior
probability of the selected model is then calculated by regressing
against the out-of-the-bag error rates following Pudlo et al. (2015).
We created folded multi-dimensional site frequency spectrums for the twoT. blandingii clades and the two Central African T.
pulverulenta clades using easySFS
(https://github.com/isaacovercast/easySFS), a wrapper for ∂a∂i
(Gutenkunst, Hernandez, Williamson, & Bustamante, 2009). The West
African T. pulverulenta clade was not included because of the low
sample size available for this lineage. We simulated 100,000 data sets
under four models: no divergence (Model 1), divergence without gene flow
(Model 2), divergence with secondary contact (Model 3), and divergence
with gene flow (Model 4). Priors for both models were drawn from uniform
distributions for population size: 10000 – 1000000 haploid individuals
(twice the number of estimated diploid individuals), divergence time:
20000 –2000000 generations, migration rate: 0.000005–0.005
corresponding to 0.05–5 migrants per generation. We then coarsened our
empirical site frequency spectra to 10 bins each. Our out-of-bag error
rates were calculated, and 500 random forest classifiers were simulated
using 100,000 pseudo-observed data sets for each model. A confusion
matrix was calculated to determine how often the correct model was
selected and posterior probability for the ”best” model was estimated
for each species.