Introgression, incomplete lineage sorting (ILS) and hybrid speciation
TreeMix (Pickrell & Pritchard, 2012) was used to reconstruct the evolutionary relationships among species using a maximum likelihood tree and accounting for potential introgression events. We constructed consensus trees from m = 1 to m = 10 with 100 bootstrap replicates; 10 independent runs were executed for each value ofm . We estimated the optimal number of migration events using the R package OptM (Fitak, 2021) by assessing changes in likelihood (Δm ) across incremental values of m . The final ML tree with the best m and bootstrap values was visualized according to scripts written by Zecca, Labra, and Grassi (2020).
To further detect the footprints of introgression, we attempted to identify hybrid nodes in the phylogenetic networks based on the species tree (see details in the Supplemental Materials and Methods) using PhyloNetworks (Solís-Lemus, Bastide, & Ané, 2017). To reduce the computational burden, the analyses were run using a subset of 70 individuals randomly selected from each clade and population. Quartet concordance factors (CF) were calculated from SNP matrices using the R function SNPs2CF (Olave & Meyer, 2020). The best number for hybrid nodes was searched from hmax = 1 to hmax = 5, and the scenario with the lowest negative log pseudo-likelihood was evaluated using bootstrap analysis with the bootsnaq function in PhyloNetworks with 10 runs per replicate and 100 bootstrap iterations.
We then aimed to detect the specific introgression signal by estimating Peterson’s D (ABBA-BABA) and related statistics in Dsuite (Malinsky, Matschiner, & Svardal, 2021). D and f4statistics were estimated using all possible combinations of trios within the phylogenetic relationships of the species tree. We used thef -branch test (fb ) in Dsuite to allocate gene flows between specific branches of the species tree, addressing false positives from D -statistics comparisons and ancient introgression. The fb test was also performed to detect the introgression signal in each gene.
Because of the difficulty of distinguishing recent introgression from ILS, we used QuIBL (Edelman et al., 2019), a tree-based method that compares models with ILS and introgression with models with only ILS to obtain localized information on gene flow. We implemented BIC values to select the ILS-only model or alternative ILS and introgression model for each tested triplet. One individual with the highest coverage from each species was kept, and the sequences of the 52 LCN genes were used as the input for QuIBL. The percentages of gene sequences and SNPs showing evidence for introgression were also calculated.
We built and examined two independent sets of evolutionary scenarios revealed by PhyloNetworks and StrAuto to detect potential introgression events. First, because the S. taiwanensis (TAI) lineage was genetically admixed with S. playfairii (PLA) and S. indica(IND), we investigated the evolutionary relationship of these three species by focusing on the speciation scenario of TAI (i.e., the TAI-IND-PLA set). Because the three species are not closely phylogenetically related and TAI contains genetic components that are absent from the other two species, we considered the effect of gene flow from a ghost (unsampled or extinct) lineage to IND or PLA. Regarding possibleex situ sources, the ghost lineage can also be taken as the hypothetical progenitor of TAI. We proposed three speciation scenarios (Figure S1 A-C): Recentghost (M1), recent and independent hybridization between IND and PLA with the ghost lineage; Ancientghost (M2), ancestral hybridization with the ghost lineage; and Noghost (M3), ancestral polymorphism. Second, we investigated another genetically admixed species, S. hsiehii (HSI), from S. indica lineage 1 (IND1) and S. austrotaiwanensis (AUS) (i.e., the HSI-AUS-IND set). We simulated comparative scenarios of the speciation process betweenS. hsiehii (HSI), S. indica lineage 1 (IND1), and S. austrotaiwanensis (AUS). Three models were designed (Figure S1 D-F): Hybrid (M1), HSI originated from hybridization between IND1 and AUS; AUSdiv (M2), the speciation of HSI from AUS; and INDdiv (M3), the speciation of HSI from IND1. A ghost lineage was also added to the outgroups in all simulated models. All models were simulated and optimized using the composite likelihoods implemented in fatsimcoal2 (Excoffier et al., 2021), an Approximate Bayesian computation (ABC) approach. Observed site frequency spectra (SFSs) were generated from easySFS using the SNP data set. Because the computation of fastsimcoal2 is demanding, observed SFSs were projected to reduce the sample size while obtaining 90%–95% of the maximum segregating sites.
The best model in each evolutionary scenario was selected according to three criteria. First, the Akaike information criterion (AIC), which represents the difference between the observed and expected SFSs, was used to rank the optimized models. Second, the 2D-SFS fit obtained from the optimized parameters in simulations was visually compared to the observed SFS. Third, the parametric-bootstrap approach was used to calculate 100 likelihoods and the 95% CI for the parameters estimated by the best models and to select the best-fitting models. The priors and tested scenarios are given in Supplementary file. The time of divergence between S. indica and the ancestral lineage containing S. austrotaiwanensis and S. playfairii was calibrated to 200 kyr (Chiang, Huang, & Liao, 2012), with a generation time of one year.
Outlier detection and functional annotation of adaptive genes
The genetic signal of adaptation was detected using two methods: (1) an approach that identifies SNPs with significantly high or low genetic differentiation compared to the overall population structure (F ST outliers) implemented in BayeScan (Foll, 2012) and (2) a landscape genomic method that uncovers the molecular mechanism of adaptation based on genotype-environment association (GEA) analysis applied in BayEnv2 (Feng & Du, 2022; Günther & Coop, 2013). In brief, BayeScan was performed with a parameter of posterior probability = 100 and detected loci that showed higher genetic differentiation than the q-value threshold < 0.05 (Foll, 2012). BayEnv2 was used to estimate the covariance of allele frequencies between populations, and the model was then used as a null hypothesis for detecting SNP sites associated with environmental variables (Günther & Coop, 2013).
To detect loci strongly associated with the most important axes for rearranging genetic components according to environmental factors, the RDA-based method was used (Capblancq et al., 2018). To minimize false positives, demography and geography were accounted for in the RDA-based method by implementing two separate RDA-based strategies as described in Capblancq et al. (2023): the first kept climate variables as predictors without conditioning other factors (RDA-raw), whereas the second kept climate variables as predictors while using the first two genetic PC axes from the genetic PCA and the first three distance-based Moran’s eigenvector maps (dbMEMs) as conditioning variables (RDA-correct). Loci shared by at least two of the four scanning methods were retained as loci putatively under selection by the environment.
A hierarchical approach was used for functional annotation of the outlier loci. First, KOBAS-I (Bu et al., 2021) was used to annotate the trimmed nucleotide sequences of all retained genes using the reference genome of Arabidopsis thaliana with a threshold of e-value < 10-5. Second, the complete amino acid sequences of the orthologous genes retrieved from the genome annotation of S. baicalensis for all unmapped genes in the first round were retrieved and functionally annotated using KOBAS-I in the same manner as in the first round.