Introgression, incomplete lineage sorting (ILS) and hybrid
speciation
TreeMix (Pickrell & Pritchard, 2012) was used to reconstruct the
evolutionary relationships among species using a maximum likelihood tree
and accounting for potential introgression events. We constructed
consensus trees from m = 1 to m = 10 with 100 bootstrap
replicates; 10 independent runs were executed for each value ofm . We estimated the optimal number of migration events using the
R package OptM (Fitak, 2021) by assessing changes in likelihood
(Δm ) across incremental values of m . The final ML tree
with the best m and bootstrap values was visualized according to
scripts written by Zecca, Labra, and Grassi (2020).
To further detect the footprints of introgression, we attempted to
identify hybrid nodes in the phylogenetic networks based on the species
tree (see details in the Supplemental Materials and Methods) using
PhyloNetworks (Solís-Lemus, Bastide, & Ané, 2017). To reduce the
computational burden, the analyses were run using a subset of 70
individuals randomly selected from each clade and population. Quartet
concordance factors (CF) were calculated from SNP matrices using the R
function SNPs2CF (Olave & Meyer, 2020). The best number
for
hybrid nodes was searched from hmax = 1 to hmax = 5, and
the scenario with the lowest negative log pseudo-likelihood was
evaluated using bootstrap analysis with the bootsnaq function in
PhyloNetworks with 10 runs per replicate and 100 bootstrap iterations.
We then aimed to detect the specific introgression signal by estimating
Peterson’s D (ABBA-BABA) and related statistics in Dsuite
(Malinsky, Matschiner, & Svardal, 2021). D and f4statistics were estimated using all possible combinations of trios
within the phylogenetic relationships of the species tree. We used thef -branch test (fb ) in Dsuite to allocate gene flows
between specific branches of the species tree, addressing false
positives from D -statistics comparisons and ancient
introgression. The fb test was also performed to detect the
introgression signal in each gene.
Because of the difficulty of distinguishing recent introgression from
ILS, we used QuIBL (Edelman et al., 2019), a tree-based method that
compares models with ILS and introgression with models with only ILS to
obtain localized information on gene flow. We implemented BIC values to
select the ILS-only model or alternative ILS and introgression model for
each tested triplet. One individual with the highest coverage from each
species was kept, and the sequences of the 52 LCN genes were used as the
input for QuIBL. The percentages of gene sequences and SNPs showing
evidence for introgression were also calculated.
We built and examined two independent sets of evolutionary scenarios
revealed by PhyloNetworks and StrAuto to detect potential introgression
events. First, because the S. taiwanensis (TAI) lineage was
genetically admixed with S. playfairii (PLA) and S. indica(IND), we investigated the evolutionary relationship of these three
species by focusing on the speciation scenario of TAI (i.e., the
TAI-IND-PLA set). Because the three species
are not closely phylogenetically
related and TAI contains genetic components that are absent from the
other two species, we considered the effect of gene flow from a ghost
(unsampled or extinct) lineage to IND or PLA. Regarding possibleex situ sources, the ghost lineage can also be taken as the
hypothetical progenitor of TAI. We proposed three speciation scenarios
(Figure S1 A-C): Recentghost (M1), recent and independent hybridization
between IND and PLA with the ghost lineage; Ancientghost (M2), ancestral
hybridization with the ghost lineage; and Noghost (M3), ancestral
polymorphism. Second, we investigated another genetically admixed
species, S. hsiehii (HSI), from S. indica lineage 1 (IND1)
and S. austrotaiwanensis (AUS) (i.e., the HSI-AUS-IND set). We
simulated comparative scenarios of the speciation process betweenS. hsiehii (HSI), S. indica lineage 1 (IND1), and S.
austrotaiwanensis (AUS). Three models were designed (Figure S1 D-F):
Hybrid (M1), HSI originated from hybridization between IND1 and AUS;
AUSdiv (M2), the speciation of HSI from AUS; and INDdiv (M3), the
speciation of HSI from IND1. A ghost lineage was also added to the
outgroups in all simulated models. All models were simulated and
optimized using the composite likelihoods implemented in fatsimcoal2
(Excoffier et al., 2021), an Approximate Bayesian computation (ABC)
approach. Observed site frequency spectra (SFSs) were generated from
easySFS using the SNP data set. Because the computation of fastsimcoal2
is demanding, observed SFSs were projected to reduce the sample size
while obtaining 90%–95% of the maximum segregating sites.
The best model in each evolutionary scenario was selected according to
three criteria. First, the Akaike information criterion (AIC), which
represents the difference between the observed and expected SFSs, was
used to rank the optimized models. Second, the 2D-SFS fit obtained from
the optimized parameters in simulations was visually compared to the
observed SFS. Third, the parametric-bootstrap approach was used to
calculate 100 likelihoods and the 95% CI for the parameters estimated
by the best models and to select the best-fitting models. The priors and
tested scenarios are given in Supplementary file. The time of divergence
between S. indica and the ancestral lineage containing S.
austrotaiwanensis and S. playfairii was calibrated to 200 kyr
(Chiang, Huang, & Liao, 2012), with a generation time of one year.
Outlier detection and functional annotation of adaptive
genes
The genetic signal of adaptation was detected using two methods: (1) an
approach that identifies SNPs with significantly high or low genetic
differentiation compared to the overall population structure
(F ST outliers) implemented in BayeScan (Foll,
2012) and (2) a landscape genomic method that uncovers the molecular
mechanism of adaptation based on genotype-environment association (GEA)
analysis applied in BayEnv2 (Feng & Du, 2022; Günther & Coop, 2013).
In brief,
BayeScan
was performed with a parameter of posterior probability = 100 and
detected loci that showed higher genetic differentiation than the
q-value threshold < 0.05 (Foll, 2012). BayEnv2 was used to
estimate the covariance of allele frequencies between populations, and
the model was then used as a null hypothesis for detecting SNP sites
associated with environmental variables (Günther & Coop, 2013).
To detect loci strongly associated with the most important axes for
rearranging genetic components according to environmental factors, the
RDA-based method was used (Capblancq et al., 2018). To minimize false
positives, demography and geography were accounted for in the RDA-based
method by implementing two separate RDA-based strategies as described in
Capblancq et al. (2023): the first kept climate variables as predictors
without conditioning other factors (RDA-raw), whereas the second kept
climate variables as predictors while using the first two genetic PC
axes from the genetic PCA and the first three distance-based Moran’s
eigenvector maps (dbMEMs) as conditioning variables (RDA-correct). Loci
shared by at least two of the four scanning methods were retained as
loci putatively under selection by the environment.
A hierarchical approach was used for functional annotation of the
outlier loci. First, KOBAS-I (Bu et al., 2021) was used to annotate the
trimmed nucleotide sequences of all retained genes using the reference
genome of Arabidopsis thaliana with a threshold of e-value
< 10-5. Second, the complete amino acid
sequences of the orthologous genes retrieved from the genome annotation
of S. baicalensis for all unmapped genes in the first round were
retrieved and functionally annotated using KOBAS-I in the same manner as
in the first round.