1 Introduction
Trees profoundly influence global carbon cycles and ecosystem stabilization [1-3]. The effects of anthropogenic climate change on precipitation and temperature patterns have altered the distribution, community composition, and phenology of forest trees worldwide [4-7]. However, the impacts of climate change on focal species vary [8-12]. For example, warming temperatures may have a positive effect on the growth rate of trees from colder environments but a negative effect on trees from tropical regions [10, 13]. Higher temperatures and increased aridity are predicted to increase stress reactions in forest trees under drought conditions, especially tropical and subtropical tree species [14-17].
Studies of the potential response to climate change based on field experiments have emphasized the importance of phenotypic variation for adaptation to local climate and acclimation to drought and cold stresses [18, 19]. However, field experiments are unrealistic for species with long lifespans, such as trees, or endangered species with limited populations. An alternative approach is genotype–environment association (GEA) analysis, which identifies environmental factors that select for genetic characteristics [20, 21]. By predicting the vulnerability of forest trees to climate change and environmental factors that may restrict future distribution, GEA analyses can provide a foundation for conservation projects and forest management [22-24].
Although GEA studies have implied that standing genetic variation may enable some tree species to cope with climate change, the long lifespans and low germination rates of tree species may limit the pace of adaptation to acute and drastic environmental changes [25]. Moreover, the applicability of the results of GEA studies to conservation projects may be confounded by several factors. For example, the geographical distance between source and sink populations and the composition of natural barriers to gene flow may influence pollen dispersal direction and germination rates for introgression between populations [26]. These factors could reduce the spread of beneficial alleles (i.e., genetic rescue effects) [27]. Incorporating landscape analyses into GEA studies could improve the evaluation of genetic vulnerability, assessment of vulnerable populations, and inference of potential dispersal routes under climate change [28, 29].
The tree family Fagaceae is ecologically and economically important and has been widely studied in landscape genetics research [30, 31]. The varied natural habitats of Fagaceae species offer ideal study systems for exploring the effects of environment on genetic diversity, local adaptation, and the response to climate change [32, 33]. Studies have shown that environmental factors, such as precipitation and temperature, significantly affect the adaptive divergence of Fagaceae species [34], but the impacts of other abiotic factors, such as wind, topology, and soil, have largely been ignored [35, 36]. Field experiments have shown that wind and soil factors are prominent drivers of local adaptation of plant morphology and physiology [37, 38], but the potential impacts of these factors on genomic architecture and phenotypic variation across the heterogeneous landscape of Fagaceae species have not been comprehensively established [19, 37, 39, 40]. This is particularly true for Fagaceae species on the subtropical island of Taiwan, where the mountainous terrain and diverse climate have created a range of habitats [41-43].
The rapid development of novel analytic methods in landscape genomics provides unprecedented opportunities to examine new hypotheses and assess vulnerable populations using different statistical assumptions in the context of climate change [2]. If climate change disrupts the allele frequencies that underlie current genetics–environment relationships, vulnerable populations may become less resilient or even extinct locally [44-47]. In this study, we adopted Quercus longinux Hayata, an endemic Fagaceae species in Taiwan, as the study species to evaluate the complex effects of environment on local adaptation and the response to climate change. Q. longinux inhabits mountainous ranges across Taiwan, from low to middle elevations (500 m to 2,200 m above sea level) [48]. Based on its wide distribution along latitudinal and altitudinal gradients, Q. longinux is classified into three varieties (i.e., var. longinux, var. kuoi, and var. lativiolaciifolia), and large morphological variations in fruits and leaves between habitats have been observed [48]. Q. longinux var. longinux and var. lativiolaciifolia grow sympatrically at low to middle elevations in Taiwan. By contrast, Q. longinux var. kuoi is allopatric with the other varieties and is limited to southeastern Taiwan. Compared with the other two varieties, Q. longinux var. kuoi has longer, broader leaves that are green on both sides when fresh and have a non-violet abaxial surface when dried [48]. Environmental variations may influence several leaf traits associated with photosynthetic efficiency and acclimation to abiotic stresses [49-51]. Identifying variations in not only genetic data but also leaf morphology that are associated with environment could shed light on the processes that gave rise to the unique Q. longinux var. kuoi in southern Taiwan.
This work was guided by three objectives. First, we aimed to infer genetic structure and assess the consistency of relationships between genetic data, ecological niches, and phenotypic traits for Q. longinux var. kuoi, a special allopatric variety limited to southern Taiwan. We hypothesized that the distinct habitats in southern Taiwan influence population structures and result in differentiated genetic characteristics that correspond with morphological traits. Second, we aimed to identify environmental features that affect spatial genetic variation and adaptive genetic divergence. We hypothesized that classic temperature and precipitation factors influence genetic divergence and that wind- and soil-related factors participate in adaptation to heterogeneous habitats in Taiwan. Third, we aimed to use adaptive genetic variants to evaluate the vulnerability of populations to climate change. We hypothesized that edge populations and populations subject to the most severe environmental changes will experience the highest genetic offsets and face the highest risk of local extinction.
2 Materials and methods
2.1 Sampling, sequencing, read mapping, and variant calling
We collected 26 populations spanning all known distributional regions of Q. longinux in Taiwan (Fig. S1; Table S1). From each selected tree, a branch with more than ten mature leaves was collected for morphological measurements. Fresh leaves were stored in silica gel at 4°C until DNA extraction.
DNA was extracted using the modified CTAB method based on Doyle [52]. DNA quality and quantity were evaluated with a NanoDrop 2000 Spectrophotometer (Thermo Fisher Scientific, Wilmington, DE, USA) and Qubit 2.0 Fluorometer (Life Technologies, Carlsbad, CA, United States), respectively. After quality control, DNA from 205 individuals was used for dd-RAD library construction. The library was double digested by Sbf1 and Msp1 and ligated to Illumina sequencing adapters with individual barcodes and library indices. Fragments of 250-500 bp were selected and amplified by polymerase chain reaction (PCR). Finally, the fragments were sequenced using Illumina HiSeq X (Illumina, USA) to generate 150-bp paired-end reads.
We trimmed the raw reads to remove adapters, reads < 50 bp, and reads with quality < 30 using fastp [53]. The clean reads were mapped to the reference genome of Q. robur [54] using the mem algorithm of BWA [55]. Aligned reads with a mapping quality score < 20 were discarded. PCR duplicates were marked with the Picard Toolkit (http://broadinstitute.github.io/picard/). We further performed realignment around indels using ABRA2 [56]. Summary statistics, including read number, mapping rate, and coverage, were calculated from SAMtools [57]. Variants were called with BCFtools [58]. We first generated genotype likelihoods at each genomic position with coverage using the BCFtools mpileup command and called the variants with default parameters. After removing indels, we retained biallelic SNPs with missing rates < 0.4, minor allele frequencies > 0.01, genotypes with quality scores >30, and mean minimum depths > 3 as filtering thresholds in VCFtools [59].
2.2 Genetic diversity, genetic structure, and demographic analysis
To evaluate and visualize the genetic clusters of Q. longinux, we first performed principal component analysis (PCA) using a genetic matrix in which missing data were replaced with mean values of each site in the R package adegenet [60]. Next, pairwise FST, population-specific FST, and permutational multivariate analysis of variance (PERMANOVA) were performed with the R packages hierfstat [61] and vegan [62] to evaluate differentiation among defined genetic clusters. A neighbor-joining (NJ) tree was also constructed based on the Nei distances from the R package ape [63]. We then employed StrAuto [64] analysis to assess the pattern of admixture among populations using 100,000 steps of MCMC chains with 25,000 steps burn-in and 10 iterations from K = 1 to 8. The results were uploaded to the CLUMPAK [65] server to generate consensus plots and evaluate the best K according to the ΔK value.
A stairway plot was used to infer historical changes in the effective population size (Ne) of
Q. longinux [66]. We first generated the site frequency spectra (SFSs) from easySFS (GitHub repo:
https://github.com/isaacovercast/easySFS) using all SNPs and selected the projection values with the highest segregating sites for each group. We then ran bootstrapping 1,000 times to estimate median sizes and 95% CIs, assuming a mutation rate of 1.01 × 10
-8 per bp per generation and a generation time of 50 years for oak species [67].
2.3 Environmental variables and GIS processing
Three categories of environmental variables were collected: 105 climate variables, eight soil variables, and four topological variables (Table S2). The climate variables included BIO01-19, seven monthly precipitation factors, temperature, solar radiation, and wind speed downloaded from WorldClim2 [68] and two aridity-associated layers [69]. Eight physicochemical soil characteristics downloaded from SoilGrids 2.0 were used as the soil variables [70]. For topography, the altitude layer was downloaded from GOV.DATA.TW (https://data.gov.tw/), and three altitude-derived layers (i.e., roughness, aspect, and slope) were computed with the R package raster [71]. All variables were resampled to a resolution of 1 km2 for the downstream analyses. To address the issue of multicollinearity, variables with VIF > 10 and Pearson correlation coefficient |r| > 0.8 were discarded, leaving 20 environmental variables among the three categories (Table S3; Supplementary data) for use in the subsequent analyses. We compared differences in the environmental variables between groups using analysis of variance (ANOVA) and Tukey's honestly significant difference test applied in the R package stats [72].
2.4 Assessment of ecological and morphogenetic divergence
2.4.1 Niche differentiation assessment
To harness the power of GEA analysis to evaluate a species’ ability to cope with future climate change, it is crucial to include non-genetic factors that may be associated with fitness to local environments, such as niche characteristics and phenotypic traits. To investigate and quantify the niche overlap and diversification between the three (eastern, western, and southern) genetic groups revealed in our results, we first assessed the overlap of the realized niche using the first two axes from the environmental PCA performed with the 20 retained environmental variables in the R package ecospat [73]. Five hundred pseudo-absence points were added to the analyses as background. In addition, we calculated two niche overlap indices: Schoener’s D and the standardized Hellinger-transformed Warren's I based on occurrence density grids. Next, we conducted the equivalency test with 100 permutations and the background test with 100 runs to evaluate the significance of niche differentiation at p < 0.05. The equivalency test and background test were implemented with the R package ENMtools [74].
2.4.2 Phenotypic trait analysis
To quantify the contributions of genetic and environmental factors to phenotypic variations, we measured 12 morphological leaf traits that are commonly used in morphological studies of Fagaceae and other plants (Table S3; Fig. S2; Supplementary data) [75-77]. A total of 826 dried leaves of 191 individuals from 26 populations (1–5 leaves per individual) were recorded with a digital camera and measured using ImageJ [78]. Because the measured traits may be highly correlated with leaf size, an additional six computational traits that are not typically correlated with leaf size were added [77]. We first performed PCA and PERMANOVA with the six computational traits (i.e., shape-related traits and specific leaf area (SLA)) and abaxial surface color to quantify morphological differences among groups using the R package vegan [62]. Biplots of the attributes and contributions were visualized with the R package factoextra [79]. Differences in traits between groups were compared using ANOVA and Tukey's post hoc test. Next, we performed partial redundancy analysis (RDA) to evaluate the contributions of geography, demography, and environment to phenotypes using all scaled phenotypic traits as the response.
To further investigate the associations between morphological variables and environmental predictors, we constructed univariate generalized linear models (GLMs) with one of the leaf traits as the response (Table S4) and one of the environmental variables as the predictor in the R package MASS [80].
2.5 Detection of local adaptation and underlying gene functions
2.5.1 Identification of environment-associated genetic variants
We integrated two FST-based outlier detection methods and one genotype–environment association (GEA) approach to detect environment-associated SNPs. First, we implemented BayeScan [81] and PCAdapt [82] to identify SNPs with a significant departure of allele frequency from neutrality while controlling for background genetic structure. BayeScan was run with posterior odds (PO) = 100, and SNPs with q < 0.01 were retained as putative outliers. PCAdapt was performed with the number of K principal components (PCs) selected by the decreasing order of the percentage of variation explained by each PC. We further implemented Benjamini–Hochberg correction [83] on the results and retained the SNPs with a false-discovery rate (FDR) < 10% as candidate outliers.
Second, we utilized a univariate approach, the latent factor mixed models (LFMM) algorithm [84, 85], to discover SNPs that were significantly correlated with environmental variables while accounting for population genetic structure. Imputation of missing data and determination of latent factors (K) were performed by the snmf function in the R package LEA [84]. LFMM was performed in the R package LEA [84] using the retained environmental variables with ten runs, and Z-scores from each run were combined. To further adjust the p-values, the Benjamini-Hochberg procedure [83] was used, and FDR < 10% was used to identify putative adaptive outliers. SNPs that overlapped between significant environment-associated outliers and the BayeScan or PCAdapt outliers were regarded as putative adaptive SNPs for use in downstream analyses.
To investigate and compare the roles of geography, demography, and environment in shaping the genetic variation of adaptive and all SNPs, we decomposed the relative contributions of each group of predictors using three methods. First, Mantel tests were used to test for associations between FST/(1-FST) and geographic distance (isolation by distance, IBD) and environmental distance (isolation by environment, IBE) using the R package vegan [62]. Environmental distances were represented by the Euclidean distances of scaled climatic, soil, and topographical variables. Values of pairwise FST between populations were calculated with the R package hierfstat [61]. Differences in FST between adaptive and all SNPs were compared by ANOVA in the R package stats [72].
Second, we performed partial redundancy analysis (RDA) using three predictor datasets: (1) proxies of geographic structure obtained by converting the geographic coordinates to uncorrelated axes using principal coordinates of neighborhood matrices (PCNM) in the R package vegan [62] and retaining half of the positive axes; (2) proxies of demographic history obtained by performing PCA using the genetic matrix of all SNPs and retaining the PC scores from PC1 to PC2; and (3) the 20 retained environmental variables. The genetic metrics calculated from the adaptive or all SNPs were used as the response variables. To further quantify the variation explained by each group of environmental variables, we performed another partial RDA using the 20 environmental variables, which were classified into (1) climatic, (2) soil, and (3) topographical variables, as predictors of the two genetic metrics as the response. The significance of the full models and pure fractions was assessed using 999 permutations with the function anova.cca() of the R package vegan [62].
Third, we used GradientForest (GF), a non-linear and machine-learning approach derived from the random forest algorithm [86]. GF was performed with the genetic metrics of the adaptive or all SNPs. The number of regression trees was set to 500, the maximum number of splits was set using the formula log2 (0.368 × number of individuals/2), and the correlation threshold r was set to 0.5, as suggested in a previous study [86]. The importance of each group of predictors was evaluated by the percentage of weighted R2.
2.5.2 Gene annotation and enrichment analysis
To further investigate the potential gene functions of putative environment-associated SNPs, we retrieved the closest genes with a physical distance < 10 Kbps as potential underlying genes using BEDOPS [87]. The closest genes were further annotated to the Arabidopsis thaliana genome using KOBAS-I [88] with the criterion of e-value < 10-5. After retrieving the homologous genes, gene enrichment analysis was performed with the GO and KEGG databases to detect enriched pathways using KOBAS-I. Significance was assessed with Fisher's exact test [89], and FDR was corrected with the Benjamini and Hochberg procedure [83]. FDR < 5% indicated significantly enriched GO terms or pathways.
2.6 Investigating the influences of current landscape variables on genetic differentiation
To compare landscape characteristics that shaped the differentiation of adaptive or all SNPs, we constructed IBD, IBE, and isolation by resistance (IBR) models using FST/(1-FST) calculated from adaptive or all SNPs as a response. Six predictor matrices were generated, including a geographical Euclidean distance metric (IBD), three environmental Euclidean distances calculated from the climate, soil, and topology factors (IBEclimate, IBEsoil, and IBEtopology), and two circuit-theory-based resistance layers calculated with CIRCUITSCAPE [90] using distribution maps of ecological niche modeling from the factors of climate and topography (e.g., IBRclimate, IBRtopography).
We used two complementary approaches to compare the models. First, reciprocal causal modeling (RCM) [91] was performed to evaluate the relative importance of each model based on the relative values of partial Mantel tests estimated from each focal and alternative model. All partial Mantel tests were computed using the R package vegan [62] with 999 permutations. Second, we implemented the maximum likelihood population mixed-effects model (MLPE) [92] to compare each predictor based on the AIC values. Models with DAIC > 2 were considered to have different model fits. MLPE was implemented with mlpe_rga in the R package ResistanceGA [93]. In addition, we performed an estimated effective migration surfaces (EEMS) analysis [94] to visualize geographic regions deviating from the assumption of IBD. Missing data were imputed and converted to genetic distance using the function bed2diffs. Next, EEMS was executed with RunEEMS_SNPS with a Deme number of 500. A total of 500,000 Markov chain Monte Carlo (MCMC) iterations were run after a burn-in of 150,000 and a thinning interval of 9,999.
2.7 Investigating the impacts of future climate change on genetic adaptedness
2.7.1 Ecological niche modeling
To investigate the current potential distribution range of Q. longinux, we performed ecological niche modeling (ENM) using an ensemble approach based on the true skill statistic (TSS)-weighted combination of six methods in the R package sdm: maxent, glm, svm, gam, mda, and mlp [95]. Five hundred pseudo-absence data were generated and added to the analyses. A total of 60 runs with 10-fold cross-validation analyses were performed. We then used the area under the receiver-operator curve (AUC) to evaluate model performance. Finally, the habitat prediction was transformed into a binary map that classified suitable and unsuitable regions using the threshold of maximum test sensitivity plus specificity.
2.7.2 Genetic offset assessment
We applied five temperature and precipitation variables (i.e., BIO1, BIO3, BIO12, PREC4, and PREC10) selected from the obtained factors with accessible prediction layers to assess the influence of climate change on Q. longinux. We downloaded and averaged the predictions from three future models (CCSM4, MIROC-ESM, and MIROC5) in 2070 to account for model variability. We also considered two contrasting representative concentration pathways (RCPs): a low-emission model (RCP2.6) and a high-emission model (RCP8.5) from CMIP5 for 2070. Three complementary approaches were used to evaluate genetic offset. We first calculated the risk of non-adaptedness (RONA), which represents the theoretical changes needed in allele frequency to maintain the linear environment-genotype relationships with correction on weighted R2, implemented in pyRONA [96]. We retained the top three representative climatic variables with the highest number of significant outliers.
Second, as a complementary method to RONA, we used a random forest algorithm to model the non-linear relationships between adaptive SNPs and the same climatic variables as RONA in the R package GradientForest [86]. PCA was performed on the GF model to visualize the prediction of genetic variation in spatial regions, and the first three principal components (PCs) were assigned to a red–green–blue palette. The genetic offset in the GF model was calculated as the Euclidian distance between the current and future genetic composition at each grid. The binary map generated by ENM was used as a mask to limit prediction within suitable habitats.
Third, following Gougherty, Keller [97], we integrated migration to predict maladaptation to future climate using a generalized dissimilarity modeling (GDM) algorithm [2] with local, forward, and reverse genetic offsets. Local offset represents the predicted in situ change in allele frequencies with no migration. Forward offset was calculated by selecting the minimum offset between each grid in the current range and all grids in the future climate, and reverse offset was calculated by identifying the minimum offset between each grid within the current range in the future and all grids in the current climate. For forward offset, the distance required to migrate to the place with minimum forward genetic offset and the direction (bearing) of migration were also recorded. No dispersal limitation was assumed for forward and reverse offsets to any locations in the suitable habitats. Finally, to simultaneously visualize local, forward, and reverse offsets, we mapped the three metrics as red, green, and blue in an RGB image.
3 Results
3.1 Population structure and demographic history
On average, 70.9% of reads per sample were successfully mapped to the reference genome of Q. robur. We recovered 624,451 SNPs and retained 1,933 high-quality SNPs with an average individual missing rate of 11%. StrAuto revealed three genetic clusters (Fig. 1a): the eastern and western clusters, which are mainly separated by the central mountain range (CMR), and a cluster in the Henchung Peninsula that was limited to southern Taiwan, denoted as HC. Multiple individuals showed admixture with other clusters (Fig. 1b). PCA and the neighbor-joining (NJ) tree produced consistent patterns (Fig. 1c; Fig. S3a-b). FST was higher between genetic clusters than within groups, and divergence was highest between HC and either the eastern or western cluster (Fig. S4c). Similar results were obtained by PERMANOVA (Table S5). According to the stairway plot (Fig. S4d), Ne has differed between HC and the other two clusters since 1 million years ago (Fig. S4d). The apparent population decline of HC 1 million years ago was preceded by a period of larger population size. By contrast, the eastern and western clusters underwent a relatively recent population decline in the last 100,000 years.