Materials and Methods
Plant material
For this study, we used the materials collected from 78 populations ofCeratonia siliqua across the Mediterranean basin as described in
Viruel et al. (2020). Based on field observations, the habitat of each
population was recorded as natural, semi-natural or cultivated. Habitats
where human impact is low and there is no evidence of recent land use
were recorded as ‘natural’. ‘Semi-natural habitats’ lack cultivation but
show the clear presence of human activities such as pasture, old farming
or sometimes scattered almond, carob or olive trees from which fruits
are occasionally harvested. Habitats containing carob trees in
specialized or consociated orchards (with cereals or other fruit trees)
―even if abandoned―, were recorded as ‘cultivated’.
Preliminary delimitation of Ceratonia siliqua units using
microsatellite data
We used microsatellite data for 17 loci and 56 localities across the
Mediterranean basin as described in Viruel et al. (2020). We selected
localities with at least ten carob genotypes per population (totalling
1020 trees). We also scored 15 single nucleotide polymorphisms present
in the flanking regions of the 17 SSR loci following Viruel et al.
(2018). We calculated genetic differentiation between localities using D
index (Winter 2012, Mmod R package) and converted it into Euclidean
genetic distances based on the coordinates of a NMDS (Non-metric
Multidimensional Scaling; vegan R package). The Euclidean genetic
distances and Euclidean geographical distances were then processed by
ClustGeo (Chavent et al. 2018; clustgeo R package), which relies on
Ward’s method to minimize intra-group variance for both geographical and
genetic distances. K and 𝛼 are the main parameters. K is the number of
clusters and 𝛼 is a mixing parameter determining the weight of the
spatial constraint. Several values of 𝛼 were tested to optimize the
spatial contiguity of populations without deteriorating genetic
differentiation structure. Several values of K were also tested. This
clustering method provided genetically homogeneous groups of spatially
adjacent carob populations, named “CEUs” (Carob Evolutionary Units),
which were used in subsequent analyses. A neighbor-joining tree based on
pairwise genetic differentiation (Gst, Mmod package) among the seven
CEUs was built to display the overall differentiation structure.
RADseq methodology and sequencing
CEUs were used to select 376 samples representative of the genetic
diversity and structure of C. siliqua for RADseq. Eight samples
of the only other species in the genus, C. oreothauma, were
added. Genomic library preparation and sequencing were conducted by
Microsynth ecogenics GmbH (Blagach, Switzerland). DNA samples (200-400
ng input) were digested with the restriction enzymes EcoRI/MSeI
following heat inactivation according to the manufacturer’s protocol
(New England Biolabs, NEB). Fragments between 500 and 600 bp were
selected by automated gel cut, Illumina Y-shaped adaptors were ligated,
and ligation products were bead purified. Each library was then
individually barcoded by PCR using a dual-indexing strategy.
Individually barcoded libraries were pooled and subsequently purified
before sequencing on an Illumina NextSeq platform (300 million of 75 bp
reads per run).
Bioinformatic pipeline to extract and filter SNPs from RADseq data
The bioinformatic approach used in this study is summarized in Fig. 1.
FASTQC reports (multiQC, Ewels et al., 2016) were used to exclude 26C. siliqua and 4 C. oreothauma samples due to low
sequencing coverage. The remaining 354 samples (350 C. siliquaand 4 C. oreothauma ) had an average of 3 million raw reads,
ranging between 0.7 and 15 million. Assembly (Fig. 1) was performed
using ipyrad (Eaton and Overcast 2020) in a high-performance computing
cluster (HPC pytheas). Thereafter, a “locus” is a RADseq marker of 65
bp, that resulted from the ipyrad workflow, and a “SNP” is a
polymorphic position of a specific locus, considering that a locus can
contain several SNPs. We conducted an assembly limited to C.
siliqua following two steps. First, we selected 36 samples
representative of the diversity in C. siliqua with high
sequencing coverage to conduct four de novo assemblies with
varying thresholds of clustering reads (clust_threshold ) between
0.9 and 0.96, the minimum number of samples per locus
(min_sample_locus ) fixed to 30, the minimum depth
(mindepth_statistical ) for base calling fixed to 8, the minimum
size of reads fixed to 50 and the first 5 bp of both ends of locus
trimmed. The maximum number of indels and the maximum percentage of SNPs
per locus were set to 1 and 10%, respectively. Default values were used
for all other parameters. Considering the number of loci,
heterozygosity, and error rates, we estimated an optimal clustering
threshold of 0.94 (see Tab. S1 supplementary material). A reference file
was generated by extracting the first sequence of each locus with
pyrad2fasta script (available on https://github.com/pimbongaerts).
Second, this file was used for a subsequent reference assembly for the
350 samples with the same parameters as previously except for the
minimum number of samples per loci, which was fixed to 45. The dataset
constructed using this reference assembly contained more than 64%
missing data. The vcf file obtained from ipyrad was then processed to
run a principal component analysis (PCA) using the glPca function of
adegenet R package, and a neighbor-joining tree, based on pairwise
genetic differentiation (Gst, Mmod R package) among the seven CEUs, to
check that the structure from RAD markers was congruent with previous
analysis based on microsatellite markers.
We used Matrix condenser software (de Medeiros and Farrell, 2018;
de Medeiros 2019) to visualize the samples causing this high missing
data rate. We filtered samples to optimize locus coverage and to keep
the sampling equilibrium among CEUs, and reconducted the reference
assembly with ipyrad on a new set of 190 samples. The data was then
reduced to one SNP by locus, based on two criteria: SNP having the
maximal minimum allelic frequency per locus and only keeping SNPs with
allelic frequency above 1.05 % (i.e., rarest allele present in at least
two individuals). To examine the carob genetic diversity based on
neutral processes, such as genetic drift and gene flow, we reduced the
effect of outlier loci (i.e. having unexpectedly high differentiation
among populations, which could have a great effect on the variance among
populations). We used Outflank software (Whitlock & Lotterhos, 2015),
which produces a lower false-positive rate compared to other methods
(e.g. Silliman 2019), considering CEUs as populations. The false
discovery rate (qvalue) was fixed to 0.05. BLAST searches using the
nucleotide NCBI database limited to flowering plant sequences were
conducted for outliers, which were mostly assigned to plastid (pDNA) or
mitochondrial (mtDNA) genomes. Nuclear sequences were then discarded
whereas pDNA and mtDNA sequences were used as a reference in a new
ipyrad assembly, with adapted parameters for haploid data. The sequences
of these new sets of markers were concatenated to construct separate
alignments of mitochondrial and plastid haplotypes for the 190 carob
trees.
Hereafter, different analyses were conducted on three datasets: the
genome-wide nuclear data (GWN), was used for the analysis of genetic
differentiation, population divergence history and gene flow, whereas
the mitochondrial and plastid-based haplotypes datasets were used in
phylogenetic reconstructions (MH and PH datasets respectively).
Raw and filtered vcf files, as well as custom R scripts are available in
####: ####.
Population genomic analysis
The GWN dataset was converted from genind to genlight, vcf and treemix
data formats using dartR (Gruber et al., 2018) and radiator (Gosselin
2020) R packages. Population structure analysis was performed using snmf
(R LEA package, Frichot and François 2015) which estimates admixture
coefficients from the genotypic matrix assuming K ancestral populations.
Snmf runs fast even on large datasets and without loss of accuracy
compared to other Bayesian modelling software such as ADMIXTURE
(Alexander et al. 2009; Frichot et al. 2014). Snmf was run for K = 2 to
15, 100 repetitions, regularization parameter of 250 and 25% of the
genotypes were masked to compute the cross-entropy criterion. Barplots
showing ancestry coefficients were obtained with the compoplot function
in adegenet R package (Jombart and Ahmed 2011), with genotypes sorted
according to CEUs. To visualize the genetic diversity structure, we
performed a PCA using the glPca function of adegenet R package. Pairwise
differentiation among CEUs (Gst) were also computed with the Mmod
package. The R scripts of these analyses are available in #### web
site: ####.
A new reference assembly was performed to add C. oreothaumasamples to the GWN dataset and obtain one including the two species. As
previously, to obtain unlinked SNPs, we retained one SNP per loci,
choosing the one with the highest frequency. Using PAUP*4.0a (Swofford,
2018), we built a coalescent-based tree with the SDV quartet method
(Chifman & Kubatko 2014). Ten million randomly selected quartets were
analyzed and node support was assessed by 1,000 bootstrap replicates. We
used the ‘distribute’ option for heterozygous sites.
The GWN dataset without C. oreothauma genotypes was used to
perform a Treemix analysis (Pickrell and Pritchard 2012) to infer
population splits and mixtures across the evolutionary history of the
carob tree. Treemix builds a backbone tree based on population allelic
frequency without gene flow and then adds reticulate branches,
representing gene flow, aiming at improving the fit of the data. For
this analysis, we used the CEUs as populations. One of the CEUs was
fixed as an outgroup according to the results obtained in the
phylogenetic analysis (see SDV quartet method in Results). As
previously, only unlinked SNPs were used.
Assemblies based on MH and PH sequences failed to include C.
oreothauma . The MH and PH matrices were analyzed with IQTREE with the
default parameters. The F81+F+I and TN+F+I+G4 models were retained for
the MH and PH matrices, respectively.
Footprints of domestication were investigated by estimating genetic
diversity and/or the presence of candidate loci under selection
(outliers) for the CEUs. A stronger effect of domestication is expected
in cultivated habitats compared to natural habitats and also in the
eastern CEUs compared to the western ones. Therefore, the analyses were
organized according to combinations of these two factors. Genetic
diversity indexes (Hobs, Hexp, f and
rbardD) were estimated using hierfstats and poppr R
packages (Goudet 2005; Kamvar et al. 2014). Outliers were searched with
the Outflank method with a FDR threshold of 0.05 as described above when
building the GWN dataset considering combinations of habitats and CEUs
as populations.