2.4 Bioinformatics pipeline and SNP filtering
Raw sequence data were processed using Stacks v2.62 (Catchen et al.,
2013) and aligned to the genome with BWA v0.7.17-r1188 (Li & Durbin,
2009). Full details of the bioinformatics pipeline, which produced a
variant call format (VCF) file containing 45,488 variants for SNP
filtering in R v4.0.3 (R Core team 2020) are provided in Supplementary
file S1. We filtered genotyped variants using the “SNPfiltR” v1.0.0
package (DeRaad, 2022) based on (i) minimum read depth (≥ 5), (ii)
genotype quality (≥ 20), (iii) maximum read depth (≤ 137), and (iv)
allele balance ratio (0.2 – 0.8). Then, using a custom R script, we
filtered SNPs based on (i) the level of missing data (< 5%);
(ii) minor allele count (MAC ≥ 3), (iii) observed heterozygosity
(< 0.6), and (iv) linkage disequilibrium (correlation
< 0.5 among loci within 500,000 bp).
To ensure that relationships between individuals could be accurately
inferred from the data, we used these SNPs and samples to construct a
hierarchical clustering dendrogram based on genetic distance, with
visual examination of the dendrogram confirming that all 24 replicates
paired closely together on long branches (Figure S1). We also made a
higher-level bootstrapped dendrogram by using genetic distances among
sampling localities instead of individuals (Figure S2). The percentage
difference between called genotypes of technical replicates was also
used to confirm that genotyping error rates were low after filtering
(mean 99.91% ± 0.005% SE similarity between replicates). We therefore
removed one of each replicate pair from all further analyses.
We used “tess3r” (Caye et al. 2016, 2018) to perform a genome
scan for loci under selection, using the Bejamini-Hochberg algorithm
(Benjamini & Hochberg, 1995), with a false discovery rate of 1 in
10,000 to correct for issues associated with multiple testing. Because
this method identified zero candidate loci under selection, we also used
the gl.outflank function in “dartR” v2.0.4 to implement the
OutFLANK method (Whitlock & Letterhos 2015) to infer the distribution
of FST for loci unlikely to be strongly affected by
spatially diversifying selection. This method also identified zero
putatively adaptive loci, leaving a final dataset for formal population
genetic analysis containing all 70 originally sampled individuals, 5,239 biallelic SNPs, and an overall missing data level of 0.98 %. The number
of SNPs and samples removed from the dataset at each filtering step is
provided in Table S2.