Reference assembly, bioinformatic filtering and genotype calling
Custom Perl scripts (RADTOOLKIT v. 0.13.10, available in the Dryad Digital Repository) were used to process the resulting sequences. Raw fastq reads were demultiplexed allowing a maximum of one mismatch and discarded if the expected cut site was not found. Demultiplexed reads were trimmed of Illumina adapter contamination and low-quality bases using cutadapt (Martin, 2011) and Trimmomatic (Bolger, Lohse and Usadel, 2014). Cleaned reads were clustered with CD-HIT (Li and Godzik, 2006; Fu et al., 2012), with a minimum support of three reads per cluster, and a representative sequence was retained for each cluster. RepeatMasker (http://repeatmasker.org/) was used to mask putative repetitive elements, low-complexity regions and short repeats using 'Mytilidae' as a database (Smit, Hubley and Green, 2014). Loci were discarded if >60% of nucleotides per locus were Ns. The resulting RAD loci were combined across all individuals, and a reference was built from loci shared by at least 70% of individuals.
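As a minimal sketch of the clustering and cluster-support step described above, the code below clusters cleaned reads with CD-HIT-EST and then keeps only representatives of clusters supported by at least three reads, parsed from the standard .clstr output. The file names, 0.9 identity threshold and thread/memory settings are illustrative assumptions, not the parameters used in RADTOOLKIT.

```python
#!/usr/bin/env python3
"""Sketch: cluster cleaned reads with CD-HIT-EST and keep representative
sequences only for clusters with >= 3 supporting reads. File names and
the 0.9 identity threshold are assumptions."""

import subprocess

MIN_CLUSTER_SIZE = 3  # minimum read support per cluster (from the text)

# Cluster cleaned reads (FASTA); -c is the identity threshold,
# -T the number of threads, -M the memory limit (values assumed).
subprocess.run(
    ["cd-hit-est", "-i", "cleaned_reads.fasta", "-o", "clusters.fasta",
     "-c", "0.9", "-T", "4", "-M", "2000"],
    check=True,
)

# Parse the .clstr file produced by CD-HIT: count members per cluster
# and record each cluster's representative (the member flagged with '*').
clusters = []            # list of (representative_id, member_count)
rep, size = None, 0
with open("clusters.fasta.clstr") as clstr:
    for line in clstr:
        if line.startswith(">Cluster"):
            if rep is not None:
                clusters.append((rep, size))
            rep, size = None, 0
        else:
            size += 1
            if line.rstrip().endswith("*"):          # representative member
                rep = line.split(">")[1].split("...")[0]
    if rep is not None:
        clusters.append((rep, size))

# Keep only representatives of sufficiently supported clusters.
keep = {rep for rep, size in clusters if size >= MIN_CLUSTER_SIZE}
print(f"{len(keep)} of {len(clusters)} clusters retained")
```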
Cleaned sequence reads for each individual were aligned separately to the de novo-generated reference using Novoalign (http://www.novocraft.com), and only uniquely mapping reads were retained. Picard (http://www.picard.sourceforge.net) was used to add read groups, SAMtools (Li et al., 2009) to generate a BAM file per individual, and GATK (McKenna et al., 2010) to perform realignment. SAMtools and BCFtools were used to generate a VCF file. Only monoallelic and biallelic sites were retained. Single Nucleotide Polymorphisms (SNPs) and invariant sites within 10 bp of an indel were masked. Sites were removed if their depth fell outside the 1st and 99th percentiles of the overall coverage distribution. A further custom Perl script (SNPcleaner, github.com/tplinderoth/ngsQC/tree/master/snpCleaner; Bi et al., 2013, 2019) was used for additional SNP filtering.
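The indel-proximity and depth-percentile filters can be illustrated roughly as follows. This sketch substitutes bcftools for the SNPcleaner-based filtering actually used, assumes the per-site total depth is available in the INFO/DP field, and uses hypothetical file names; it is not the authors' pipeline.

```python
#!/usr/bin/env python3
"""Sketch of two VCF filters described above: masking sites within
10 bp of an indel and removing sites with depth outside the 1st-99th
percentile of overall coverage. Uses bcftools in place of SNPcleaner;
file names and reliance on INFO/DP are assumptions."""

import subprocess
import numpy as np

VCF_IN = "all_sites.vcf.gz"          # hypothetical input VCF
VCF_SNPGAP = "snpgap.vcf.gz"         # after indel-proximity filtering
VCF_OUT = "depth_filtered.vcf.gz"    # final filtered VCF

# 1) Remove SNPs/invariant sites within 10 bp of an indel.
subprocess.run(
    ["bcftools", "filter", "--SnpGap", "10", "-O", "z",
     "-o", VCF_SNPGAP, VCF_IN],
    check=True,
)

# 2) Extract per-site total depth (INFO/DP) and compute percentile cutoffs.
dp_text = subprocess.run(
    ["bcftools", "query", "-f", r"%DP\n", VCF_SNPGAP],
    check=True, capture_output=True, text=True,
).stdout
depths = np.array([int(d) for d in dp_text.split() if d != "."])
lo, hi = np.percentile(depths, [1, 99])

# 3) Remove sites whose depth falls outside the 1st-99th percentile range.
subprocess.run(
    ["bcftools", "filter", "-e", f"INFO/DP<{lo:.0f} || INFO/DP>{hi:.0f}",
     "-O", "z", "-o", VCF_OUT, VCF_SNPGAP],
    check=True,
)
```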
Calling SNPs and genotypes from allele counts can be highly uncertain when coverage is low (Johnson and Slatkin, 2008; Lynch, 2008), which may bias downstream analyses. We therefore compared results based on genotype calls with those based on genotype likelihoods. Genotype likelihoods were estimated in an empirical Bayesian framework implemented in the Analysis of Next-Generation Sequencing Data software (ANGSD) (Korneliussen, Albrechtsen and Nielsen, 2014). For analyses performed in GENODIVE that require called genotypes (Meirmans and Van Tienderen, 2004), we set a genotype posterior probability threshold of 0.95 in ANGSD to output high-confidence genotypes. For downstream analyses based on either genotype likelihoods or genotype calls, we tested the effect of minimum coverage (3X and 10X) and of the amount of missing data allowed (a maximum of 30%, 10%, 5% and 1%).
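To make this concrete, the sketch below shows an ANGSD invocation of the kind described, producing genotype likelihoods plus genotype calls at the 0.95 posterior cutoff, with one missing-data ceiling (30%) translated into the -minInd option. The likelihood model (-GL 1), SNP p-value cutoff, sample size and file names are assumptions; only the 0.95 cutoff and the tested coverage and missing-data levels come from the text.

```python
#!/usr/bin/env python3
"""Sketch of an ANGSD run: genotype likelihoods in Beagle format plus
called genotypes at a posterior cutoff of 0.95, keeping sites with at
most 30% missing data and a per-individual depth of at least 3X.
Model choice, sample size and file names are assumptions."""

import math
import subprocess

N_INDIVIDUALS = 100        # hypothetical sample size
MAX_MISSING = 0.30         # one of the tested thresholds (30/10/5/1%)
MIN_DEPTH = 3              # one of the tested coverage levels (3X, 10X)

# A site is kept only if at least this many individuals have data.
min_ind = math.ceil(N_INDIVIDUALS * (1 - MAX_MISSING))

subprocess.run(
    ["angsd",
     "-bam", "bam.filelist",          # list of per-individual BAM files
     "-ref", "reference.fasta",
     "-GL", "1",                      # SAMtools GL model (assumed)
     "-doGlf", "2",                   # genotype likelihoods, Beagle format
     "-doMajorMinor", "1",            # infer major/minor alleles from GLs
     "-doMaf", "1",                   # estimate allele frequencies
     "-SNP_pval", "1e-6",             # SNP-calling cutoff (assumed)
     "-doPost", "1",                  # genotype posteriors (freq. prior)
     "-doGeno", "2",                  # output called genotypes as 0/1/2
     "-postCutoff", "0.95",           # posterior threshold from the text
     "-minInd", str(min_ind),         # missing-data filter
     "-setMinDepthInd", str(MIN_DEPTH),
     "-out", f"genolike_minind{min_ind}_dp{MIN_DEPTH}"],
    check=True,
)
```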