Reference assembly, bioinformatic filtering and genotype calling
Custom Perl scripts were used for processing the resulting sequences
(RADTOOLKIT v. 0.13.10, available in the Dryad Digital Repository). Raw
fastq reads were demultiplexed allowing a maximum of one barcode
mismatch, and reads were removed if the expected cut site was not found.
Demultiplexed reads were then trimmed of Illumina adapter contamination
and filtered for low-quality sequence using cutadapt (Martin, 2011) and
Trimmomatic (Bolger, Lohse and Usadel, 2014). Cleaned reads were
clustered with CD-HIT (Li and Godzik, 2006; Fu et al., 2012), with a
minimum support of three reads per cluster, and a representative
sequence was retained for each cluster.
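Beyond the three-read cluster minimum, the trimming and clustering parameters are not reported; the command-line sketch below illustrates the general form of this step, where the adapter sequence, quality cut-offs and the identity threshold are assumed placeholder values rather than the settings actually used:

    # Adapter and quality trimming (adapter sequence and cut-offs are placeholders)
    cutadapt -a AGATCGGAAGAGC -q 20 -m 36 -o sample1.cutadapt.fq sample1.fq
    java -jar trimmomatic.jar SE -phred33 sample1.cutadapt.fq sample1.clean.fq \
        ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 SLIDINGWINDOW:4:15 MINLEN:36

    # Cluster cleaned reads with CD-HIT-EST (identity threshold is an assumed value);
    # reads are converted from FASTQ to FASTA beforehand, and clusters supported by
    # fewer than three reads are removed by parsing the .clstr output (not shown)
    cd-hit-est -i sample1.clean.fasta -o sample1.clusters -c 0.95 -T 4 -M 2000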
RepeatMasker (http://repeatmasker.org/) was used to mask putative
repetitive elements, low-complexity regions and short repeats using
’Mytilidae’ as a database (Smit, Hubley and Green, 2014). Loci were
discarded if >60% of nucleotides per locus were Ns. The
resulting RAD loci were combined across all individuals, and a reference
was built from loci shared by at least 70% of individuals.
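A sketch of the masking step is given below; apart from the ’Mytilidae’ library, the options shown (e.g. the number of parallel jobs) are assumptions, and the subsequent N-content and locus-sharing filters were applied with custom scripts whose interfaces are not reproduced here:

    # Mask repetitive elements, low-complexity regions and short repeats
    # using the Mytilidae repeat library
    RepeatMasker -species "Mytilidae" -pa 4 consensus_loci.fasta
    # Loci with >60% Ns and loci shared by <70% of individuals are then removed
    # before building the final reference (custom filtering not shown)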
Cleaned sequence reads for each individual were aligned separately to
the de novo reference using Novoalign
(http://www.novocraft.com), and only uniquely mapping reads were
retained. Picard (http://www.picard.sourceforge.net) was used to add
read groups, SAMtools (Li et al., 2009) to generate a BAM file
per individual, and GATK (McKenna et al., 2010) to perform
realignment. SAMtools and BCFtools were then used to generate a VCF file.
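A minimal sketch of the per-individual alignment and joint VCF steps follows; the file names, read-group fields and GATK 3-style realignment commands are assumptions based on the tools named above, not the exact commands used:

    # Index the de novo reference and align one individual with Novoalign;
    # -r None reports only uniquely mapping reads
    novoindex reference.nix reference.fasta
    novoalign -d reference.nix -f ind1.clean.fq -o SAM -r None > ind1.sam

    # Add read groups and coordinate-sort with Picard, then index with SAMtools
    java -jar picard.jar AddOrReplaceReadGroups I=ind1.sam O=ind1.rg.bam \
        RGID=ind1 RGLB=lib1 RGPL=ILLUMINA RGPU=unit1 RGSM=ind1 \
        SORT_ORDER=coordinate
    samtools index ind1.rg.bam

    # Local realignment around indels (GATK 3 syntax assumed)
    java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fasta \
        -I ind1.rg.bam -o ind1.intervals
    java -jar GenomeAnalysisTK.jar -T IndelRealigner -R reference.fasta \
        -I ind1.rg.bam -targetIntervals ind1.intervals -o ind1.realigned.bam

    # Multi-sample VCF with SAMtools/BCFtools (legacy mpileup syntax assumed)
    samtools mpileup -ugf reference.fasta *.realigned.bam | \
        bcftools call -m -O v -o all_individuals.vcf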
Only monoallelic and biallelic sites were retained. Single Nucleotide
Polymorphisms (SNPs) and invariant sites were masked within 10 bp of an
indel. Sites were removed if their depth fell outside the 1st and 99th
percentiles of the overall coverage distribution. Another custom Perl
script (SNPcleaner, github.com/tplinderoth/ngsQC/tree/master/snpCleaner;
Bi et al., 2013, 2019) was used for further filtering of SNPs.
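The site-level filters could, for instance, be expressed with BCFtools as in the sketch below; the depth limits shown stand in for the cohort-specific 1st and 99th percentile values, and the SNPcleaner options are not reproduced here:

    # Keep sites with at most two alleles and remove SNPs within 10 bp of an
    # indel (--SnpGap); DP limits are placeholders for the percentile cut-offs
    bcftools filter --SnpGap 10 -O u all_individuals.vcf | \
        bcftools view -M 2 -e 'INFO/DP < 50 || INFO/DP > 5000' -O v \
        -o all_individuals.filtered.vcf -
    # Further SNP filtering with SNPcleaner (options omitted; see
    # github.com/tplinderoth/ngsQC/tree/master/snpCleaner)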
Calling SNPs and genotypes based on allele counts may be highly
uncertain if coverage is low (Johnson and Slatkin, 2008; Lynch, 2008),
which subsequently may bias downstream analyses. Therefore, we compared
results from genotype calls and genotype likelihoods. Genotype
likelihoods were estimated within an empirical Bayesian framework
implemented in Analysis of Next-Generation Sequencing Data (ANGSD)
(Korneliussen, Albrechtsen and Nielsen, 2014). We set a genotype
posterior probability threshold of 0.95 in ANGSD to output
high-confidence genotype calls for the analyses performed in GENODIVE
(Meirmans and Van Tienderen, 2004), which require called genotypes. For
downstream analyses based on either genotype likelihoods or genotype
calls, we tested the effect of coverage (3X and 10X) and of the amount
of missing data allowed (a maximum of 30%, 10%, 5% or 1%).
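As an illustration of how these settings map onto ANGSD options, the sketch below combines genotype-likelihood estimation with a 0.95 posterior cut-off for called genotypes; all values other than the 0.95 cut-off (the per-individual depth, minimum number of individuals and quality filters) are placeholders, not the thresholds actually used:

    # Genotype likelihoods (SAMtools GL model) plus called genotypes at a 0.95
    # posterior probability cut-off; -minInd and -setMinDepthInd illustrate the
    # missing-data and coverage filters (values shown are placeholders)
    angsd -bam bam.filelist -ref reference.fasta -out angsd_run \
        -GL 1 -doGlf 2 -doMajorMinor 1 -doMaf 1 -SNP_pval 1e-6 \
        -doPost 1 -doGeno 4 -postCutoff 0.95 \
        -minMapQ 30 -minQ 20 -doCounts 1 -setMinDepthInd 3 -minInd 28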