Bioinformatics
The sequencing yielded on average 87.3 million paired-end sequences per
sample (range 77.7–97.8 million sequences). Illumina adapters were
removed with SeqPrep 1.2 (St John, 2014) following default settings with
parameters -A and -B targeting the adapter sequences with identical
selection. Any remaining PhiX sequences in the raw data were removed by
mapping the reads using bowtie2 2.3.4.3 (Langmead & Salzberg, 2012)
against the PhiX genome (NCBI Reference Sequence: NC_001422.1). Final
quality trimming of the data was conducted with Trimmomatic 0.36
(Bolger, Lohse, & Usadel, 2014) with the following parameters:
LEADING:20 TRAILING:20 MINLEN:50. The final quality of the trimmed reads
were checked with FastQC 0.11.5 (Andrews, 2010) and MultiQC 1.7 (Ewels,
Magnusson, Käller, & Lundin, 2016). On average 86.8 million sequences
remained (range 77.3–97.2 million sequences) with a Phred quality score
of 36–37 per base, and an average read length of 144 bp (range 139–147
bp).
Small subunit (SSU) 18S rRNA sequences were extracted from the quality
trimmed data using SortMeRNA 2.1b (Kopylova, Noé, & Touzet, 2012) with
the databases supplied with the software. Taxonomic classification was
conducted with Kraken2 2.0.7 (D. E. Wood et al., 2019) using paired-end
reads against the SILVA SSU r132 NR99 (Quast et al., 2013) (database
downloaded 1 March 2019) and NCBI NT database (database downloaded 12
March 2019). To estimate the relative abundance of each taxa Bracken 2.5
was used on the Kraken2 outputs with default settings on the genus level
(i.e. a count threshold of 10) (Lu et al., 2017). The Bracken output
reports were combined into a biom-format file with the python package
kraken-biom 1.0.1 (using parameters: —fmt hdf5 -max D –min G),
and the biom-format file was converted to a tax table using the python
package biom-format 2.1.7 (McDonald et al., 2012). The final data was
normalized as relative abundances (%), and the data was analyzed in the
software Explicet 2.10.5 (Robertson et al., 2013). Results for i)Nematoda (NCBI NT classifications, on average 478,331 sequences
per sample); ii) Foraminifera (NCBI NT, average 13,913
sequences), and iii) Ciliophora (SILVA, average 774,027
sequences) were extracted and analyzed separately. The NCBI NT database
was used for the Nematoda and Foraminifera data because,
1) the SILVA database is known to contain errors in the nematode
classifications (Elias Broman et al., 2019; Holovachov, Haenel, Bourlat,
& Jondelius, 2017), and the NCBI NT has previously been used to discern
differences in nematode communities on a spatial scale in the Baltic Sea
(Elias Broman et al., 2019); and 2) the SILVA database gave inaccurate
classifications for Foraminifera , resulting in the identification
of taxa never discovered in the Baltic Sea (more details in the
discussion).