Bioinformatics
Three different bioinformatics workflows (pipelines) were used to process raw paired-end Illumina data: 1) ESVs (exact sequence variants) pipeline as implemented in DADA2 (Callahan et al. 2016); 2) 95% OTUs (operational taxonomic units) pipeline, where OTUs are clustered at 95% sequence identity; and 3) a pipeline based on OTUs clustered at 97% identity. The processing of sequencing data to generate ESVs and 95% OTUs followed the workflows as described in Tapolczai, Keck, Bouchez, Rimet and Vasselon (2019) and Rivera, Vasselon, Bouchez and Rimet (2020), respectively, except that taxonomy assignment of the representative sequences was performed using blastn algorithm (instead of Naïve Bayesian classifier) (Camacho et al. 2009) with e-value = 0.001, word size = 7, reward = 1, penalty = -1, gap open = 1 and gap extend = 2 (against R-Syst v.7.2 diatom database (Rimet et al.2016)). Based on our positive and negative controls, the 95% OTUs data set was further filtered to exclude low occurrence (≤ 3) reads per OTU per sample to alleviate to ‘tag-switching’ error. The latter was not performed for the ESVs data set as no sequences were observed in the negative controls and no ‘read-leakage’ from the positive control sample. Singleton ESVs/OTUs were discarded from the data sets (i.e. ESVs/OTUs that had only one read across samples).
To generate 97% OTUs, raw paired-end Illumina sequencing data was processed in PipeCraft (Anslan, Bahram, Hiiesalu & Tedersoo 2017), which incorporates all the following tools (except LULU). Reads were assembled and quality filtered using vsearch (fastq_minoverlen 15, maxdiffs 45, fastq_minmergelen 200, fastq_maxee 1, fastq_maxns 0, fastq_truncqual 5, fastq_allowmergestagger) (Rognes, Flouri, Nichols, Quince & Mahé 2016). Chimera filtering step included vsearch uchime_denovo algorithm with options id 0.97 and abskew 2. The filtered sequences were clustered using UPARSE (Edgar 2013) with 97% similarity threshold and minimum cluster size of 2 (i.e. singletons excluded). The obtained OTU table was further curated with post-clustering algorithms as implemented in LULU (Frøslev et al. 2017) to merge consistently co-occurring ‘daughter’ OTUs (minimum_ratio = 1, minimum_ratio_type = “avg”, minimum_relative_cooccurence = 0.8, minimum_match = 96.97). Potential tag-switching errors were also corrected based on negative and positive controls based on relative abundances of sequences in the control samples (Taberlet, Bonin, Coissac & Zinger 2018). To account for unequal sequencing depth, we rarefied samples to common depth of 7000 sequences using the mothur software (Schloss et al. 2009). The latter removed five samples from the data set. Representative sequences per OTU were compared against R-Syst v.7.2 diatom database using blasn as stated above.
The used primers (rbcl-646F and rbcL-998R) amplify DNA also from other algae and bacteria (especially Proteobacteria and Cyanobacteria). To exclude the non-target taxa, only OTUs that demonstrated the match coverage and identity of ≥ 90% against a reference database, were considered as diatom OTUs and included in the final tables produced by each pipeline. According to additional blastn search against NCBI database (Geer et al. 2009), the above threshold was confirmed to include only diatom taxa into the final OTU table. OTUs with lower thresholds to reference diatom sequences (in R-Syst) were often more closely related (based on e-value, sequence similarity and coverage) to other micro-algae (e.g. taxa from Mischococcales, Tribonematales, Eustigmatophyceae), thus excluded from the downstream analyses.
Because of the uncertainty of the most adequate species-level sequence similarity threshold for diatoms, the taxonomic composition comparisons between metabarcoding treatments (HTS 10 and HTS 0.5) and microscopy was performed on genus level. Reliable genus level classification of the OTUs in the HTS data set was here defined when sequence similarity and coverage was ≥ 95% against a reference sequence in the R-Syst database. OTUs that displayed lower values against the R-Syst database sequences were blastn-searched against the NCBI database to check for additional genus-level annotations. Synonym names for genera were also explored in the case of mismatches between microscopy and metabarcoding data sets.