SWARM clustering analyses for MOTU identification
We applied a second bioinformatic workflow to cluster sequences into
taxonomic units without requiring a complete reference database to
estimate richness and MOTU composition (Marques et al. 2020). We used
the SWARM sequence clustering algorithm, which groups multiple sequence
variants into MOTUs (Molecular Operational Taxonomic Units; Mahé et al.
2014, Rognes et al. 2016). Reads were assembled using VSEARCH (Rognes et
al. 2016), then demultiplexed and trimmed using CUTADAPT (Martin 2013), and
clustering was performed using SWARM (Mahé et al. 2014) with a minimum
distance of 1 between clusters. The clustering algorithm uses
sequence similarity and abundance patterns to delineate meaningful
entities by grouping sequence variants together. Once MOTUs are
generated, the most abundant sequence within each cluster is used as a
representative sequence for taxonomic assignment. A post-clustering
curation algorithm (LULU; Frøslev et al. 2017) was then applied to
the data. The taxonomic assignment was performed using
the ECOTAG program against the NCBI database. The taxonomic level of
assignments was determined by the result of the ECOTAG program and the
percentage of similarity between the sequences in the sample and those
in the reference database. Taxonomic levels were corrected using the
same thresholds as in the OBITools pipeline. Cleaning filters
were then applied to remove sequences most likely corresponding to
errors and non-specific amplifications: (i) removal of amplicons with
fewer than 10 reads per PCR replicate, (ii) removal of non-specific
amplifications (non-fish), (iii) removal of all sequences found in only
one PCR in the entire data set, and (iv) removal of tag-jumps and
index-hopping (described above).
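
As an illustration only, cleaning filters (i) to (iii) above could be sketched as follows. This is a minimal sketch and not the code used in the study: the function name, the (MOTU, PCR, read count) record layout, the example identifiers and the choice to apply the filters sequentially in the listed order are all assumptions made for the example. Filter (iv) is omitted because it is described elsewhere.

```python
from collections import defaultdict

MIN_READS = 10  # threshold from filter (i) in the text


def clean_motu_table(records, fish_motus):
    """Apply filters (i)-(iii) to (motu_id, pcr_id, read_count) tuples.

    `fish_motus` is a hypothetical set of MOTU ids assigned to fish.
    Filters are applied in the listed order (an assumption).
    """
    # (i) drop occurrences with fewer than 10 reads in a PCR replicate
    kept = [r for r in records if r[2] >= MIN_READS]

    # (ii) drop non-specific amplifications (MOTUs not assigned to fish)
    kept = [r for r in kept if r[0] in fish_motus]

    # (iii) drop MOTUs detected in only one PCR across the whole data set
    pcrs_per_motu = defaultdict(set)
    for motu, pcr, _ in kept:
        pcrs_per_motu[motu].add(pcr)
    return [r for r in kept if len(pcrs_per_motu[r[0]]) > 1]


# Hypothetical toy data, for illustration only
example_records = [
    ("motu_A", "pcr1", 120), ("motu_A", "pcr2", 45),
    ("motu_B", "pcr1", 8), ("motu_B", "pcr2", 30),
    ("motu_C", "pcr1", 200), ("motu_C", "pcr3", 60),
    ("motu_D", "pcr3", 15),
]
example_fish = {"motu_A", "motu_B", "motu_D"}

cleaned = clean_motu_table(example_records, example_fish)
```

In this toy data only motu_A is retained: motu_B falls to a single PCR once its 8-read occurrence is dropped, motu_C is not in the fish set, and motu_D occurs in one PCR only. Applying the filters in a different order could change the outcome for cases like motu_B.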