SWARM clustering analyses for MOTU identification
To estimate richness and MOTU composition without requiring a complete reference database, we applied a second bioinformatic workflow that clusters sequences into taxonomic units (Marques et al. 2020). We used the SWARM clustering algorithm, which groups multiple sequence variants into MOTUs (Molecular Operational Taxonomic Units; Mahé et al. 2014, Rognes et al. 2016). Reads were assembled using VSEARCH (Rognes et al. 2016), then demultiplexed and trimmed using CUTADAPT (Martin 2013), and clustering was performed using SWARM (Mahé et al. 2014) with a minimum distance of 1 between clusters. The clustering algorithm uses sequence similarity and abundance patterns to delineate meaningful entities by grouping sequence variants together. Once MOTUs were generated, the most abundant sequence within each cluster was used as the representative sequence for taxonomic assignment. A post-clustering curation algorithm (LULU; Frøslev et al. 2017) was then applied to the data. Taxonomic assignment was performed using the ECOTAG program against the NCBI database. The taxonomic level of each assignment was determined from the ECOTAG result and the percentage of similarity between the sequences in the sample and those in the reference database. Taxonomic levels were corrected using the same thresholds as in the ObiTools pipeline. Cleaning filters were then applied to remove sequences most likely corresponding to errors and non-specific amplifications: (i) removal of amplicons with fewer than 10 reads per PCR replicate, (ii) removal of non-specific amplifications (non-fish), (iii) removal of all sequences found in only one PCR in the entire data set and (iv) removal of tag-jumps and index-hopping (described above).
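As an illustration only, cleaning filters (i) and (iii) can be sketched in Python as follows. This is a minimal sketch and not the code used in the workflow. The function name clean_motu_table, the table layout (a dictionary keyed by MOTU and PCR replicate) and the example identifiers are hypothetical, and the sketch assumes that filter (iii) is evaluated on the occurrences that remain after filter (i).

```python
from collections import defaultdict

MIN_READS_PER_PCR = 10  # threshold stated for filter (i)


def clean_motu_table(counts, min_reads=MIN_READS_PER_PCR):
    """Apply filters (i) and (iii) to a {(motu, pcr): reads} table.

    Hypothetical helper: the table layout is an assumption made
    for this sketch, not the format used in the actual workflow.
    """
    # Filter (i): drop occurrences with fewer than min_reads reads
    # in a given PCR replicate.
    kept = {key: n for key, n in counts.items() if n >= min_reads}

    # Record the PCR replicates in which each MOTU is still present.
    pcrs_per_motu = defaultdict(set)
    for motu, pcr in kept:
        pcrs_per_motu[motu].add(pcr)

    # Filter (iii): drop MOTUs found in only one PCR in the whole data set.
    return {
        (motu, pcr): n
        for (motu, pcr), n in kept.items()
        if len(pcrs_per_motu[motu]) > 1
    }


# Hypothetical example data: read counts per MOTU and PCR replicate.
example = {
    ("motu_a", "pcr1"): 120,
    ("motu_a", "pcr2"): 45,
    ("motu_b", "pcr1"): 300,  # present in a single PCR only
    ("motu_c", "pcr1"): 8,    # below the read threshold
    ("motu_c", "pcr2"): 15,
}
cleaned = clean_motu_table(example)
```

In this example only motu_a is retained: motu_b occurs in a single PCR, and motu_c falls to a single PCR once its 8-read occurrence has been removed. Filters (ii) and (iv) are not sketched here because they depend on the taxonomic assignments and on the tag and index design described earlier.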