Comparison to ONTrack
Next, we compared the performance of NGSpeciesID to the pipeline ONTrack from Maestri et al. (2019). This pipeline first clusters all reads using VSEARCH (Rognes et al., 2016), then randomly selects 200 reads, aligns those with Mafft (Katoh and Standley, 2013), calls the consensus with EMBOSS cons (http://emboss.sourceforge.net/apps/cvs/emboss/apps/cons.html), and lastly carries out polishing with 200 randomly selected reads using Nanopolish (https://github.com/jts/nanopolish). We generated consensus sequences for all seven DNA barcodes from Maestri et al. (2019), which comprise Cytochrome C Oxidase Subunit 1 (COI) sequences of two snails and five beetles (Supplementary Table 1). We provide the respective alignments in the Supplementary Material (Supplementary files 7-13).
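Conceptually, the consensus-calling step of this pipeline (EMBOSS cons) reduces to a per-column vote over the multiple sequence alignment produced by Mafft. The following is a minimal illustrative sketch of that idea in Python; it is not the actual EMBOSS implementation, which additionally supports plurality thresholds and ambiguity codes.

```python
from collections import Counter

def simple_consensus(aligned_reads):
    """Majority-vote consensus over equal-length aligned sequences.
    Illustrative sketch only; assumes a gap-containing MSA as input."""
    consensus = []
    for column in zip(*aligned_reads):
        base, _ = Counter(column).most_common(1)[0]
        if base != "-":          # skip columns where the gap wins the vote
            consensus.append(base)
    return "".join(consensus)

aln = ["ACG-TA", "ACGTTA", "ACG-TA"]
print(simple_consensus(aln))  # ACGTA
```

In practice the polishing step (Nanopolish) then corrects this draft consensus using the raw signal data, which a simple vote cannot capture.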
Previously, Krehenwinkel et al. (2019a) showed that consensus accuracy can decrease when too many reads (on the order of a few hundred, depending on the error rate of the individual reads) are selected for consensus generation, likely due to a decrease in the signal-to-noise ratio. We thus randomly subsampled 300 reads using seqtk (https://github.com/lh3/seqtk), a number which has been shown to work well with Nanopore data (Krehenwinkel et al., 2019a). We find that the consensus quality is comparable between the two tools (Table 2), with accuracies ranging from 99.8% to 100%. For five of the seven DNA barcode sets, both tools performed equally well; each tool outperformed the other on one of the remaining two sets, in both cases by only a single base pair (Table 2).
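The subsampling step above corresponds to a command such as `seqtk sample -s<seed> reads.fastq 300`, i.e. a uniform random draw of 300 reads. A self-contained sketch of the same operation via reservoir sampling is shown below; the function name and record representation are illustrative, not part of NGSpeciesID, ONTrack, or seqtk.

```python
import random

def subsample_reads(records, n, seed=11):
    """Uniformly subsample n records from an iterable of FASTQ records
    (reservoir sampling). Illustrative stand-in for `seqtk sample`;
    records are assumed to be 4-line FASTQ entries kept as tuples."""
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < n:
            reservoir.append(rec)
        else:
            j = rng.randint(0, i)
            if j < n:
                reservoir[j] = rec
    return reservoir

# toy example: 1000 dummy reads, keep 300
reads = [(f"@read{i}", "ACGT", "+", "IIII") for i in range(1000)]
subset = subsample_reads(reads, 300)
print(len(subset))  # 300
```

Fixing the seed makes the subsample reproducible, which matters when comparing consensus accuracy between tools on the same read set.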