Ease of use
NGSpeciesID was designed to be straightforward to use. It works on
individual read files, output either directly by the basecaller or
after demultiplexing (e.g. using Minibar (Krehenwinkel et al. 2019a) or
qcat (https://github.com/nanoporetech/qcat)), and can easily be
adapted to run in a loop over multiple fastq files using a bash script
(see Supplementary File 14). It only requires fastq files as input. In
contrast, ONTrack requires the input reads in three formats (fast5,
fasta and fastq), which necessitates additional preprocessing of the
sequencing data. Furthermore, NGSpeciesID allows fastq files to have any
naming structure, making it easy for the user to run the tool and to
identify samples and replicates. This saves time on preprocessing of the
read data compared to other software solutions.
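The loop over multiple fastq files described above can be sketched as follows. This is a minimal illustration and not the script provided in Supplementary File 14: the helper name `build_commands`, the input directory layout and the output folder naming are hypothetical, and the command-line flags shown (`--ont`, `--fastq`, `--outfolder`, `--consensus`, `--medaka`) are assumed from the NGSpeciesID documentation and should be checked against the installed version.

```python
from pathlib import Path


def build_commands(fastq_dir, out_root):
    """Return one NGSpeciesID command (as an argument list) per fastq file.

    Hypothetical helper: the flags below are assumptions and should be
    verified with `NGSpeciesID --help` before use.
    """
    commands = []
    for fastq in sorted(Path(fastq_dir).glob("*.fastq")):
        # Any file name is accepted; the file stem is reused as the
        # name of the per-sample output folder.
        outfolder = Path(out_root) / fastq.stem
        commands.append([
            "NGSpeciesID", "--ont", "--consensus", "--medaka",
            "--fastq", str(fastq),
            "--outfolder", str(outfolder),
        ])
    return commands
```

Each returned list could then be passed to `subprocess.run(cmd, check=True)` to process the samples one after another. Building the commands separately from running them makes it possible to inspect them first.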
NGSpeciesID filters reads by quality based on their Phred scores.
However, we also recommend using NanoFilt (De Coster et al., 2018) to
remove reads much shorter or longer than the intended target, which
often represent chimeras or contamination, before running
NGSpeciesID. While our tool can handle unfiltered data, such reads might
result in the generation of multiple consensus sequences. NGSpeciesID
also offers the option to remove priming sites from the amplicon
sequences. As many universal primers include ambiguity codes, primer
regions can potentially include incorrect bases, and should thus be
removed. We further found that primer regions can cause issues for the
reverse-complement matching. We thus included an additional
reverse-complement matching step after primer removal, in case
NGSpeciesID outputs multiple consensus sequences. Our tool outputs
multiple consensus sequences when the clustering yields more than one
cluster containing over a certain percentage of the total reads (by
default this is set to 10%). Each consensus sequence is polished only
with the reads from its own cluster. This feature is very useful as
it allows the user to explore potential contaminant reads or mixed
samples by generating multiple consensus sequences.
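The two filtering ideas in this paragraph, length filtering before the run and the read-abundance threshold on clusters, can be illustrated with a short sketch. It is not the NGSpeciesID or NanoFilt implementation. The function names, the `(name, sequence, quality)` record layout and the parameter name `abundance_ratio` are assumptions made for illustration, and the strict "greater than" comparison is one reading of "over a certain percentage".

```python
def filter_by_length(records, min_len, max_len):
    """Keep (name, sequence, quality) records whose read length is in bounds.

    Stand-in for a length filter such as the one NanoFilt provides;
    the bounds should bracket the expected amplicon length.
    """
    return [r for r in records if min_len <= len(r[1]) <= max_len]


def clusters_to_polish(cluster_sizes, abundance_ratio=0.1):
    """Return ids of clusters holding more than `abundance_ratio` of all reads.

    `cluster_sizes` maps a cluster id to its number of reads. Each
    selected cluster would get its own consensus sequence, polished
    only with the reads assigned to that cluster.
    """
    total = sum(cluster_sizes.values())
    if total == 0:
        return []
    return sorted(
        cid for cid, size in cluster_sizes.items()
        if size / total > abundance_ratio
    )
```

For example, with 100 reads split into clusters of 80, 15 and 5 reads and the default threshold of 10%, the first two clusters would each yield a consensus sequence, while the third would be discarded.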
NGSpeciesID and the Mothur + Consension software solution can both
handle ONT and PacBio long-read data. While both tools produce consensus
sequences of similar accuracy, Mothur + Consension requires in-depth
knowledge of the pipeline: (i) the input data must be preprocessed,
(ii) individual components of the pipeline must be run separately and
(iii) the parameter settings are difficult to interpret. In contrast,
NGSpeciesID is designed to be user-friendly and is packaged as a
one-command solution.