2.3.2. Taxonomic classification
The merged sequences were placed into the Greengenes reference phylogeny (sepp-refs-gg-13-8) using q2-fragment-insertion [22]. This technique, the QIIME2 plugin for SATé-enabled phylogenetic placement (SEPP), was used to resolve incorrect taxonomic and phylogenetic assignments that arise from differences among 16S rDNA hypervariable regions and from merging reads of variable length [22]. Core diversity metrics were calculated both before and after removing mitochondrial (mtDNA) and chloroplast (clDNA) sequences from the datasets, in order to assess the impact of these sequences on diversity. The mtDNA- and clDNA-filtered datasets were then used in QIIME2 to calculate diversity, assign taxonomy, identify important (core) s-OTUs and estimate differences in composition; the diversity graphs were also plotted within QIIME2. Beta diversity was computed from unweighted UniFrac, weighted UniFrac and Jaccard distance matrices, and the results were visualised using Principal Coordinates Analysis (PCoA) in QIIME2. Permutational Multivariate Analysis of Variance (PERMANOVA) [23] was performed on the unweighted UniFrac, weighted UniFrac and Jaccard distance matrices within QIIME2. For taxonomic assignment, we used the standard pre-trained Greengenes library (gg_13_8_99_OTU_full-length) [24], the SILVA reference database (SILVA_188_99_OTUs full-length) [25] and the fragment-insertion reference dataset (ref-gg-99-taxonomy). Of these, only the results obtained with the fragment-insertion reference dataset are discussed here.
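As an illustration of the beta-diversity testing described above, the following minimal Python sketch computes Jaccard distances from a feature table and a PERMANOVA pseudo-F statistic with a permutation p-value. The count table, group labels and function names are hypothetical and chosen only for illustration. The analyses in this study were run with the QIIME2 implementations, and the UniFrac distances, which require a phylogeny, are not reproduced here.

```python
import random


def jaccard_distance(a, b):
    """Jaccard distance between two samples, based on feature presence/absence."""
    present_a = {i for i, count in enumerate(a) if count > 0}
    present_b = {i for i, count in enumerate(b) if count > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1.0 - len(present_a & present_b) / len(union)


def distance_matrix(table):
    """Square matrix of pairwise Jaccard distances between samples (rows)."""
    n = len(table)
    return [[jaccard_distance(table[i], table[j]) for j in range(n)] for i in range(n)]


def pseudo_f(dist, groups):
    """PERMANOVA pseudo-F computed from squared distances and group labels."""
    n = len(groups)
    labels = sorted(set(groups))
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for label in labels:
        members = [i for i, g in enumerate(groups) if g == label]
        pairs = sum(dist[i][j] ** 2 for k, i in enumerate(members) for j in members[k + 1:])
        ss_within += pairs / len(members)
    ss_between = ss_total - ss_within
    return (ss_between / (len(labels) - 1)) / (ss_within / (n - len(labels)))


def permanova(dist, groups, permutations=999, seed=42):
    """Observed pseudo-F and a p-value obtained by shuffling the group labels."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Hypothetical s-OTU counts: rows are samples, columns are s-OTUs.
table = [
    [5, 3, 0, 0, 1],
    [4, 0, 0, 0, 2],
    [6, 2, 1, 0, 0],
    [0, 0, 4, 5, 0],
    [0, 1, 3, 6, 0],
    [0, 0, 0, 7, 2],
]
# Hypothetical copepod genus label for each sample.
groups = ["genus_A", "genus_A", "genus_A", "genus_B", "genus_B", "genus_B"]

dist = distance_matrix(table)
f_stat, p_value = permanova(dist, groups)
print(f"pseudo-F = {f_stat:.3f}, p = {p_value:.3f}")
```

With only six samples the number of distinct label permutations is small, so the p-value from such a toy table cannot become very small; real datasets with more samples per genus give the test more resolution.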
We also applied Analysis of Composition of Microbiomes (ANCOM) [26], implemented as a QIIME2 plugin, to identify the bacteria that differed significantly between the copepod genera. ANCOM reports an F-statistic and a W-statistic: W represents the strength of the ANCOM test, that is, the number of tested species relative to which a given species differs significantly, and F measures the effect size of the difference for a particular species between the groups (copepod genera). To predict the important bacteria associated with the copepods, we used two supervised machine learning (SML) classifiers available in QIIME2: the Random Forest Classifier (RFC) [27] and the Gradient Boosting Classifier (GBC) [28]. Random Forest is among the most accurate learning algorithms for large and noisy datasets; it often copes with unbalanced sample distributions, is less susceptible to overfitting and generates unbiased classifiers [29]. Gradient boosting combines several weak learners sequentially, with each new tree fitted using the loss from the previous trees to improve the classification. This technique is less prone to overfitting and does not suffer from the curse of dimensionality, but it is susceptible to noisy data and outliers [30].
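The boosting idea described above can be sketched in a few lines of Python. This is a toy illustration, assuming one-split decision stumps as weak learners and a logistic loss; it is not the classifier used by QIIME2, and the counts, labels and function names are hypothetical. Each round fits a stump to the residuals left by the previous rounds, and the number of times a feature is chosen serves as a crude indicator of its importance.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Return (feature, threshold, left_value, right_value) minimising squared error."""
    best = None
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[f] <= thr]
            right = [r for row, r in zip(X, residuals) if row[f] > thr]
            lmean = sum(left) / len(left)
            rmean = sum(right) / len(right)
            sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, f, thr, lmean, rmean)
    return best[1:]


def stump_output(stump, row):
    f, thr, lmean, rmean = stump
    return lmean if row[f] <= thr else rmean


def fit_gradient_boosting(X, y, n_rounds=10, learning_rate=0.5):
    """Fit stumps sequentially to the negative gradient of the logistic loss."""
    scores = [0.0] * len(X)
    stumps = []
    for _ in range(n_rounds):
        # Residual = observed label minus the current predicted probability.
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        scores = [s + learning_rate * stump_output(stump, row) for s, row in zip(scores, X)]
    return {"stumps": stumps, "learning_rate": learning_rate}


def predict(model, row):
    score = sum(model["learning_rate"] * stump_output(s, row) for s in model["stumps"])
    return 1 if sigmoid(score) >= 0.5 else 0


# Hypothetical counts of three bacterial taxa (columns) in six samples (rows).
X = [
    [40, 5, 12],
    [35, 9, 3],
    [38, 2, 8],
    [4, 7, 10],
    [6, 3, 5],
    [2, 8, 11],
]
# Hypothetical labels: 1 and 0 stand for two copepod genera.
y = [1, 1, 1, 0, 0, 0]

model = fit_gradient_boosting(X, y)
predictions = [predict(model, row) for row in X]

# Count how often each taxon was selected as the splitting feature.
usage = [0] * len(X[0])
for stump in model["stumps"]:
    usage[stump[0]] += 1
print(predictions, usage)
```

In this toy table only the first taxon separates the two labels, so every round splits on it. In practice the model would be evaluated on held-out samples rather than on the training data.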
The mtDNA- and clDNA-filtered table and representative sequences were also used as input for predicting the potential metabolic functions of CAB with Phylogenetic Investigation of Communities by Reconstruction of Unobserved States (PICRUSt2) [19]. The resulting KEGG abundance data were analysed in Statistical Analysis of Taxonomic and Functional Profiles (STAMP), which includes Principal Component Analysis (PCA) [31]. Significant differences in the potential functions of CAB between the copepod genera were identified using the Kruskal–Wallis H-test [32] with the Tukey–Kramer post hoc test [33]. The KEGG metabolic maps [34-36] were used as a reference to draw the figure representing the copepod genera with a high proportion of potential functional genes.
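To make the Kruskal–Wallis comparison concrete, the sketch below computes the H statistic for one predicted function across three groups. It is a minimal illustration, not the STAMP implementation: the abundance values, group names and function names are hypothetical, and the p-value line relies on the fact that with three groups H is compared against a chi-square distribution with two degrees of freedom, whose upper tail is exp(-H/2).

```python
import math


def average_ranks(values):
    """Rank values from 1 to N, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with the standard correction for ties."""
    pooled = [v for g in groups for v in g]
    n_total = len(pooled)
    ranks, tie_sizes = average_ranks(pooled)
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n_total * (n_total + 1)) * total - 3.0 * (n_total + 1)
    correction = 1.0 - sum(t ** 3 - t for t in tie_sizes) / (n_total ** 3 - n_total)
    return h / correction


# Hypothetical relative abundances (%) of one KEGG pathway in three copepod genera.
genus_a = [0.8, 1.1, 1.3]
genus_b = [2.0, 2.4, 2.9]
genus_c = [3.5, 4.1, 4.4]

h_stat = kruskal_wallis_h([genus_a, genus_b, genus_c])
# Valid only for three groups (two degrees of freedom).
p_value = math.exp(-h_stat / 2.0)
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")
```

A significant H only indicates that at least one genus differs; identifying which pairs differ is the role of the post hoc test.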