Arthrobotrys oligospora
ITS, TUB, TEF and RPB2 fragments were sequenced for the strains of
Arthrobotrys oligospora, which were found at every sampling site. The
four sequences from each strain were compiled in a txt file and
converted into a fasta file. MAFFT version 7 was used to generate the
multiple sequence alignment (aligning homologous sites), and BioEdit was
used to manually refine the alignment. We used jModelTest to select the
best-fit substitution model for the concatenated sequences. The
phylogenetic tree was constructed with the Maximum Likelihood (ML)
method under a partitioned analysis using IQ-TREE version 1.6.5. We used
FigTree version 1.3.1, Microsoft Word (Microsoft Office 2007) and
Photoshop (Adobe Photoshop CS5 V12.0) to view and edit the phylogenetic
tree.
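
To illustrate the concatenation step that precedes a partitioned ML analysis, the following is a minimal Python sketch, not the procedure used in the study. The strain names, toy sequences and the helper names `concatenate` and `to_fasta` are hypothetical. It assumes each locus has already been aligned, joins the four loci per strain, and records the 1-based column range of each locus so that a partition definition can be written for the tree-building software.

```python
# Hypothetical toy alignments (NOT data from the study):
# one dict per locus, mapping strain name -> aligned sequence.
loci = {
    "ITS":  {"strainA": "ACGT-A", "strainB": "ACGTTA"},
    "TUB":  {"strainA": "GGCC",   "strainB": "GGCT"},
    "TEF":  {"strainA": "TTAAC",  "strainB": "TTA-C"},
    "RPB2": {"strainA": "CAGCAG", "strainB": "CAGCAA"},
}


def concatenate(loci):
    """Join per-locus alignments into one supermatrix.

    Returns the concatenated sequences and a list of
    (locus, first_column, last_column) tuples with 1-based columns.
    """
    # Keep only strains that have a sequence for every locus.
    strains = sorted(set.intersection(*(set(aln) for aln in loci.values())))
    supermatrix = {strain: "" for strain in strains}
    partitions = []
    start = 1
    for name, aln in loci.items():
        lengths = {len(aln[strain]) for strain in strains}
        if len(lengths) != 1:
            raise ValueError(f"{name}: sequences are not aligned to equal length")
        length = lengths.pop()
        for strain in strains:
            supermatrix[strain] += aln[strain]
        partitions.append((name, start, start + length - 1))
        start += length
    return supermatrix, partitions


def to_fasta(seqs):
    """Format a name -> sequence mapping as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in seqs.items())


supermatrix, partitions = concatenate(loci)
fasta_text = to_fasta(supermatrix)
```

Recording the column ranges at concatenation time keeps the partition boundaries consistent with the alignment, which matters when each locus is given its own substitution model.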
1.5.3 Biogeographical distribution of NTF in Yunnan
A first prediction was performed using watershed as the sole predictor
of the distribution of Arthrobotrys oligospora phylogenetic clades. We
assigned clades to watersheds using the majority rule and assessed
classification accuracy using error matrix metrics.
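
The majority-rule assignment and the error matrix can be sketched as follows. This is an illustrative Python example under stated assumptions: the watershed labels, clade labels and function names are hypothetical, and overall accuracy is shown as one possible error matrix metric, since the text does not list the specific metrics.

```python
from collections import Counter, defaultdict

# Hypothetical records (watershed, observed clade); not data from the study.
samples = [
    ("W1", "A"), ("W1", "A"), ("W1", "B"),
    ("W2", "B"), ("W2", "B"), ("W2", "C"),
    ("W3", "C"), ("W3", "A"), ("W3", "C"),
]


def majority_clade(samples):
    """Assign to each watershed the clade observed most often in it."""
    counts = defaultdict(Counter)
    for watershed, clade in samples:
        counts[watershed][clade] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}


def error_matrix(samples, rule):
    """Count (observed clade, predicted clade) pairs."""
    matrix = Counter()
    for watershed, clade in samples:
        matrix[(clade, rule[watershed])] += 1
    return matrix


def overall_accuracy(matrix):
    """Share of samples on the diagonal of the error matrix."""
    correct = sum(n for (obs, pred), n in matrix.items() if obs == pred)
    return correct / sum(matrix.values())


rule = majority_clade(samples)
matrix = error_matrix(samples, rule)
accuracy = overall_accuracy(matrix)
```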
From the original dataset, we generated 45 different train/test splits,
with train ratios ranging from 30% to 70% in steps of 5 percentage
points (nine ratios). Each ratio included five different mixes of
train/test samples. Voronoi (Thiessen) polygons were computed around the
points of each training set and assigned the clade of their originating
point. The polygons were then used to predict the clades of the test
dataset.
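
The splitting scheme and the polygon-based prediction can be sketched in Python as below. This is a minimal illustration, not the GIS workflow used in the study. It relies on the fact that a point falls inside the Voronoi polygon of its nearest generating point, so the prediction reduces to a nearest-neighbour lookup. It assumes planar (projected) coordinates and random shuffling for the five mixes per ratio; the function names, seed and coordinates are hypothetical.

```python
import math
import random


def make_splits(n_samples, ratios=range(30, 75, 5), repeats=5, seed=0):
    """Return a list of (train_percent, train_indices, test_indices).

    Nine ratios (30, 35, ..., 70) times five repeats gives 45 splits.
    Random shuffling is an assumption of this sketch.
    """
    rng = random.Random(seed)
    splits = []
    for pct in ratios:
        n_train = round(n_samples * pct / 100)
        for _ in range(repeats):
            idx = list(range(n_samples))
            rng.shuffle(idx)
            splits.append((pct, idx[:n_train], idx[n_train:]))
    return splits


def voronoi_predict(train_points, train_clades, query):
    """Predict the clade of `query` from the Voronoi cell it falls in.

    The cell containing a point belongs to the nearest training point,
    so no explicit polygon geometry is needed for prediction.
    """
    nearest = min(range(len(train_points)),
                  key=lambda i: math.dist(train_points[i], query))
    return train_clades[nearest]


# Hypothetical projected coordinates and clades.
train_points = [(0.0, 0.0), (10.0, 10.0)]
train_clades = ["A", "B"]
splits = make_splits(n_samples=20)
pred_near_first = voronoi_predict(train_points, train_clades, (1.0, 2.0))
pred_near_second = voronoi_predict(train_points, train_clades, (9.0, 8.0))
```

With geographic (longitude/latitude) coordinates, the Euclidean distance used here would need to be replaced by a projection or a great-circle distance.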
1.5.4 Multivariate machine learning model
We compiled a dataset of 97 variables in four groups, bioclimatic (24),
topographic (6), vegetation (12), and soil properties (55), using QGIS
(Table S1). In addition to the watershed, 19 predictors were selected
from this initial pool of 97 variables. Table S2 shows the selection
results after multinomial logistic regression and collinearity
screening. In these cases, we always retained the values for the first
layer.
To select the most suitable predictor variables to include in our
machine learning model, we hierarchically screened each variable in our
dataset. We iteratively performed multinomial logistic regressions to
assess the effect of each predictor variable on clade. We assessed the
relationships among predictor variables using correlation matrices
(Pearson’s and Kendall’s), and assessed multicollinearity using the
Variance Inflation Factor.
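
The collinearity part of this screening can be illustrated with a short Python sketch. It covers only Pearson correlation and the Variance Inflation Factor; Kendall's correlation and the multinomial logistic regressions are not shown. It uses the identity that, for a set of predictors, the VIF of predictor j equals the j-th diagonal element of the inverse of their Pearson correlation matrix. The variable values and function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Pairwise Pearson correlations between predictor columns."""
    return [[pearson(a, b) for b in columns] for a in columns]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: perfectly collinear predictors")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF per predictor: diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]


# Hypothetical predictors: x2 is almost a multiple of x1, x3 is unrelated.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
x3 = [5.0, 1.0, 4.0, 2.0, 6.0, 3.0]
vif_all = vif([x1, x2, x3])
vif_pair = vif([x1, x3])
```

In this toy example the nearly collinear pair `x1` and `x2` receives very large VIF values, which is the pattern a collinearity screen is meant to flag.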
The selected predictor variables and watershed predictors constituted
the final dataset to investigate patterns in clade distribution using an
Extra-Trees classifier (Extremely Randomized Trees). We aimed to improve
on the predictive ability of the previous analysis (watershed subdivision
as the sole predictor), and therefore used the same accuracy metrics as
in that analysis. Hyperparameters were optimized by grid search with
cross-validation, though we also considered bootstrapping and accuracy
estimation on the out-of-bag samples. Relative feature
importance was calculated for the best model.
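
A workflow of this kind can be sketched with scikit-learn as follows. This is an illustrative example under stated assumptions, not the configuration used in the study: the data are synthetic, and the hyperparameter grid, number of folds, train ratio and random seeds are hypothetical choices made only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the predictor table: 20 predictors, 3 clades.
# None of these values come from the study.
X, y = make_classification(n_samples=300, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0)

# Hypothetical grid; the grid actually searched is not given in the text.
param_grid = {
    "n_estimators": [50, 100],
    "max_features": ["sqrt", None],
    "min_samples_leaf": [1, 3],
}
search = GridSearchCV(ExtraTreesClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Evaluate the best model with the same kind of error matrix metrics.
best = search.best_estimator_
y_pred = best.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)

# Relative feature importance of the best model (normalized to sum to 1).
importances = best.feature_importances_

# Alternative estimate: out-of-bag accuracy, which requires bootstrapping.
oob_model = ExtraTreesClassifier(bootstrap=True, oob_score=True,
                                 random_state=0, **search.best_params_)
oob_model.fit(X_train, y_train)
oob_accuracy = oob_model.oob_score_
```

Extra-Trees does not bootstrap by default in scikit-learn, so the out-of-bag estimate is only available when `bootstrap=True` is set explicitly, as in the last step.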