Arthrobotrys oligospora
ITS, TUB, TEF and RPB2 fragments were sequenced for the strains of Arthrobotrys oligospora, which were found at every sampling site. The four sequences from each strain were combined in a text file and converted to FASTA format. MAFFT version 7 was used to generate the multiple sequence alignment (aligning homologous positions across sequences), and BioEdit was used to refine the alignment manually. We used jModelTest to select the best-fit substitution model for the concatenated sequence. The phylogenetic tree was constructed with the Maximum Likelihood (ML) method as a partitioned analysis in IQ-TREE version 1.6.5. We used FigTree version 1.3.1, Microsoft Word (Microsoft Office 2007) and Photoshop (Adobe Photoshop CS5 V12.0) to view and edit the phylogenetic tree.
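The concatenation step can be illustrated with a minimal sketch. It assumes each locus has already been aligned separately and is held in memory as a dictionary; the function names, the strain names and the toy sequences are hypothetical and are not taken from the study. The sketch joins the four loci per strain and records the coordinates of each locus, which is the information a partitioned ML analysis needs.

```python
def concatenate_loci(alignments, loci):
    """Join per-locus alignments into one sequence per strain.

    alignments: {locus: {strain: aligned_sequence}} (hypothetical layout).
    Returns the concatenated records and 1-based partition ranges.
    """
    # Keep only strains that have a sequence for every locus.
    strains = sorted(set.intersection(*(set(alignments[l]) for l in loci)))
    partitions, start = [], 1
    for locus in loci:
        lengths = {len(alignments[locus][s]) for s in strains}
        if len(lengths) != 1:
            raise ValueError(f"{locus}: sequences are not aligned to one length")
        length = lengths.pop()
        partitions.append((locus, start, start + length - 1))
        start += length
    records = {s: "".join(alignments[l][s] for l in loci) for s in strains}
    return records, partitions


def to_fasta(records):
    """Serialize {name: sequence} as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records.items())


# Toy data for illustration only.
loci = ["ITS", "TUB", "TEF", "RPB2"]
toy = {
    "ITS": {"s1": "ACGT", "s2": "AC-T"},
    "TUB": {"s1": "GG", "s2": "GA"},
    "TEF": {"s1": "TTT", "s2": "TTC"},
    "RPB2": {"s1": "A", "s2": "A"},
}
records, partitions = concatenate_loci(toy, loci)
fasta_text = to_fasta(records)
```

The `partitions` list holds one `(locus, start, end)` tuple per gene, which can be written out in whatever partition-file format the tree-building software expects.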
1.5.3 Biogeographical distribution of NTF in Yunnan
A first prediction was performed using the watersheds as the sole predictor of the distribution of Arthrobotrys oligospora phylogenetic clades. We assigned clades to watersheds using the majority rule (each watershed received the clade most frequent among its samples) and assessed classification accuracy using error matrix metrics.
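As a rough illustration of this baseline, the sketch below assigns each watershed its most frequent clade and builds an error (confusion) matrix from observed and predicted clades. The watershed labels, clade labels and function names are hypothetical, and only overall accuracy is computed; the study may have used additional error matrix metrics.

```python
from collections import Counter, defaultdict


def majority_rule(watersheds, clades):
    """Map each watershed to the clade observed most often within it."""
    counts = defaultdict(Counter)
    for watershed, clade in zip(watersheds, clades):
        counts[watershed][clade] += 1
    # Ties are broken by first occurrence, an arbitrary choice in this sketch.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}


def error_matrix(observed, predicted):
    """Rows are observed clades, columns are predicted clades."""
    labels = sorted(set(observed) | set(predicted))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for obs, pred in zip(observed, predicted):
        matrix[index[obs]][index[pred]] += 1
    return labels, matrix


def overall_accuracy(matrix):
    """Share of samples on the diagonal of the error matrix."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


# Toy data for illustration only.
watersheds = ["W1", "W1", "W1", "W2", "W2", "W3"]
clades = ["A", "A", "B", "B", "B", "A"]

mapping = majority_rule(watersheds, clades)
predicted = [mapping[w] for w in watersheds]
labels, matrix = error_matrix(clades, predicted)
accuracy = overall_accuracy(matrix)
```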
From the original dataset, we generated 45 different train/test splits, with training ratios from 30% to 70% in steps of 5% (nine ratios). Each ratio included five different mixes of train and test samples. Voronoi (Thiessen) polygons were computed around the points of each training set, and each polygon was assigned the clade of its originating point. The polygons were then used to predict the clades of the test dataset.
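The splitting scheme and the polygon-based prediction can be sketched as follows. By definition, a point inside a Voronoi cell is closer to that cell's seed than to any other seed, so the sketch predicts each test point from its nearest training point rather than building polygon geometries. The random shuffling, the seed, the planar (projected) coordinates and all names are assumptions made for illustration.

```python
import math
import random


def make_splits(n_samples, ratios=range(30, 75, 5), mixes=5, seed=0):
    """Return (train_percent, train_indices, test_indices) for every split."""
    rng = random.Random(seed)
    splits = []
    for percent in ratios:  # 30, 35, ..., 70
        n_train = round(n_samples * percent / 100)
        for _ in range(mixes):
            order = list(range(n_samples))
            rng.shuffle(order)
            splits.append((percent, order[:n_train], order[n_train:]))
    return splits


def voronoi_predict(train_xy, train_clades, test_xy):
    """Give each test point the clade of its nearest training point."""
    predictions = []
    for x, y in test_xy:
        nearest = min(
            range(len(train_xy)),
            key=lambda i: math.hypot(x - train_xy[i][0], y - train_xy[i][1]),
        )
        predictions.append(train_clades[nearest])
    return predictions


# Toy check for illustration only.
splits = make_splits(20)
toy_predictions = voronoi_predict(
    [(0.0, 0.0), (10.0, 10.0)], ["A", "B"], [(1.0, 1.0), (9.0, 8.0)]
)
```

With nine ratios and five mixes per ratio, `make_splits` returns the 45 splits described in the text.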
1.5.4 Multivariate machine learning model
We compiled a dataset of 97 variables in four groups, bioclimatic (24), topographic (6), vegetation (12), and soil properties (55), using QGIS (Table S1). In addition to the watershed, 19 predictors were selected from this initial pool of 97 variables. Table S2 shows the selection results after multinomial logistic regression and collinearity screening. In these cases, we always retained the values for the first layer.
To select the most suitable predictor variables to include in our machine learning model, we hierarchically screened each variable in our dataset. We iteratively performed multinomial logistic regressions to assess the effect of each predictor variable on clade. We assessed the relationships among predictor variables using correlation matrices (Pearson’s and Kendall’s), and assessed multicollinearity using the Variance Inflation Factor (VIF).
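The collinearity part of this screening can be sketched with NumPy. The VIF of predictor j equals 1 / (1 - R²_j), where R²_j comes from regressing predictor j on all the others; this is also the j-th diagonal element of the inverse Pearson correlation matrix, which is what the sketch computes. The variable names, the synthetic data and the threshold of 10 (a common rule of thumb) are assumptions and are not values reported in the study. Kendall's correlation and the multinomial regressions are not shown.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return one VIF per column of X (rows are samples, columns predictors)."""
    corr = np.corrcoef(X, rowvar=False)  # Pearson correlation matrix
    return np.diag(np.linalg.inv(corr))  # diagonal of the inverse gives the VIFs


# Synthetic predictors for illustration only.
rng = np.random.default_rng(0)
elevation = rng.normal(size=200)
rainfall = rng.normal(size=200)
# Built to be almost perfectly (negatively) correlated with elevation.
temperature = -elevation + 0.05 * rng.normal(size=200)

names = ["elevation", "rainfall", "temperature"]
X = np.column_stack([elevation, rainfall, temperature])

vif = variance_inflation_factors(X)
VIF_THRESHOLD = 10.0  # assumed rule-of-thumb cutoff, not from the study
flagged = [name for name, value in zip(names, vif) if value > VIF_THRESHOLD]
```

Here the two near-duplicate predictors are flagged while the independent one is not; in practice one member of each collinear group would be kept.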
The selected predictor variables and the watershed predictor constituted the final dataset used to investigate patterns in clade distribution with an Extra-Trees (Extremely Randomized Trees) classifier. We aimed to improve on the predictive ability of the previous analysis (watershed subdivision as the sole predictor), and therefore used the same accuracy metrics as in that analysis. Hyperparameters were optimized using grid search with cross-validation, though we also considered bootstrapping and accuracy estimates on the out-of-bag samples. Relative feature importance was calculated for the best model.
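A minimal sketch of this kind of workflow with scikit-learn is given below. It assumes scikit-learn as the implementation, uses synthetic data in place of the real predictors, and searches a small hypothetical parameter grid; none of the grid values, the number of folds or the scoring choice are taken from the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in: 20 features and 3 classes, for illustration only.
X, y = make_classification(
    n_samples=150, n_features=20, n_informative=6, n_classes=3, random_state=0
)

# Hypothetical grid; the values searched in the study are not known.
param_grid = {
    "n_estimators": [50, 100],
    "max_features": ["sqrt", None],
    "min_samples_leaf": [1, 3],
}

search = GridSearchCV(
    ExtraTreesClassifier(random_state=0),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X, y)

best_model = search.best_estimator_
# Impurity-based importances, normalized to sum to 1 across features.
importances = best_model.feature_importances_
ranking = sorted(range(X.shape[1]), key=lambda i: importances[i], reverse=True)
```

Out-of-bag accuracy, mentioned in the text as an alternative, would additionally require `bootstrap=True` and `oob_score=True` on the classifier, since Extra-Trees does not bootstrap by default.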