Machine learning classifier comparison
We next applied the sequential forward selection (SFS) method to select variables collected from the baseline visit until 6 months after the baseline ESS for the three classifiers: random forest, logistic regression and gradient boosting. The AUC and F1-score values of the trained classifiers were averaged over 10 reformations of the training and test folds (Figure S1A). Performance first increased rapidly with the number of variables and then reached a plateau (Figure 1, Tables S5, S6). For the logistic regression classifier, the highest average AUC (.746) and the highest F1-score (.404) were achieved with six and eleven variables, respectively. For the gradient boosting classifier, the highest AUC (.745) was achieved with twelve variables and the highest F1-score (.407) with three variables. For the random forest classifier, the highest AUC (.747) was achieved with fifteen variables and the highest F1-score (.409) with twelve variables.
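For readers unfamiliar with SFS, the greedy procedure can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function name `sequential_forward_selection`, the callable `score_fn` and the toy weights are hypothetical. In practice, `score_fn` would train a classifier on the given variable subset and return a cross-validated metric such as AUC.

```python
def sequential_forward_selection(candidates, score_fn, max_features):
    """Greedy SFS: add, one at a time, the variable that most improves the score.

    candidates   -- list of variable names to choose from
    score_fn     -- hypothetical callable taking a list of variable names and
                    returning a performance value (higher is better)
    max_features -- maximum number of variables to select
    Returns a list of (variable, score) pairs in the order of selection.
    """
    selected = []
    history = []
    remaining = list(candidates)
    while remaining and len(selected) < max_features:
        best_var, best_score = None, float("-inf")
        # Try each remaining variable together with those already selected.
        for var in remaining:
            score = score_fn(selected + [var])
            if score > best_score:
                best_var, best_score = var, score
        selected.append(best_var)
        remaining.remove(best_var)
        history.append((best_var, best_score))
    return history


# Toy stand-in for a model-based score (assumption: purely illustrative weights).
TOY_WEIGHTS = {"visits": 3, "CRSwNP": 2, "asthma": 1, "noise": 0}

def toy_score(variables):
    return sum(TOY_WEIGHTS[v] for v in variables)

history = sequential_forward_selection(list(TOY_WEIGHTS), toy_score, max_features=3)
selection_order = [var for var, _ in history]
```

The recorded selection order is what a rank-based importance score can later be built from, and plotting the scores in `history` against the number of selected variables gives the kind of performance curve described above.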
In each run, the best variable selected by SFS (e.g. the one yielding the highest AUC) was given 15 points, the next best variable 14 points, and so on. A rank score (ranging from 0 to 150 points) was formed for each variable by summing its points over the 10 reformations of training and test folds (see Eq. S1 in this article’s supporting information). With any of the three classifiers, the following variables had the highest rank scores and were thus the most important predictors: the number of visits 6 months after the baseline ESS, CRSwNP, NERD and asthma (Table 3). With the logistic regression classifier, the visit frequency from baseline visit to baseline ESS and the time between baseline visit and baseline ESS were also important (Table 3).
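The rank-score arithmetic described above can be illustrated with a short sketch. Eq. S1 is not reproduced here, so the code assumes the scheme exactly as stated in the text: 15 points for the first variable selected in a run, one point fewer for each subsequent position, no points for unselected variables, and a sum over runs. The function name `rank_scores` and the example variable orders are hypothetical.

```python
from collections import defaultdict

MAX_POINTS = 15  # points given to the first variable selected in a run


def rank_scores(selection_orders, max_points=MAX_POINTS):
    """Sum positional points per variable over all runs.

    selection_orders -- one list per run, giving variables in SFS selection order
    Variables beyond position `max_points`, or never selected, receive 0 points.
    """
    scores = defaultdict(int)
    for order in selection_orders:
        for position, variable in enumerate(order[:max_points]):
            scores[variable] += max_points - position
    return dict(scores)


# Illustrative input only: 10 runs with made-up selection orders.
example_runs = [["visits", "CRSwNP", "NERD"]] * 6 + [["CRSwNP", "visits", "asthma"]] * 4
example_scores = rank_scores(example_runs)
```

Under this scheme, a variable selected first in all 10 runs would reach the maximum of 150 points, consistent with the 0 to 150 range stated above.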