Machine learning classifier comparison
We next applied the sequential forward selection (SFS) method to select
variables collected from the baseline visit until 6 months after the
baseline ESS for the three classifiers: random forest, logistic
regression and gradient boosting. The AUC and F1-score values of
the trained classifiers were averaged over 10 reformations of the training
and test folds (Figure S1A). Performance increased rapidly at first and
then reached a plateau as the number of variables grew
(Figure 1, Tables S5, S6). For the logistic regression classifier, the
highest average AUC (.746) and the highest F1-score (.404) were achieved
with six and eleven variables, respectively. For the gradient boosting
classifier, the highest AUC (.745) was achieved with twelve variables and
the highest F1-score (.407) with three variables. For the random forest
classifier, the highest AUC (.747) was achieved with fifteen variables and
the highest F1-score (.409) with twelve variables.
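The greedy selection step underlying SFS can be illustrated with a minimal sketch. It is not the implementation used in this study: the function `forward_select`, the variable names `var_a` to `var_d` and the scoring function `toy_score` are hypothetical, and the toy gains are made-up numbers rather than study data. In practice the scoring function would return, for example, the AUC of a classifier averaged over the reformed training and test folds.

```python
from typing import Callable, List, Sequence, Tuple


def forward_select(
    candidates: Sequence[str],
    score_fn: Callable[[List[str]], float],
    max_features: int,
) -> List[Tuple[str, float]]:
    """Greedy forward selection: at each step, add the candidate variable
    that gives the highest score when joined to the already selected set.
    Returns the (variable, score) pairs in order of selection."""
    selected: List[str] = []
    history: List[Tuple[str, float]] = []
    remaining = list(candidates)
    while remaining and len(selected) < max_features:
        best_name, best_score = None, float("-inf")
        for name in remaining:
            score = score_fn(selected + [name])
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
        remaining.remove(best_name)
        history.append((best_name, best_score))
    return history


# Hypothetical per-variable gains (illustrative only, NOT study results).
TOY_GAIN = {"var_a": 0.02, "var_b": 0.20, "var_c": 0.01, "var_d": 0.03}


def toy_score(subset: List[str]) -> float:
    # Stand-in for a cross-validated AUC; simply adds up the toy gains.
    return 0.5 + sum(TOY_GAIN[v] for v in subset)


history = forward_select(list(TOY_GAIN), toy_score, max_features=3)
selection_order = [name for name, _ in history]
```

Recording the score after each added variable gives the kind of performance-versus-number-of-variables curve described above, from which the plateau can be read.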
The best variable selected by SFS (e.g. the one giving the highest AUC) in
each run was given 15 points, the next best variable 14 points, and so on.
A rank score (ranging from 0 to 150 points) was formed for each variable by
summing its points over the 10 reformations of training and test folds
(see Eq. S1 in this article’s supporting information). When
using any of the three classifiers, the following variables had the
highest rank scores and were thus the most important predictors: the
number of visits 6 months after the baseline ESS, CRSwNP, NERD and
asthma (Table 3). When using the logistic regression classifier, the
visit frequency from baseline visit to baseline ESS and the time
between baseline visit and baseline ESS were also important (Table 3).
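The rank score described above can be sketched as follows. This is an illustration based on the verbal description only, not a reproduction of Eq. S1: the function name `rank_scores`, the parameter `top_points` and the example variable names are hypothetical, and it is assumed that a variable not selected in a run receives 0 points for that run.

```python
from collections import defaultdict
from typing import Dict, List


def rank_scores(runs: List[List[str]], top_points: int = 15) -> Dict[str, int]:
    """Sum the points of each variable over all runs.

    Each run is the list of variables in SFS selection order (best first).
    The first variable gets `top_points`, the next one point less, and so on;
    points never go below 0. Variables absent from a run get nothing for it.
    """
    scores: Dict[str, int] = defaultdict(int)
    for order in runs:
        for position, variable in enumerate(order):
            scores[variable] += max(top_points - position, 0)
    return dict(scores)


# Two hypothetical runs with three variables each (illustrative only).
example_runs = [
    ["var_a", "var_b", "var_c"],
    ["var_b", "var_a", "var_c"],
]
example_scores = rank_scores(example_runs)
# var_a: 15 + 14 = 29, var_b: 14 + 15 = 29, var_c: 13 + 13 = 26
```

With 10 runs and 15 points for the top position, a variable ranked first in every run reaches the maximum of 150 points, which matches the stated range of the rank score.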