Feature Importance and Dimensionality Reduction
To improve model interpretability, we first estimated feature importance.
A random forest classifier was trained with 10-fold cross-validation
repeated three times, and the importance of each feature was estimated
using the mean decrease in accuracy. Three separate feature importance
models were developed: first with continuous variables only (N=37), then
with categorical variables only (N=195), and finally with all variables
(N=232).
This was done to account for the known tendency of random forest
classifiers to weight continuous variables more heavily in feature
importance estimates. The 20 most important features from each model
were combined into a single dataset, consisting of 47 variables after
excluding duplicates. This variable set was subsequently used to develop
the machine learning algorithms. Overall, this feature-selection
methodology combines the strengths of manual filtering grounded in
clinical acumen with automated machine learning techniques. Reducing the
input space of the final machine learning model is a critical step that
has been shown to improve overall model performance by reducing the
tendency to overfit and increasing training efficiency.12