Feature Importance and Dimensionality Reduction
To improve the interpretability of the model, we first performed a univariate estimate of feature importance. A random forest classifier was trained with 10-fold cross-validation repeated three times, and univariate feature importance was estimated using the mean decrease in accuracy. Three separate feature importance models were developed: first with continuous variables only (N=37), then with categorical variables only (N=195), and finally with all variables (N=232). This was done to account for the known tendency of random forest classifiers to weight continuous variables more heavily in feature importance estimates. The 20 most important features from each model were combined into a single dataset, consisting of 47 variables after excluding duplicates. This variable set was subsequently used to develop the machine learning algorithms. Overall, this feature selection methodology combines the strengths of manual filtering based on clinical acumen with automated machine learning techniques. Reducing the input space for the final machine learning model is a critical step that has been shown to improve overall model performance by decreasing the tendency to overfit and increasing training efficiency.12
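The procedure above can be sketched as follows. This is a minimal illustration, not the authors' actual implementation: it assumes scikit-learn, uses permutation importance as the analogue of mean decrease in accuracy, and assumes categorical variables have already been numerically encoded. The function name `top_features` and all parameter values are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import RepeatedStratifiedKFold


def top_features(X, y, k=20, n_estimators=100, seed=0):
    """Rank features of DataFrame X by mean permutation importance
    (a proxy for mean decrease in accuracy) across 10-fold
    cross-validation repeated three times; return the top k names."""
    y = np.asarray(y)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=seed)
    scores = np.zeros(X.shape[1])
    for train, test in cv.split(X, y):
        rf = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
        rf.fit(X.iloc[train], y[train])
        # Importance measured on the held-out fold, as in out-of-sample
        # mean-decrease-in-accuracy estimates.
        imp = permutation_importance(rf, X.iloc[test], y[test],
                                     n_repeats=5, random_state=seed)
        scores += imp.importances_mean
    scores /= cv.get_n_splits()
    return X.columns[np.argsort(scores)[::-1][:k]]


# The three top-20 lists (continuous-only, categorical-only, all
# variables) would then be merged with duplicates removed, e.g.:
#   selected = pd.Index(top_cont).union(top_cat).union(top_all)
```

Running the ranking separately per variable type, then pooling, mirrors the paper's guard against the random forest's bias toward continuous predictors.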