Model Training and Performance Comparison
A total of 400 algorithms were trained on varying subsets of the training
data, generated by randomly stratified levels of over- and under-sampling of
the training dataset (Figure 2). The final ensemble ML model
contained all 400 underlying algorithms; for comparison, smaller ensemble
models were also built by combining the 100 iterations of each individual
algorithm type. The complete ensemble ML model performed best
(Figure 3), outperforming all
other models with an AUROC of 0.764 (95% CI, 0.745-0.782)
(p<0.001). By comparison, the single logistic regression
model had an AUROC of 0.649 (95% CI, 0.628-0.670). Additionally, the
final ensemble ML model demonstrated an improvement of 72.9% ± 3.8%
(p<0.001) in predictive performance over logistic regression, as assessed
by the net reclassification index. Decision curve analysis showed that the
final ensemble method improved risk prediction across the entire spectrum
of predicted risk compared with all other models (Figure 4, p<0.001).
The final ensemble ML
model was well calibrated: after stratification into deciles of predicted
risk, the majority of observed risk in the validation cohort fell within
the range of risk predicted from the training cohort
(Supplemental Figure 2).
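
For readers less familiar with the evaluation metrics named above, the following is a minimal Python sketch of three of them: AUROC (computed from pairwise rankings), net benefit at a single risk threshold (the quantity plotted in a decision curve), and observed versus predicted risk by decile (the basis of a calibration plot). It is not the study's implementation. The function names, the toy labels, and the toy predicted risks are all hypothetical and chosen only for illustration, and confidence intervals and significance tests are omitted.

```python
from typing import List, Tuple


def auroc(y_true: List[int], y_prob: List[float]) -> float:
    """AUROC as the fraction of (event, non-event) pairs ranked correctly."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    # Ties between an event and a non-event count as half a correct ranking.
    wins = sum(
        1.0 if pp > pn else 0.5 if pp == pn else 0.0
        for pp in pos
        for pn in neg
    )
    return wins / (len(pos) * len(neg))


def net_benefit(y_true: List[int], y_prob: List[float], threshold: float) -> float:
    """Net benefit of flagging everyone with predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * (threshold / (1.0 - threshold))


def calibration_by_decile(
    y_true: List[int], y_prob: List[float], n_bins: int = 10
) -> List[Tuple[float, float]]:
    """Return (mean predicted risk, observed event rate) for each risk bin."""
    pairs = sorted(zip(y_prob, y_true))  # order observations by predicted risk
    n = len(pairs)
    rows = []
    for b in range(n_bins):
        # Integer bin edges keep the bins as close to equal size as possible.
        lo, hi = b * n // n_bins, (b + 1) * n // n_bins
        chunk = pairs[lo:hi]
        if not chunk:
            continue  # skip empty bins when there are fewer rows than bins
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        observed = sum(y for _, y in chunk) / len(chunk)
        rows.append((mean_pred, observed))
    return rows


# Hypothetical toy data (NOT study data): 20 patients with evenly spaced risks.
toy_prob = [i / 20 + 0.025 for i in range(20)]
toy_true = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1]

if __name__ == "__main__":
    print("AUROC:", auroc(toy_true, toy_prob))
    print("Net benefit at 0.5:", net_benefit(toy_true, toy_prob, 0.5))
    for mean_pred, observed in calibration_by_decile(toy_true, toy_prob):
        print(f"predicted={mean_pred:.3f} observed={observed:.3f}")
```

A decision curve is obtained by evaluating `net_benefit` over a grid of thresholds for each model and comparing the resulting curves with the "treat all" and "treat none" strategies. A well-calibrated model yields decile rows in which the observed event rate tracks the mean predicted risk.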