Validation and Comparison
Once the models were trained, discriminatory capability was assessed
using the previously unseen validation data. The performance of the full
ensemble model (400 algorithms) was compared to that of each type of
algorithm individually (100 algorithms each), as well as to a single
logistic regression model. Model performance was assessed using the area
under the receiver operating characteristic curve (AUROC), net
reclassification index (NRI), and decision curve analysis (DCA).
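
As an illustration of the three metrics named above, the following is a minimal, dependency-free sketch and not the analysis code used in this study (performance comparisons were run in Stata). The function names, the toy data, the single risk threshold of 0.5, and the choice of a two-category NRI are all assumptions made for the example; the text does not specify which NRI variant or thresholds were used.

```python
def auroc(y_true, y_score):
    """AUROC as the probability that a random event outranks a random
    non-event (Mann-Whitney formulation); ties count as one half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def categorical_nri(y_true, old_score, new_score, threshold):
    """Two-category NRI: net upward reclassification among events plus
    net downward reclassification among non-events."""
    nri = 0.0
    for label, sign in ((1, 1), (0, -1)):
        pairs = [(o >= threshold, n >= threshold)
                 for y, o, n in zip(y_true, old_score, new_score)
                 if y == label]
        up = sum(1 for o, n in pairs if n and not o)
        down = sum(1 for o, n in pairs if o and not n)
        nri += sign * (up - down) / len(pairs)
    return nri


def net_benefit(y_true, y_score, threshold):
    """Net benefit at one risk threshold, i.e. one point on a decision curve."""
    n = len(y_true)
    tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


# Hypothetical toy data: outcomes and predicted risks from two models.
y = [0, 0, 1, 1]
baseline = [0.10, 0.40, 0.35, 0.80]   # e.g. a single logistic regression
ensemble = [0.10, 0.30, 0.60, 0.80]   # e.g. the full ensemble

print(auroc(y, baseline), auroc(y, ensemble))
print(categorical_nri(y, baseline, ensemble, threshold=0.5))
print(net_benefit(y, ensemble, threshold=0.5))
```

A full decision curve would repeat the net benefit calculation over a range of thresholds and compare each model against the treat-all and treat-none strategies.
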
Calibration of the model was evaluated using visual plots of predicted
risk based on the training cohort versus observed risk in the validation
cohort, stratified by decile of risk. A two-sided p-value of less than 0.05
was considered significant for all comparisons. All models were trained
in Python using Keras with TensorFlow [13]. Performance
outcome comparisons were conducted with Stata (StataCorp. 2015. Stata
Statistical Software: Release 14. College Station, TX: StataCorp LP).
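
The decile-based calibration check described above can be sketched as follows. This is an illustrative example under stated assumptions, not the study's code: the function name calibration_by_decile and the toy data are hypothetical, and bins are formed by sorting on predicted risk and splitting into ten nearly equal groups.

```python
def calibration_by_decile(y_true, y_pred, n_bins=10):
    """Return (mean predicted risk, observed event rate) for each bin,
    with bins ordered from lowest to highest predicted risk."""
    order = sorted(range(len(y_pred)), key=lambda i: y_pred[i])
    rows = []
    for b in range(n_bins):
        # Slice the risk-sorted indices into n_bins nearly equal groups.
        lo = b * len(order) // n_bins
        hi = (b + 1) * len(order) // n_bins
        idx = order[lo:hi]
        if not idx:
            continue  # fewer observations than bins
        mean_pred = sum(y_pred[i] for i in idx) / len(idx)
        obs_rate = sum(y_true[i] for i in idx) / len(idx)
        rows.append((mean_pred, obs_rate))
    return rows


# Hypothetical validation data: 20 patients with evenly spaced predicted risks.
pred = [i / 20 for i in range(20)]
obs = [0] * 10 + [1] * 10

for mean_pred, obs_rate in calibration_by_decile(obs, pred):
    print(f"predicted {mean_pred:.3f}  observed {obs_rate:.3f}")
```

Plotting the resulting pairs against the line of identity gives the visual calibration plot; points near the diagonal indicate that predicted and observed risks agree.
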