Validation and Comparison
Once the models were trained, discriminatory capability was assessed using the previously unseen validation data. The performance of the full ensemble model (400 algorithms) was compared with that of each algorithm type individually (100 algorithms each) and with that of a single logistic regression model. Model performance was assessed using the area under the receiver operating characteristic curve (AUROC), the net reclassification index (NRI), and decision curve analysis (DCA). Calibration was evaluated visually by plotting predicted risk, from models fit on the training cohort, against observed risk in the validation cohort, stratified by decile of predicted risk. A two-sided p-value of less than 0.05 was considered significant for all comparisons. All models were trained in Python using Keras with TensorFlow.13 Performance outcome comparisons were conducted with Stata (StataCorp. 2015. Stata Statistical Software: Release 14. College Station, TX: StataCorp LP).
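For readers who wish to reproduce these metrics, the following is a minimal Python sketch of three of them: AUROC, calibration by decile of predicted risk, and the net benefit that underlies a decision curve. It is illustrative only and is not the code used in this study, where the comparisons were run in Stata. The function names (`auroc`, `calibration_by_decile`, `net_benefit`) and the toy data are hypothetical, binary outcomes are assumed to be coded 0/1, and NRI is omitted because it requires a choice of risk categories that is not specified here.

```python
from statistics import mean


def auroc(y_true, scores):
    """AUROC as the probability that a randomly chosen positive case
    scores higher than a randomly chosen negative case (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def calibration_by_decile(y_true, probs, n_bins=10):
    """Sort cases by predicted risk, split them into n_bins groups of
    near-equal size, and return (mean predicted, observed rate) per group."""
    pairs = sorted(zip(probs, y_true))
    n = len(pairs)
    rows = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if chunk:  # skip empty groups when n < n_bins
            rows.append((mean(p for p, _ in chunk),
                         mean(y for _, y in chunk)))
    return rows


def net_benefit(y_true, probs, threshold):
    """Net benefit at one threshold probability pt:
    TP/n - (FP/n) * pt / (1 - pt). A decision curve plots this over pt."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


# Toy data (hypothetical, not from the study cohort).
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
print(auroc(y, p))  # 0.75
```

Plotting the pairs returned by `calibration_by_decile` against the line of identity gives a calibration plot of the kind described above, and evaluating `net_benefit` over a grid of thresholds gives the points of a decision curve. The pairwise AUROC computation is quadratic in cohort size, so a rank-based implementation would be preferable for large validation sets.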