Step 1: Train Test Split  

In machine learning, we always split our data into the training dataset and test dataset. The purpose of splitting into distinct datasets is to evaluate the performance of machine learning models on unseen data. A common approach to split the train and test dataset by certain percentages, for example (80% train, 20% test) or (75% train, 25% test). In our model training framework, as we obtained electronic health records of over 15,000 Clostridioides difficile infection (CDI) episodes using the Clinical Data Analysis and Reporting System (CDARS), a well-established electronic database managed by the Hong Kong Hospital Authority, patient records of 41 Hong Kong public hospitals are available in our study. In our model training framework, we reserved several hospital institutions as an external validation set to evaluate the test performance of the fit machine learning models.