After the data preprocessing steps, the goal is to determine which algorithm attains the highest test accuracy. To train the logistic regression, random forest, and gradient boosting machine models, we first need to identify the best set of hyperparameters for each algorithm. The best hyperparameter set can be found using either grid search or randomized search. Grid search enumerates every combination in the hyperparameter space and repeats the model training process for each one. Randomized search is more efficient: a fixed number of iterations is specified, and a parameter combination is sampled at random in each iteration. A total of 1,000 iterations was carried out to determine the best hyperparameter set. K-fold cross validation is then performed to check whether the selected hyperparameters generalize well to unseen data. In K-fold cross validation, the data is divided into k equal subsets; the model is trained on k-1 subsets and validated on the remaining one, and the process is repeated k times until every subset has served as the validation set exactly once. All 5 algorithms were each trained 50 times to develop the best models with the best hyperparameter sets.
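The two procedures above can be sketched in plain Python. This is a minimal illustration only — the function names and the toy parameter space are hypothetical, and in practice a library such as scikit-learn (`RandomizedSearchCV`, `KFold`) would be used instead:

```python
import random

def k_fold_indices(n_samples, k):
    """Split sample indices into k roughly equal, shuffled folds."""
    idx = list(range(n_samples))
    random.shuffle(idx)
    fold_size = n_samples // k
    folds = [idx[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    # Distribute any remainder across the first folds.
    for j, extra in enumerate(idx[k * fold_size:]):
        folds[j].append(extra)
    return folds

def cross_validate(train_and_score, data, k=5):
    """Train on k-1 folds and validate on the held-out fold, k times;
    return the mean validation score."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for i in range(k):
        valid = [data[j] for j in folds[i]]
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(train_and_score(train, valid))
    return sum(scores) / k

def randomized_search(param_space, evaluate, n_iter=1000, seed=0):
    """Sample n_iter random hyperparameter combinations (rather than
    enumerating the full grid) and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

In this sketch `evaluate` would wrap `cross_validate` for a given model, so each sampled hyperparameter combination is scored by its mean validation performance across the k folds.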
The 5 algorithms are: