Classification results

To test the classification capability of our AIRFIHA system, a test set was first constructed by randomly selecting 100 cells from four leukocyte types, i.e., monocytes, granulocytes, B lymphocytes, and T lymphocytes. Notably, the cells in the test set were not included in the training set. The classification results were evaluated using recall, precision, and F1-score\cite{RN46}. The F1-score, which is the harmonic mean of recall and precision, is used to characterize the final classification result. The F1-scores from the first classifier for monocytes, granulocytes, and lymphocytes are 94.0%, 95.4%, and 97.7%, respectively (detailed numerical values for recall, precision, and F1-score are provided in Table S3). The F1-scores from the second classifier for B and T lymphocytes are 88.2% and 88.8%, respectively (detailed numerical values for recall, precision, and F1-score are provided in Table S4). The overall detection results are summarized and visualized in Figure 4a and Table S5. The precision-recall curves\cite{RN47} for each classifier in the cascaded-ResNet are shown in Figures 4b and 4c. The values of the area under the precision-recall curve (AUPRC) for lymphocytes, monocytes, and granulocytes in the first classifier are 1.00, 0.98, and 0.98, respectively. The AUPRC values for B and T lymphocytes in the second classifier are 0.96 and 0.94, respectively. Our B/T cell classification accuracy is comparable to that of the method based on 3D quantitative phase imaging (note that the leukocytes here were from one mouse, which could affect the accuracy)\cite{RN38}.
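For readers who wish to reproduce the per-class metrics, the following is a minimal sketch of how recall, precision, and F1-score can be derived from a confusion matrix. It is not the evaluation code used in this work. The function name `per_class_metrics` and the counts in `example_confusion` are hypothetical and chosen only for illustration; they are not the values reported in Tables S3 to S5.

```python
def per_class_metrics(confusion, labels):
    """Return precision, recall, and F1 for each class.

    confusion[i][j] is the number of cells whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    n = len(labels)
    metrics = {}
    for k, name in enumerate(labels):
        tp = confusion[k][k]                                  # correctly predicted as class k
        fn = sum(confusion[k]) - tp                           # class k cells predicted as another class
        fp = sum(confusion[i][k] for i in range(n)) - tp      # other cells predicted as class k
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        # F1 is the harmonic mean of precision and recall.
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        metrics[name] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Hypothetical toy counts (NOT data from this study): rows are true classes,
# columns are predicted classes, in the order given by `labels`.
labels = ["monocyte", "granulocyte", "lymphocyte"]
example_confusion = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 1, 9],
]

if __name__ == "__main__":
    for name, m in per_class_metrics(example_confusion, labels).items():
        print(f"{name}: precision={m['precision']:.3f}, "
              f"recall={m['recall']:.3f}, F1={m['f1']:.3f}")
```

The same function applies unchanged to a two-class matrix, such as one for the B/T lymphocyte classifier, by passing a 2 x 2 matrix and the two corresponding labels.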