3 Results
A total of 124 patients with HG dysplasia and invasive carcinoma and 92 patients with LG dysplasia were selected for this study. Because a patient may present multiple lesions, these corresponded to 168 lesions among patients with HG dysplasia and invasive carcinoma and 92 lesions among patients with LG dysplasia. Of the 2220 images taken in NBI mode, 1104 were classified as NSG and 1100 as SG. Of the 2144 images taken in WLI mode, 1004 were classified as NSG and 1140 as SG. All images were divided into training, validation, and test sets in a 6:2:2 ratio. In NBI mode, there were 1204 images in the training set, 508 in the validation set, and 508 in the test set; in WLI mode, there were 1160 images in the training set, 492 in the validation set, and 492 in the test set.
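A 6:2:2 split such as the one described above can be sketched as follows. This is a minimal illustration of a random image-level split; the authors' exact partitioning procedure (e.g. whether images were grouped by patient or lesion before splitting) is not specified here, and `split_dataset` is a hypothetical helper name.

```python
import random

def split_dataset(images, seed=0):
    """Randomly partition a list of images into train/validation/test
    sets in a 6:2:2 ratio (a sketch; the paper's actual grouping of
    images by patient or lesion may differ)."""
    rng = random.Random(seed)
    shuffled = images[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = round(n * 0.6)
    n_val = round(n * 0.2)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```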
We evaluated the performance of the DL model on the segmentation and classification of images in NBI and WLI modes. Model segmentation was compared against manual segmentation by senior expert endoscopists with at least 10 years of laryngoscopy experience. Model classification as SG or NSG was compared against classification by clinical decision, with pathology as the gold standard.
For segmentation, the average IoU exceeded 70% in both WLI and NBI modes (Table 1). The DL model detected 87% of vocal cord leukoplakia lesions in WLI mode and 92% in NBI mode at an IoU > 0.5. When the IoU criterion was raised to > 0.7, the detection rate in both modes remained acceptable, at greater than 60% (Table S1). Representative segmentation results of the trained model in WLI and NBI modes are shown in Figure 3.
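The IoU criterion used above measures the overlap between a predicted region and the manual annotation. A minimal sketch for axis-aligned bounding boxes is shown below; pixel-mask IoU is analogous (intersection and union of pixel sets), and the function name is illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union for axis-aligned boxes (x1, y1, x2, y2).
    Returns a value in [0, 1]; 1.0 means perfect overlap."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp negative extents to zero for non-overlapping boxes.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```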
To measure the performance of pure classification of target regions, we did not initially set an IoU threshold; we excluded only images with an IoU of zero (1 image was disqualified) (Table 2). The model's binary classification of the WLI test set (616 images) into SG and NSG demonstrated a sensitivity of 93% (95% CI, 88%-98%) and a specificity of 94% (95% CI, 88%-100%). On the NBI test set (620 images), the model demonstrated a higher sensitivity of 99% (95% CI, 97%-101%) and a higher specificity of 97% (95% CI, 93%-101%). The model's PPVs for WLI and NBI were 97% (95% CI, 94%-100%) and 98% (95% CI, 95%-101%), respectively; its NPVs were 87% (95% CI, 78%-96%) and 98% (95% CI, 95%-101%), respectively.
The model performed well in both segmentation and classification. However, overall accuracy depends on two conditions being met simultaneously: the IoU between the segmented lesion and the manual annotation must exceed the preset criterion (IoU > 0.5), and the classification of the segmented lesion area must be correct. We therefore calculated the model's mAP at different IoU thresholds (minimum threshold of 0.5) (Table 3). On our test set at IoU > 0.5, the mAP was 0.81 for WLI and 0.92 for NBI; at IoU > 0.7, the mAP for both modes remained acceptable.
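The joint requirement above (sufficient IoU and a correct class label) is exactly what average precision captures: a detection counts as a true positive only when both conditions hold. A simplified single-class AP sketch is shown below; it integrates the raw precision-recall curve by the rectangle rule, rather than the interpolated variants used by standard benchmarks, and the input format is an assumption for illustration.

```python
def average_precision(detections, num_gt, iou_thresh=0.5):
    """AP for one class. `detections` is a list of
    (confidence, iou_with_matched_gt, class_correct) tuples and
    `num_gt` is the number of ground-truth lesions.
    A detection is a true positive only if its IoU meets the
    threshold AND its predicted class is correct."""
    detections = sorted(detections, key=lambda d: -d[0])  # by descending confidence
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for _conf, iou_val, class_ok in detections:
        if iou_val >= iou_thresh and class_ok:
            tp += 1
        else:
            fp += 1
        recall = tp / num_gt
        precision = tp / (tp + fp)
        ap += (recall - prev_recall) * precision  # rectangle-rule area under P-R
        prev_recall = recall
    return ap
```

Mean AP (mAP) then averages this quantity over the classes (here, SG and NSG).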
Our video model processed at least 25 frames per second with a latency of less than 40 ms in real-time video analysis. Video clips demonstrating classification of NSG and SG are shown for WLI in Video 1 and NBI in Video 2.
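The two figures above are consistent: a 25 fps stream allots 1000/25 = 40 ms per frame, so a per-frame latency under 40 ms is exactly what is needed to keep pace. A trivial sketch of this budget check (hypothetical function name):

```python
def meets_realtime(fps_required=25, per_frame_latency_ms=40):
    """A stream at `fps_required` allots 1000/fps ms per frame;
    processing must finish within that budget to keep up."""
    budget_ms = 1000.0 / fps_required
    return per_frame_latency_ms <= budget_ms
```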