3 Results
A total of 124 patients with HG dysplasia and invasive carcinoma and 92 patients with LG dysplasia were selected for this study. Because a patient may present multiple lesions, there were 168 lesions among patients with HG dysplasia and invasive carcinoma and 92 lesions among patients with LG dysplasia. Of the 2220 images taken in NBI mode, 1104 were classified as NSG and 1100 as SG. Of the 2144 images taken in WLI mode, 1004 were classified as NSG and 1140 as SG. All images were divided into training, validation, and test sets in a 6:2:2 ratio. In NBI mode, there were 1204 images in the training set, 508 in the validation set, and 508 in the test set; in WLI mode, there were 1160 images in the training set, 492 in the validation set, and 492 in the test set.
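As an illustration of the 6:2:2 partition, the sketch below shows one way such a stratified split could be performed with scikit-learn; the function and variable names are hypothetical, and in practice a patient- or lesion-level split would be preferable to avoid leakage between sets.

```python
from sklearn.model_selection import train_test_split

def split_6_2_2(image_paths, labels, seed=42):
    """Split images into train/validation/test at a 6:2:2 ratio,
    stratified by class label (SG vs. NSG). Names are illustrative."""
    # First hold out 20% of the images as the test set.
    train_val, test, y_train_val, y_test = train_test_split(
        image_paths, labels, test_size=0.2, stratify=labels, random_state=seed)
    # Split the remaining 80% into 75% train / 25% validation (60/20 overall).
    train, val, y_train, y_val = train_test_split(
        train_val, y_train_val, test_size=0.25, stratify=y_train_val,
        random_state=seed)
    return (train, y_train), (val, y_val), (test, y_test)
```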
We evaluated the performance of the DL model on both the segmentation and the classification of images in NBI and WLI modes. Model segmentation was compared against manual segmentation by senior expert endoscopists with at least 10 years of laryngoscopy experience. Model classification as SG or NSG was compared against clinical classification, with pathology as the gold standard.
For segmentation, the average IoU exceeded 70% in both WLI and NBI modes (Table 1). At a threshold of IoU > 0.5, the DL model detected 87% of vocal cord leukoplakia lesions in WLI mode and 92% in NBI mode. When the criterion was raised to IoU > 0.7, the detection rate in both modes remained above 60% (Table S1). Representative segmentation results of the trained model in WLI and NBI modes are shown in Figure 3.
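The detection rates above follow directly from the per-lesion IoU values. A minimal sketch of how IoU and the threshold-based detection rate can be computed from binary masks is given below; all names are illustrative, not the authors' implementation.

```python
import numpy as np

def mask_iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """Intersection over Union between two binary lesion masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, truth).sum() / union)

def detection_rate(ious, threshold: float) -> float:
    """Fraction of lesions whose segmentation IoU exceeds the threshold."""
    return float((np.asarray(ious) > threshold).mean())

# e.g., detection_rate(all_ious, 0.5) and detection_rate(all_ious, 0.7)
```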
To measure pure classification performance on target regions, we did not initially set an IoU threshold; we excluded only images with an IoU of zero (one image was disqualified) (Table 2). The model's binary classification of the WLI test set (616 images) into SG and NSG demonstrated a sensitivity of 93% (95% CI, 88%-98%) and a specificity of 94% (95% CI, 88%-100%). The model's binary classification of the NBI test set (620 images) demonstrated a higher sensitivity of 99% (95% CI, 97%-101%) and a higher specificity of 97% (95% CI, 93%-101%). The PPVs for WLI and NBI were 97% (95% CI, 94%-100%) and 98% (95% CI, 95%-101%), respectively; the NPVs were 87% (95% CI, 78%-96%) and 98% (95% CI, 95%-101%), respectively.
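Intervals extending slightly above 100% are consistent with a normal-approximation (Wald) confidence interval, which is not constrained to [0, 1] when the point estimate is near the boundary. A minimal sketch of these metrics, assuming standard 2x2 confusion-matrix counts (the names are hypothetical):

```python
import math

def proportion_ci(successes: int, total: int, z: float = 1.96):
    """Point estimate and Wald 95% CI for a proportion.
    Note: the Wald interval can exceed [0, 1] near the boundary,
    which is why some reported CIs extend past 100%."""
    p = successes / total
    half = z * math.sqrt(p * (1 - p) / total)
    return p, p - half, p + half

def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int):
    """Sensitivity, specificity, PPV, and NPV, each with a 95% CI."""
    return {
        "sensitivity": proportion_ci(tp, tp + fn),
        "specificity": proportion_ci(tn, tn + fp),
        "ppv": proportion_ci(tp, tp + fp),
        "npv": proportion_ci(tn, tn + fn),
    }
```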
The model performed well in both segmentation and classification. However, the overall accuracy of the model depends on two factors: the IoU between the segmented lesion and the manual annotation must exceed the preset criterion (IoU > 0.5), and the classification of the segmented lesion area must simultaneously be correct. We therefore calculated the mAP of the model at different IoU thresholds (minimum threshold of 0.5) (Table 3). In our test set, with an IoU > 0.5 the mAP was 0.81 for WLI and 0.92 for NBI; with an IoU > 0.7, the mAP for both modes remained acceptable.
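For reference, the sketch below shows a standard all-point interpolated average-precision computation for one class, in which a detection counts as a true positive only if its IoU with the manual annotation exceeds the preset threshold and its predicted class is correct; mAP is then the mean of the per-class APs. This is a generic illustration under those assumptions, not the exact implementation used here.

```python
import numpy as np

def average_precision(confidences, is_tp, n_ground_truth: int) -> float:
    """All-point interpolated AP for one class. `is_tp` flags detections
    that both exceed the IoU threshold and carry the correct class."""
    order = np.argsort(-np.asarray(confidences))  # highest confidence first
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / n_ground_truth
    precision = cum_tp / (cum_tp + cum_fp)
    # Make precision monotonically non-increasing (standard interpolation).
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    # Integrate precision over recall.
    recall = np.concatenate(([0.0], recall))
    return float(np.sum((recall[1:] - recall[:-1]) * precision))
```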
Our video model processed at least 25 frames per second with a latency of less than 40 ms in real-time video analysis. Video clips demonstrating the classification of NSG and SG are shown for WLI in Video 1 and for NBI in Video 2.
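At 25 frames per second, real-time playback allows at most 1000 / 25 = 40 ms of processing per frame, which is consistent with the measured latency. A minimal timing sketch (`model_infer` and `frames` are placeholders, not the authors' code):

```python
import time

def measure_latency(model_infer, frames):
    """Time per-frame inference; real-time 25 fps requires <= 40 ms/frame."""
    latencies_ms = []
    for frame in frames:
        start = time.perf_counter()
        model_infer(frame)  # hypothetical per-frame inference callable
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return max(latencies_ms), sum(latencies_ms) / len(latencies_ms)
```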