4 Discussion
The management of vocal cord leukoplakia remains challenging despite the use of IEE techniques, such as CE and NBI, for the accurate diagnosis of laryngeal lesions. While surgical resection provides a definitive diagnosis, LG dysplasia of vocal cord leukoplakia may never progress to malignancy, so resection can result in unnecessary surgery. Conversely, the optimal window for surgery may be missed if HG dysplasia or invasive carcinoma of vocal cord leukoplakia is misdiagnosed. Treatment stratification that combines laryngoscopic imaging with AI can help resolve this management dilemma. To the best of our knowledge, this is the first study to apply deep learning with Mask R-CNN to laryngoscopic WLI and NBI for the automated segmentation and classification of vocal cord leukoplakia.
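For readers who wish to reproduce a comparable architecture, the following is a minimal sketch of how a Mask R-CNN could be instantiated with torchvision and adapted to leukoplakia classes. The class list and count are illustrative assumptions, not our exact configuration:

```python
# Minimal sketch: Mask R-CNN for lesion segmentation and classification.
# The class mapping (background / non-surgical / surgical) is an assumed
# example, not the exact configuration used in this study.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

NUM_CLASSES = 3  # background, non-surgical leukoplakia, surgical leukoplakia (assumed)

def build_model(num_classes: int = NUM_CLASSES):
    # Start from a COCO-pretrained Mask R-CNN with a ResNet-50 FPN backbone.
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

    # Replace the box-classification head to match our class count.
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    # Replace the mask-prediction head likewise.
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask, 256, num_classes)
    return model
```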
The use of deep learning for the detection of gastrointestinal lesions has developed rapidly and made remarkable progress in recent years [12, 22]. Several studies have reported computer-aided detection for the segmentation or classification of laryngoscopic images. In 2015, H. Irem Turkmen et al. [23] classified vocal fold disorders into five categories using manual feature extraction and Histogram of Oriented Gradients (HOG) descriptors. However, one flaw of that study is that the training labels were subjective, as pathology was not used as the gold standard for classification. Bin Ji et al. [24] reported a multi-scale recurrent fully convolutional neural network (CNN) for laryngeal leukoplakia segmentation. Despite favorable results, their datasets included only static images taken under optimal WLI conditions, whereas NBI is crucial for differentiating benign from malignant lesions. In this study, we included both WLI and NBI images in the datasets, anticipating that the model would be used across modalities and applied in different hospitals. Furthermore, real-time video detection is more demanding than static-image analysis because of complex conditions such as reflected light, blurring, and airway secretions. As seen in Video 1 and Video 2, our model displays the extent and subtype of vocal cord leukoplakia in real time without pausing. Encouragingly, our DL model also demonstrated high per-lesion sensitivity (93% for WLI and 99% for NBI) and specificity (94% for WLI and 97% for NBI) for the binary classification into surgical versus non-surgical groups. While Kono M. et al. [14] used DL with CNNs for the real-time diagnosis of pharyngeal cancers with a sensitivity of 92%, their specificity and accuracy were 47% and 66%, respectively, markedly lower than in our study. Our model also localized lesions accurately, with a high mAP (0.81 for WLI and 0.92 for NBI at IoU > 0.5). In contrast, Rintaro Hashimoto et al. [23] reported a CNN study with an IoU threshold of 0.3 for the real-time detection of early esophageal neoplasia in Barrett's esophagus, with an overall mAP of 0.7533 and an NBI mAP of 0.802, lower than in our study. More importantly, we combined pathological diagnosis and clinical decisions into a grouped dataset, which gives a more realistic assessment and should facilitate future clinical adoption.
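To make the reported mAP figures concrete, the matching rule they rest on can be sketched as follows: a predicted box counts as a true positive only when its intersection-over-union (IoU) with a ground-truth box exceeds the chosen threshold (0.5 in our evaluation, versus 0.3 in the cited Barrett's esophagus study). This is a generic illustration of the criterion, not our evaluation code:

```python
# Sketch of the IoU matching criterion behind mAP evaluation.
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_box, gt_box, threshold=0.5):
    # A stricter threshold (0.5 vs. 0.3) demands tighter localization,
    # so a higher mAP at IoU > 0.5 reflects more precise detections.
    return iou(pred_box, gt_box) > threshold
```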
Our proposed model could be implemented in an embedded decision support system to identify patients for whom proceeding directly to surgical treatment might be advantageous. Taken together, the outcomes of this study show promise for the efficient management of vocal cord leukoplakia. First, real-time segmentation and classification would greatly shorten laryngoscopic operation time, especially for inexperienced endoscopists. Second, the model can aid otolaryngologists in decision-making. Third, and most importantly for patients, this approach could obviate unnecessary invasive procedures such as biopsy and reduce medical expenses.
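As a hedged sketch of what such an embedded system might look like, the loop below annotates laryngoscopic video frames with per-lesion predictions using OpenCV. The `build_model` function refers to the earlier sketch, and the label mapping, confidence cutoff, and frame source are all assumptions for illustration:

```python
# Illustrative real-time decision support loop over video frames.
# LABELS and SCORE_CUTOFF are assumed values, not the study's settings.
import cv2
import torch
from torchvision.transforms.functional import to_tensor

LABELS = {1: "non-surgical", 2: "surgical"}  # assumed class mapping
SCORE_CUTOFF = 0.7                           # assumed confidence threshold

@torch.no_grad()
def annotate_stream(model, source=0):
    model.eval()
    cap = cv2.VideoCapture(source)  # camera index or video file path
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        pred = model([to_tensor(rgb)])[0]
        for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
            if score < SCORE_CUTOFF:
                continue
            x1, y1, x2, y2 = map(int, box.tolist())
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(frame, f"{LABELS.get(int(label), '?')} {score:.2f}",
                        (x1, max(y1 - 5, 0)), cv2.FONT_HERSHEY_SIMPLEX,
                        0.6, (0, 255, 0), 2)
        cv2.imshow("leukoplakia decision support (sketch)", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()
```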
However, this Mask R-CNN system also has some limitations. First, all tested laryngoscopic images were collected retrospectively from a single center and obtained with the same video system. A second caveat is that multiple images were extracted from each patient's laryngoscopic examination, so learning bias was possible if images from the same patient appeared in both the training and test sets. A third limitation is that the system could not completely exclude the influence of airway secretions and reflected light, which were the major causes of false-positive detections. We believe these limitations can be overcome in the future by including multi-center datasets from different hospitals and various laryngoscopic systems.
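The second limitation can be addressed in future work by splitting data at the patient level rather than the image level. The following is a minimal sketch using scikit-learn's GroupShuffleSplit, where `image_paths` and `patient_ids` are hypothetical placeholders for the underlying data:

```python
# Sketch of a patient-level split that keeps all images from one patient
# on the same side of the train/test boundary; inputs are hypothetical.
from sklearn.model_selection import GroupShuffleSplit

def patient_level_split(image_paths, patient_ids, test_size=0.2, seed=42):
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(image_paths, groups=patient_ids))
    train = [image_paths[i] for i in train_idx]
    test = [image_paths[i] for i in test_idx]
    # Sanity check: no patient appears in both sets.
    assert not ({patient_ids[i] for i in train_idx}
                & {patient_ids[i] for i in test_idx})
    return train, test
```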