4 Discussion
The management of vocal cord leukoplakia remains a challenge despite the
use of IEE techniques, such as CE and NBI, for the accurate diagnosis of
laryngeal lesions. Although surgical resection provides a definitive
diagnosis, LG dysplasia of vocal cord leukoplakia may never progress to
malignancy, so resection can amount to potentially unnecessary surgery.
Conversely, the optimal window for surgery may be missed if HG dysplasia
or invasive carcinoma within vocal cord leukoplakia is misdiagnosed.
Treatment stratification that combines laryngoscopic imaging with AI can
help alleviate this management dilemma. To the
best of our knowledge, this is the first study that has applied deep
learning with Mask R-CNN to laryngoscopic WLI and NBI images for the
automated segmentation and classification of vocal cord leukoplakia.
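Although the training details belong to the Methods, the sketch below shows how a Mask R-CNN can be set up for this kind of instance segmentation, using PyTorch/torchvision. It is a minimal illustration only; the class list and all names are assumptions, not the authors' implementation.

```python
# Minimal Mask R-CNN setup for lesion instance segmentation (a sketch,
# not the authors' code). Assumes PyTorch/torchvision; the class list
# (background + non-surgical + surgical lesion) is illustrative.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

NUM_CLASSES = 3  # hypothetical: background, non-surgical lesion, surgical lesion

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

# Swap the box- and mask-prediction heads to match the leukoplakia classes.
in_feats = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_feats, NUM_CLASSES)
in_feats_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_feats_mask, 256, NUM_CLASSES)

model.eval()
with torch.no_grad():
    frame = torch.rand(3, 480, 640)  # stand-in for a normalized RGB frame
    pred = model([frame])[0]         # dict: boxes, labels, scores, masks
```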
Deep learning for the detection of gastrointestinal lesions has
developed rapidly and made remarkable progress in recent
years[12, 22]. Several studies have also reported computer-aided
detection for the segmentation or classification of laryngoscopic
images. In 2015, H. Irem Turkmen et al.[23] classified vocal fold
disorders into five categories using manually extracted features and
Histogram of Oriented Gradients (HOG) descriptors; a flaw of that study,
however, is that the training labels were assigned subjectively, without
pathology as the gold standard for classification. Bin Ji et al.[24]
reported a multi-scale recurrent fully convolutional neural network
(CNN) for laryngeal leukoplakia segmentation.
Despite favorable results, their datasets included only static WLI
images taken under optimal conditions, whereas NBI is crucial for
differentiating benign from malignant lesions. In this study, we
included both WLI and NBI images in the datasets so that the model could
be used across modalities and applied in different hospitals.
Furthermore, real-time video detection is more demanding than
static-image analysis because of complex conditions such as reflected
light, blurring, and airway secretions. As seen in Video 1 and Video 2,
our model displays the extent and subtype of vocal cord leukoplakia in
real time without pausing.
Encouragingly, our DL model also demonstrated high per-lesion
sensitivity (93% for WLI and 99% for NBI) and specificity (94% for WLI
and 97% for NBI) for binary classification into a surgical group versus
a non-surgical group.
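Taking the surgical group as the positive class, these per-lesion figures follow the standard definitions:

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}$$

so a sensitivity of 99% for NBI, for example, means that 99% of lesions assigned to the surgical group were flagged as such by the model.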
By comparison, Kono M. et al.[14] used DL with CNNs for the real-time
diagnosis of pharyngeal cancers and achieved a sensitivity of 92%, but
the specificity and accuracy were 47% and 66%, respectively, markedly
lower than in our study. Meanwhile, our model also detected lesions
correctly with a high mAP (0.81 for WLI and 0.92 for NBI, IoU > 0.5). In
contrast, Rintaro Hashimoto et al.[23] reported a CNN study with an IoU
threshold of 0.3 for the real-time detection of early esophageal
neoplasia in Barrett's esophagus, with an overall mAP of 0.7533 and an
mAP of 0.802 for NBI, lower than in our study despite the looser
threshold. More importantly, we grouped the dataset by combining
pathological diagnoses with clinical decisions, which gives a more
realistic assessment and should aid future clinical adoption.
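The IoU threshold determines when a predicted region is matched to a ground-truth lesion during mAP computation; the following generic sketch (not the authors' evaluation code) illustrates the criterion:

```python
# Generic IoU check behind mAP matching: a detection counts as a true
# positive only if its overlap with a ground-truth box exceeds the
# threshold (0.5 in our evaluation, 0.3 in the Hashimoto study).
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

print(box_iou((10, 10, 50, 50), (30, 30, 70, 70)))  # ~0.14, rejected at IoU > 0.5
```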
It is possible to implement our proposed model in an embedded decision
support system for identifying patients for whom directly proceeding to
surgical treatment might be advantageous. Taken together, the outcomes
of this study showed promise for efficient management of vocal cord
leukoplakia. First, real-time segmentation and classification would
greatly shorten laryngoscopic operation time, especially for
inexperienced endoscopists. Second, the model can aid otolaryngologists
in decision-making. Third, and most importantly for patients, this
approach would obviate unnecessary invasive procedures such as biopsy
and reduce medical expenses.
However, there are also some limitations to this Mask R-CNN system.
First, all tested laryngoscopic images were taken retrospectively at a
single center and obtained from the same video system. A second caveat
is that multiple images were extracted from each patient's laryngoscopic
examination, so data leakage was possible if images from the same
patient appeared in both the training and test sets. A third limitation
is that this Mask R-CNN system could not completely exclude the
influence of airway secretions and reflected light, which were the major
causes of false-positive cases. We believe these limitations will be
overcome in the future by including datasets from a multi-center setting
with different hospitals and various laryngoscopic systems.
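The patient-level leakage described above can be avoided with a grouped train/test split; below is a minimal sketch using scikit-learn's GroupShuffleSplit, with toy data and illustrative names only.

```python
# Grouped split that keeps every frame from one patient on the same side
# of the train/test boundary (toy data; names are illustrative).
from sklearn.model_selection import GroupShuffleSplit

image_paths = [f"frame_{i}.png" for i in range(10)]  # one entry per frame
labels      = [0, 0, 1, 1, 1, 0, 1, 0, 0, 1]
patient_ids = [0, 0, 0, 1, 1, 2, 2, 3, 4, 4]         # source patient per frame

splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(image_paths, labels, groups=patient_ids))

assert {patient_ids[i] for i in train_idx}.isdisjoint(
    {patient_ids[i] for i in test_idx}
)  # no patient contributes to both sets
```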