Table 1: Sensitivity and specificity of the supervised models evaluated with target variable(s) of (a) one dimension (T1-relaxation), (b) one dimension (T2-relaxation), (c) two dimensions (T1-relaxation, T2-relaxation), and (d) three dimensions (T1-relaxation, T2-relaxation, A-ratio), using the leave-one-out training method. Abbreviations and terms: A-ratio (T1-relaxation/T2-relaxation), area under the curve (AUC), classification accuracy (CA), F1 score (the balance between precision and recall), precision (how many selected items were relevant), and recall (how many relevant items were selected). Training with cross-validation (k=5) was also evaluated for comparison (Supp. Fig. 4).
Results
Each edible oil (i.e., peanut, olive, sunflower, corn) was assigned its respective label (A, B, C, D) following the blinded NMR measurements. As depicted in the one-dimensional maps, each oil has a characteristic T1 relaxation and T2 relaxation reading (Figs. 2a-b). The mean T1 relaxation times were (191.3, 199.3, 228.4, 247.8) ms and the mean T2 relaxation times were (127.9, 136.8, 162, 163) ms for (A, B, C, D), respectively.
The spread of the readings was, however, substantially large, making objects (A and B) and objects (C and D) inseparable in the T1 relaxation dimension (P > 0.05) (Fig. 2a). In the T2 relaxation dimension, objects (C and D) were also inseparable (Fig. 2b). This undesirable spread causes the clusters to overlap (similarly to spectral overlap), making classification difficult (if not impossible). One straightforward solution is to increase the SNR (e.g., by increasing the number of scans) and/or the number of samplings, which unfortunately comes at the expense of acquisition time. In addition, the relaxation time of a liquid sample is inherently long. The Clustering NMR method proposed in this work, on the other hand, leverages the combined (T1, T2) relaxation characteristics of the oils, which form visibly unique and specific clusters (a 'molecular fingerprint') in a (pseudo) two-dimensional map (Fig. 2c). The minor exception was corn oil, which partially overlapped with the sunflower oils, possibly owing to adulteration or factory processes. Upon further investigation, we found that this artifact can be removed with higher SNR.
Interestingly, unsupervised clustering analyses (e.g., hierarchical clustering (HC), tree-based classification, and k-means) can be performed conveniently using user-friendly, open-source third-party software (e.g., R or Orange 3.1.2). With a front-end statistical programming language, the clustering analysis, once compiled, can be re-executed on subsequent occasions. The HC analysis successfully separated the (peanut and olive) cluster from the (sunflower and corn) cluster, and subsequently split each cluster into its members (Fig. 3). The HC was constructed from the Euclidean distance (between T1 relaxation and T2 relaxation), and its quantitative linkages (e.g., inter-/intra-cluster similarity) are shown in a heat map. The HC method also confirmed the oil variants (A, A', B, B', C, C', C'', D) according to their respective manufacturers. Similarly, the chemometric approach[31] based on fat compositions (Supp. Fig. 2) and the tree-based classification technique based on T1-relaxation and T2-relaxation cutoff criteria (Supp. Fig. 3) appear to be in good qualitative agreement with the HC classification using the Euclidean distance of the T1 and T2 relaxation times obtained experimentally with NMR. It is worth noting, however, that the figures (i.e., fat compositions) given by the manufacturers are for reference (and not for scientific) purposes. Despite using different clustering criteria (e.g., Euclidean distance, fat compositions, relaxation cutoffs), the clustering analysis models agreed with our observation (Clustering NMR, Fig. 2c). This demonstrates the robustness of the Clustering NMR method, which can be validated using unsupervised techniques.
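The hierarchical clustering step can be illustrated with a minimal sketch. It is not the authors' Orange or R workflow: it uses only the four mean (T1, T2) values reported above (one point per oil, rather than the individual readings), and average linkage is an assumed choice, since the text specifies only the Euclidean distance. The function names are hypothetical.

```python
import math
from itertools import combinations

# Mean (T1, T2) relaxation times in ms, as reported in the Results text.
# One point per oil label; the individual readings are not available here.
MEANS = {
    "A": (191.3, 127.9),
    "B": (199.3, 136.8),
    "C": (228.4, 162.0),
    "D": (247.8, 163.0),
}

def average_linkage(c1, c2, points):
    """Mean pairwise Euclidean distance between members of two clusters.

    Average linkage is an assumption; the text only names the distance metric.
    """
    dists = [math.dist(points[a], points[b]) for a in c1 for b in c2]
    return sum(dists) / len(dists)

def agglomerate(points):
    """Bottom-up clustering; returns the merge history as
    (cluster_1, cluster_2, linkage_distance) tuples."""
    clusters = [(label,) for label in sorted(points)]
    history = []
    while len(clusters) > 1:
        # Pick the closest pair among the current clusters.
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], points),
        )
        history.append((c1, c2, average_linkage(c1, c2, points)))
        # Replace the two clusters by their union.
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append(tuple(sorted(c1 + c2)))
    return history

merges = agglomerate(MEANS)
for c1, c2, d in merges:
    print(c1, "+", c2, f"at distance {d:.2f} ms")
```

On these four mean values, A and B merge first, then C and D, and the two pairs merge last, which mirrors the (peanut, olive) versus (sunflower, corn) split described for Fig. 3. A real analysis would cluster the individual readings, where the spread matters.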
To evaluate the classification accuracy quantitatively, various supervised learning models (i.e., kNN, random forest, neural network, naïve Bayes, and logistic regression) were used to train, validate, and predict the datasets. The area under the curve (AUC) of the receiver operating characteristic (ROC) was on average (0.820, 0.876, 0.915, 0.933) for (one-dimensional (T1-relaxation), one-dimensional (T2-relaxation), two-dimensional (T1-relaxation, T2-relaxation), and three-dimensional (T1-relaxation, T2-relaxation, A-ratio)) target variables, respectively, using the leave-one-out training method (Fig. 4). The A-ratio is the ratio between T1-relaxation and T2-relaxation. Similar conclusions were reached using the cross-validation method (e.g., k=5) (details in Supp. Table 1). This confirms that the sensitivity and specificity of the proposed Clustering NMR method improve substantially at higher (pseudo-)dimensionality (e.g., 2D or multidimensional) compared with low dimensionality (e.g., n=1). With the (minor) exception of logistic regression, all the supervised models performed reasonably well (AUC > 0.80) (Table 1). Furthermore, all the machine learning tasks ran simultaneously, and the computational time was typically less than 1 minute (in this work).
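The leave-one-out procedure over the four sets of target variables can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the readings are synthetic (Gaussian scatter with an assumed standard deviation of 5 ms around the reported means), a 1-nearest-neighbour classifier stands in for the kNN model, and plain accuracy is reported instead of AUC. The names make_synthetic, features, and loo_accuracy are hypothetical.

```python
import math
import random

# Mean (T1, T2) in ms from the Results text; the scatter added below is synthetic.
MEANS = {"A": (191.3, 127.9), "B": (199.3, 136.8),
         "C": (228.4, 162.0), "D": (247.8, 163.0)}

def make_synthetic(n_per_class=10, sd=5.0, seed=0):
    """Hypothetical readings: Gaussian scatter (assumed sd, in ms) around each mean."""
    rng = random.Random(seed)
    data = []
    for label, (t1, t2) in MEANS.items():
        for _ in range(n_per_class):
            data.append((rng.gauss(t1, sd), rng.gauss(t2, sd), label))
    return data

def features(t1, t2, mode):
    """Build the target variables for one reading."""
    a_ratio = t1 / t2  # A-ratio: ratio between T1- and T2-relaxation
    return {"T1": (t1,), "T2": (t2,), "T1,T2": (t1, t2),
            "T1,T2,A": (t1, t2, a_ratio)}[mode]

def loo_accuracy(data, mode):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    X = [features(t1, t2, mode) for t1, t2, _ in data]
    y = [label for _, _, label in data]
    correct = 0
    for i in range(len(X)):
        # Hold out sample i and classify it from all remaining samples.
        nearest = min((k for k in range(len(X)) if k != i),
                      key=lambda k: math.dist(X[i], X[k]))
        correct += y[nearest] == y[i]
    return correct / len(X)

data = make_synthetic()
for mode in ("T1", "T2", "T1,T2", "T1,T2,A"):
    print(f"{mode:8s} leave-one-out accuracy: {loo_accuracy(data, mode):.3f}")
```

The features are left unscaled here, so the A-ratio (of order 1) contributes little to the Euclidean distance next to relaxation times of order 100 ms; standardizing each feature before classification would be a reasonable refinement. The numbers printed depend entirely on the assumed scatter and should not be compared with the AUC values in Table 1.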
Discussion
The proposed Clustering NMR method works on the rationale that the accumulated characteristics of each dimension form a specific and unique signature ('molecular fingerprint'). This concept is borrowed from data mining[32]. Fortunately, the (T1, T2) relaxation times in relaxometry are rather specific and prominent and, as the results suggest, an optimal dimensionality of n=2 to 3 is essential to attain a high AUC (Fig. 4)[33]. With the recent advances in machine learning, it is also becoming computationally cheaper (e.g., shorter analysis time) to process a large dataset. The computational time reported in this analysis (less than one minute) is much shorter than that of conventional two- or multidimensional NMR (>hours), without resorting to the use of Ultrafast NMR.
Two- or multidimensional relaxometry experiments (e.g., T1-T2 correlation spectroscopy) may, however, provide much more information (e.g., cross peaks), but they are far more time-consuming than the Clustering NMR method. One way to speed up acquisition is to employ gradient fields (e.g., Ultrafast NMR[30], continuous spatial encoding[34]), which requires modification of the radio-frequency probe. Machine learning in the form of dimensionality reduction (e.g., principal component analysis (PCA), partial least squares (PLS)) has also been used to reduce the dimensionality in multidimensional spectroscopy (e.g., NMR metabolomics[19,35,36]). Deep-learning-assisted NMR spectroscopy for signal reconstruction has also been demonstrated recently[18]. We summarize and compare the Clustering NMR method with the state-of-the-art methodologies in a SWOT-like analysis (Table 2).
In conclusion, the proposed methodology, termed Clustering NMR, is extremely powerful for rapid and accurate classification of objects using low-field NMR. This methodology is highly disruptive to low-field NMR applications, in particular the recently reported NMR-based PoCT medical diagnostics. These include immuno-magnetic labelled detection (e.g., tumour cells[14,20], tuberculosis[37], and magneto-DNA detection of bacteria[38]) and label-free detection of various pathological states (e.g., blood oxygenation[15]/oxidation level[10] and malaria screening[21,22,39]). Interestingly, with the recent advances in machine learning techniques, computation has become so efficient that large datasets can be processed in almost 'real-time mode', which opens up the opportunity to combine real-time NMR (or MRI) with machine learning simultaneously.
Table 2: State-of-the-art NMR works (with/without machine learning assistance) in comparison with the current work (Clustering NMR).