Figure 6. Hierarchical clustering analysis. Heatmap visualisation of the 15 most significant metabolites, ranked by Spearman’s correlation, for a) COVID (infected) and CTRL (control) samples and b) COVID (infected) and RECOV (infected, in the recovery stage) samples. Dendrogram analysis using Spearman’s distance measure and average linkage for c) COVID-CTRL and d) COVID-RECOV.
In the comparison between COVID-19 and control samples, the analysis revealed two well-defined clusters (Figure 6a). The urinary VOMs 2-methoxythiophene, toluene, α-isophorone, TDN, and hemimellitene were highly correlated with the urinary profile of COVID-19 patients, whereas piperitone, β-ionone, D-carvone, and eudalene were more closely related to the urinary profile of the CTRL (control) group. The dendrogram split the samples completely into two groups that matched the actual study groups (Figure 6b). In the comparison between COVID-19 and recovered samples, although the heatmap still clustered the volatilomic data correctly, the clusters were visibly less well separated than in the first analysis (COVID-CTRL), indicating that the urinary profile of COVID-19 patients is closer to that of the RECOV group than to that of the CTRL group. Urinary VOMs such as hemimellitene, furan, β-damascenone, and α-isophorone showed a higher correlation with the COVID-19 patient group. In contrast, 2,4-dimethylbenzaldehyde, nonanoic acid, 1-methylcycloheptene, and α-terpinene were more closely related to the volatile profile of the recovered patients (Figure 6c). The dendrogram only partially separated the samples of the two groups (Figure 6d).
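The clustering settings named in the caption of Figure 6 (Spearman's distance, average linkage) can be illustrated with a minimal sketch. It uses only the Python standard library and a small made-up set of four sample profiles; the data, the function names, and the choice to stop at two clusters are illustrative assumptions and do not reproduce the software or the data used in the study.

```python
from itertools import combinations

def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_distance(x, y):
    """1 - Spearman's rho: 0 for identical rankings, 2 for reversed rankings."""
    return 1.0 - pearson(ranks(x), ranks(y))

def average_linkage(profiles, n_clusters):
    """Merge the two clusters with the smallest mean pairwise distance
    until n_clusters remain; returns lists of sample indices."""
    dist = {frozenset(p): spearman_distance(profiles[p[0]], profiles[p[1]])
            for p in combinations(range(len(profiles)), 2)}

    def mean_dist(a, b):
        return sum(dist[frozenset((i, j))] for i in a for j in b) / (len(a) * len(b))

    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: mean_dist(clusters[ab[0]], clusters[ab[1]]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return [sorted(c) for c in clusters]

# Hypothetical data: rows are samples, columns are metabolite intensities.
profiles = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.1, 2.3, 2.9, 4.5, 4.8],
    [5.0, 4.0, 3.0, 2.0, 1.0],
    [4.7, 4.9, 2.5, 2.2, 0.9],
]
clusters = average_linkage(profiles, n_clusters=2)
```

Because Spearman's distance depends only on the rank order of the metabolite intensities within each sample, two samples with different absolute intensities but the same ordering are treated as identical, which is why samples 0 and 1 in this toy example merge first.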
To assess the predictive ability of the models in terms of true-positive and false-positive classification, multivariate exploratory receiver operating characteristic (ROC) curves were generated using the Monte Carlo cross-validation (MCCV) methodology. In each iteration, two-thirds of the samples were used to rank feature importance, and the top-ranked features were used to build classification models, which were then validated on the remaining third of the samples. This process was repeated several times to estimate the performance of each model and to calculate confidence intervals. Models were built with the top 3, 5, 10, 20, 30, and 61 most important features, and the resulting curves are reported in Figures 7a and 7c. Figure 7a displays the ROC curves for the different feature sets in the COVID-CTRL comparison (COVID-19 patients and control subjects). The area under the curve (AUC) values, ranging from 0.988 to 1, indicated excellent discriminative accuracy between the two groups. Figure 7c shows the ROC curves for the COVID-RECOV comparison (COVID-19 patients and infected subjects during the recovery period). In this case, the AUC values ranged from 0.937 to 0.987, which still indicates a very good ability to discriminate between the groups. The AUC values are reported with 95% confidence intervals to indicate the reliability of the estimates. Figures 7b and 7d illustrate the predictive accuracy of the biomarker models as a function of the number of features. As more features were included in the models, predictive accuracy improved. This suggests that the selected features contribute to the differentiation between the control and COVID-19 groups, and between the COVID-19 and recovered groups. The predicted class probabilities of the classification models are shown for the COVID-CTRL (Figure 7e) and COVID-RECOV (Figure 7f) comparisons.
Overall, the results demonstrate the promising performance of the biomarker models, with high accuracy in distinguishing COVID-19 patients from both control and recovered subjects.
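The Monte Carlo cross-validation procedure described above (rank features on two-thirds of the samples, validate on the remaining third, repeat, then summarise the AUC) can be sketched as follows. This is a minimal illustration under several assumptions that are not stated in the text: features are ranked by a standardised difference of class means, the classifier is a simple signed sum of scaled features, the split is stratified by class, 100 iterations are used, the interval is a percentile interval, and the data are synthetic. None of these choices is claimed to be the method used in the study.

```python
import random
from statistics import mean, pstdev

def auc(scores, labels):
    """AUC as the probability that a positive sample outscores a negative one
    (ties count as 0.5)."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_top_k(X, y, k):
    """Rank features on the training samples only (assumed criterion:
    standardised difference of class means) and keep the top k."""
    stats = []
    for j in range(len(X[0])):
        a = [row[j] for row, lab in zip(X, y) if lab == 1]
        b = [row[j] for row, lab in zip(X, y) if lab == 0]
        scale = pstdev(a + b) or 1.0
        effect = (mean(a) - mean(b)) / scale
        sign = 1.0 if effect >= 0 else -1.0
        stats.append((abs(effect), j, (mean(a) + mean(b)) / 2, sign, scale))
    top = sorted(stats, reverse=True)[:k]
    return [(j, mid, sign, scale) for _, j, mid, sign, scale in top]

def score(model, row):
    """Higher scores point towards class 1."""
    return sum(sign * (row[j] - mid) / scale for j, mid, sign, scale in model)

def mccv_auc(X, y, k, n_iter=100, seed=0):
    """Repeat random 2/3 train, 1/3 test splits; return the mean AUC and a
    95% percentile interval over the iterations."""
    rng = random.Random(seed)
    pos = [i for i, lab in enumerate(y) if lab == 1]
    neg = [i for i, lab in enumerate(y) if lab == 0]
    aucs = []
    for _ in range(n_iter):
        rng.shuffle(pos)
        rng.shuffle(neg)
        cut_p, cut_n = 2 * len(pos) // 3, 2 * len(neg) // 3
        train = pos[:cut_p] + neg[:cut_n]
        test = pos[cut_p:] + neg[cut_n:]
        model = fit_top_k([X[i] for i in train], [y[i] for i in train], k)
        aucs.append(auc([score(model, X[i]) for i in test], [y[i] for i in test]))
    aucs.sort()
    low = aucs[int(0.025 * (n_iter - 1))]
    high = aucs[int(0.975 * (n_iter - 1))]
    return mean(aucs), (low, high)

# Synthetic data (assumption): 30 samples per class, 10 features,
# of which only the first 3 differ between the classes.
rng = random.Random(1)
y = [1] * 30 + [0] * 30
X = [[rng.gauss(2.0 if lab == 1 and j < 3 else 0.0, 1.0) for j in range(10)]
     for lab in y]
mean_auc, (ci_low, ci_high) = mccv_auc(X, y, k=3)
```

The important design point is that feature ranking happens inside each iteration, on the training portion only. Selecting features on all samples before splitting would leak information from the held-out third into the model and inflate the AUC.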