Figure 6. Hierarchical clustering analysis. Heatmap visualisation of the
15 most significant metabolites by Spearman’s correlation for
a) COVID (infected) and CTRL (control) samples and b) COVID (infected)
and RECOV (infected in the recovery stage) samples. Dendrogram analysis
using Spearman’s distance measure and average linkage for c) COVID-CTRL
and d) COVID-RECOV.
In the comparison between COVID-19 and control samples, the analysis
revealed two well-defined clusters (Figure 6a). The urinary VOMs
2-methoxythiophene, toluene, α-isophorone, TDN, and hemimellitene showed
a high correlation with the urine profile of COVID-19 patients.
Piperitone, β-ionone, D-carvone, and eudalene were more closely related
to the urinary profile of the CTRL (control) group. The dendrogram
completely separated the samples into two groups that matched the actual
study groups (Figure 6b). In the comparison between COVID-19 and
recovered samples, although the heat map clustered the volatilomic data
into the two groups, the clustering was visibly less distinct than in
the first analysis (COVID-CTRL), indicating that the urinary profile of
COVID-19 patients is closer to that of the RECOV group than to that of
the CTRL group. Urinary VOMs such as hemimellitene, furan,
β-damascenone, and α-isophorone showed a higher correlation with the
COVID patient group, whereas 2,4-dimethylbenzaldehyde, nonanoic acid,
1-methylcycloheptene, and α-terpinene were more closely related to the
recovered patients’ volatile profile (Figure 6c). The dendrogram only
partially separated the samples of the two groups (Figure 6d).
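The clustering step described above can be sketched in a few lines. This
is a minimal illustration, not the pipeline used in the study: it
assumes that Spearman's distance is defined as 1 minus Spearman's rho
between samples, that samples are the rows of the data matrix, and that
the tree is cut into two clusters. The function names
spearman_distance_matrix and cluster_samples are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import rankdata


def spearman_distance_matrix(X):
    """Pairwise distance between rows of X, assumed here to be 1 - Spearman rho."""
    # Spearman's rho is Pearson's correlation computed on the ranks.
    ranks = rankdata(X, axis=1)
    rho = np.corrcoef(ranks)
    dist = 1.0 - rho
    # Remove floating-point asymmetry and force an exact zero diagonal.
    dist = (dist + dist.T) / 2.0
    np.fill_diagonal(dist, 0.0)
    return np.clip(dist, 0.0, 2.0)


def cluster_samples(X, n_clusters=2):
    """Average-linkage hierarchical clustering of the rows (samples) of X."""
    condensed = squareform(spearman_distance_matrix(X), checks=False)
    tree = linkage(condensed, method="average")
    # Cut the dendrogram so that at most n_clusters groups remain.
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Synthetic stand-in data (not the study's measurements): two groups of
# 10 samples with opposite trends across 15 features.
rng = np.random.default_rng(0)
pattern = np.linspace(0.0, 1.0, 15)
group_a = pattern + rng.normal(0.0, 0.05, size=(10, 15))
group_b = pattern[::-1] + rng.normal(0.0, 0.05, size=(10, 15))
labels = cluster_samples(np.vstack([group_a, group_b]))
```

Comparing labels with the known group membership indicates whether the
dendrogram separates the groups completely or only partially, which is
the distinction drawn between the two comparisons above.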
To assess the classification of true positives and false positives and
the predictive ability of the models, multivariate exploratory receiver
operating characteristic (ROC) curves were generated using Monte Carlo
cross-validation (MCCV). In each iteration, feature importance was
evaluated on two-thirds of the samples, and the most important features
were used to build classification models, which were then validated on
the remaining one-third of the samples. This process was repeated
multiple times to estimate the performance of each model and to
calculate the confidence intervals. Models were built with the top 3, 5,
10, 20, 30, and 61 features, and the resulting curves are reported
(Figures 7a and 7c). Figure 7a displays the ROC curves for the different
feature sets in the COVID-CTRL comparison (COVID-19
patients and control subjects). The area under the curve (AUC) values
obtained, ranging from 0.988 to 1, indicated excellent discriminative
accuracy between the two groups. Figure 7c shows the ROC curves for the
COVID-RECOV comparison (COVID-19 patients and infected subjects in the
recovery period). In this case, the AUC values ranged from 0.937 to
0.987, indicating a very good ability to discriminate between the
groups. These values are reported with 95% confidence intervals to
indicate the reliability of the results. Figure 7b and Figure 7d illustrate the
predictive accuracy of the biomarker models as the number of features
increased. As more features were included in the models, predictive
accuracy improved. This suggests that the selected features contribute
to the differentiation between the control and COVID-19 groups and
between the COVID-19 and recovered groups. The predicted class
probabilities from the classification models were also assessed for the
COVID-CTRL groups (Figure 7e) and the COVID-RECOV groups (Figure 7f).
Overall, the results demonstrate the promising performance of the
biomarker models, with high accuracy in distinguishing between the
groups in each comparison.
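The MCCV procedure described above (repeated random 2/3 and 1/3 splits,
feature selection on the training part only, AUC on the held-out part)
can be sketched as follows. The text does not state which feature
ranking or classifier was used, so this sketch makes several
assumptions: features are ranked by a standardised difference of class
means, the classifier is a simple centroid-based linear score, the
splits are stratified by class, and the interval is taken from the
2.5th and 97.5th percentiles of the per-split AUCs. The names
auc_from_scores and mccv_auc are hypothetical.

```python
import numpy as np


def auc_from_scores(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()  # ties count as half
    return (greater + 0.5 * ties) / (len(pos) * len(neg))


def mccv_auc(X, y, n_features, n_splits=50, train_frac=2 / 3, seed=0):
    """Mean AUC and percentile interval over repeated random splits."""
    rng = np.random.default_rng(seed)
    aucs = []
    for _ in range(n_splits):
        # Stratified random split (assumption): train_frac of each class.
        train = np.zeros(len(y), dtype=bool)
        for cls in (0, 1):
            idx = rng.permutation(np.flatnonzero(y == cls))
            train[idx[: int(round(train_frac * len(idx)))]] = True
        X_tr, y_tr, X_te, y_te = X[train], y[train], X[~train], y[~train]

        # Rank features on the training split only, to avoid leakage.
        mu1 = X_tr[y_tr == 1].mean(axis=0)
        mu0 = X_tr[y_tr == 0].mean(axis=0)
        sd = X_tr.std(axis=0) + 1e-12
        top = np.argsort(-np.abs(mu1 - mu0) / sd)[:n_features]

        # Centroid-based linear score (illustrative classifier only).
        w = (mu1[top] - mu0[top]) / sd[top] ** 2
        midpoint = 0.5 * (mu1[top] + mu0[top])
        scores = (X_te[:, top] - midpoint) @ w
        aucs.append(auc_from_scores(y_te, scores))

    aucs = np.array(aucs)
    lo, hi = np.percentile(aucs, [2.5, 97.5])
    return aucs.mean(), (lo, hi)


# Synthetic stand-in data (not the study's measurements): 61 features,
# of which the first 5 differ between the two classes.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 61))
X[y == 1, :5] += 2.0
for k in (3, 5, 10, 20, 30, 61):
    mean_auc, (lo, hi) = mccv_auc(X, y, n_features=k)
    print(f"top {k:2d} features: AUC = {mean_auc:.3f} ({lo:.3f}-{hi:.3f})")
```

Selecting the features inside each split, rather than once on the full
data set, keeps the held-out third independent of the model-building
step, so the reported AUC is not inflated by the selection itself.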