A data matrix of the relative peak areas of the 101 VOMs identified in
the three groups under study, the COVID-19, RECOV, and CTRL groups
(Table S1, Supplementary material), was processed using the
MetaboAnalyst software package [31]. Only VOMs with a frequency of
occurrence (FO) higher than 80% in the volatile composition of urine
were considered. To obtain a consistent distribution without redundant
values, the variables were normalised, and univariate analysis was
performed using a t-test (p < 0.05). Consequently, the 17 VOMs that
did not reach statistical significance were removed from the data
matrix. The resulting data matrix was then
subjected to multivariate pattern recognition procedures. In Partial
Least Squares Discriminant Analysis (PLS-DA), the information present in
the VOMs fingerprint was utilised as multiple variables to visualise
group trends and clusters. This analysis revealed a clear separation
between the COVID and CTRL samples (Figure 5a). The variable importance
in projection (VIP) score plot of the top 10 variables (VIP > 1,
Figure S2, Supplementary material) was used to assess the relative
contributions of the metabolites to the discrimination between the COVID
and CTRL groups. Accordingly, 1,1,6-trimethyl-dihydronaphthalene (TDN)
and 2-heptanone contributed more strongly to the COVID group, whereas
D-carvone and 3-methoxy-5-(trifluoromethyl)aniline (MTA) contributed
more strongly to the CTRL group.
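The frequency-of-occurrence filter described at the start of this section can be illustrated with a minimal sketch. It assumes that FO is the fraction of samples in which a VOM has a non-zero peak area and that the 80% threshold is applied across all samples; the text does not state either detail. The function names and toy values are hypothetical.

```python
def frequency_of_occurrence(peak_areas):
    """Fraction of samples in which a VOM was detected (peak area > 0)."""
    detected = sum(1 for area in peak_areas if area > 0)
    return detected / len(peak_areas)


def filter_by_fo(matrix, threshold=0.80):
    """Keep VOMs whose FO is strictly higher than the threshold.

    matrix maps a VOM name to its list of relative peak areas,
    one value per urine sample.
    """
    return {
        vom: areas
        for vom, areas in matrix.items()
        if frequency_of_occurrence(areas) > threshold
    }


# Hypothetical toy data: 3 VOMs x 10 samples (values are made up).
toy = {
    "vom_a": [1.2, 0.8, 1.1, 0.9, 1.0, 1.3, 0.7, 1.1, 0.9, 1.0],  # detected in 10/10
    "vom_b": [0.5, 0.0, 0.4, 0.6, 0.0, 0.5, 0.3, 0.4, 0.6, 0.5],  # detected in 8/10
    "vom_c": [0.0, 0.0, 0.2, 0.0, 0.1, 0.0, 0.0, 0.3, 0.0, 0.0],  # detected in 3/10
}

kept = filter_by_fo(toy)
print(sorted(kept))
```

Because the criterion is "higher than 80%", a VOM detected in exactly 8 of 10 samples is excluded in this sketch.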
Figure 5. Multivariate analysis of the COVID-19 and control group data:
(a) partial least-squares discriminant analysis (PLS-DA) applied to the
obtained data; (b) 10-fold CV performance of PLS-DA classification using
different numbers of components. Multivariate analysis of the COVID
(infected) and RECOV (infected at the end) group data: (c) PLS-DA;
(d) 10-fold CV performance of PLS-DA classification using different
numbers of components (* marks the best Q2 value, i.e. the best
classifier).
The robustness of the model obtained was then evaluated using 10-fold
cross-validation to determine the goodness of fit (R2) and the
predictive ability (Q2) for distinguishing between the studied groups.
As shown in Figure 5b, the R2 and Q2 values obtained were close to 1,
the maximum attainable value, indicating a robust model. A random
permutation test involving 1000 permutations was performed to assess the
statistical significance of class discrimination between the COVID and
CTRL groups, further supporting the discriminatory ability of the
statistical model obtained in this study (Figure S2b, Supplementary
material).
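For readers less familiar with these two metrics, the sketch below shows how R2 and Q2 are commonly computed: both are one minus a ratio of sums of squares, with R2 using predictions from the model fitted on all samples and Q2 using predictions for held-out samples. The exact computation inside MetaboAnalyst may differ, and the class coding and numerical values here are made up for illustration.

```python
def explained_fraction(y_true, y_pred):
    """Return 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_total = sum((y - mean_y) ** 2 for y in y_true)
    ss_resid = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_resid / ss_total


# Hypothetical class labels coded as 0 (CTRL) and 1 (COVID).
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Predictions from a model fitted on all samples (made-up values).
fitted = [0.05, 0.02, 0.08, 0.01, 0.97, 0.95, 0.99, 0.96]

# Predictions for each sample when it was in the held-out fold (made-up values).
cross_validated = [0.10, 0.05, 0.15, 0.02, 0.90, 0.88, 0.95, 0.85]

r2 = explained_fraction(y, fitted)           # goodness of fit
q2 = explained_fraction(y, cross_validated)  # predictive ability
print(f"R2 = {r2:.3f}, Q2 = {q2:.3f}")
```

Q2 is normally lower than R2 because held-out samples are harder to predict; a large gap between the two is usually read as a sign of overfitting.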
The same multivariate analysis was performed to compare urine samples
from SARS-CoV-2-infected patients with those from patients recovered
from COVID-19. Again, PLS-DA segregated the COVID and RECOV samples into
two well-separated clusters corresponding to the infected and recovered
patients, respectively (Figure 5c). The 10-fold CV performance and
permutation test showed the good robustness of the PLS-DA model
(Figure 5d). The VIP score plot revealed that β-damascenone and
α-isophorone contributed most to discriminating the COVID group, whereas
nonanoic acid and α-terpinene contributed most to discriminating the
RECOV group (Figure S3a,
Supplementary material). Similarly, a random permutation test involving
1000 permutations was performed to assess the statistical significance
of the class discrimination between the COVID and RECOV groups, further
supporting the discriminatory ability of the statistical model (Figure
S3b, Supplementary material).
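The logic of a label-permutation test can be sketched as follows: the class labels are shuffled many times, the separation statistic is recomputed for each shuffle, and the p-value is the fraction of shuffles that separate the groups at least as well as the true labels. The statistic used here (absolute difference between class means of a single score) is a simple stand-in chosen for illustration, not the statistic used by MetaboAnalyst, and the scores are made up.

```python
import random


def mean_difference(scores, labels):
    """Absolute difference between the two class means of a 1-D score."""
    group_1 = [s for s, lab in zip(scores, labels) if lab == 1]
    group_0 = [s for s, lab in zip(scores, labels) if lab == 0]
    return abs(sum(group_1) / len(group_1) - sum(group_0) / len(group_0))


def permutation_p_value(scores, labels, n_permutations=1000, seed=0):
    """Empirical p-value of the observed separation under shuffled labels."""
    rng = random.Random(seed)
    observed = mean_difference(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # keeps the class sizes, breaks the pairing
        if mean_difference(scores, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical scores on one latent component for 8 + 8 samples (made up).
scores = [2.1, 1.8, 2.4, 2.0, 1.9, 2.3, 2.2, 1.7,
          -2.0, -1.6, -2.2, -1.9, -2.1, -1.8, -2.3, -1.5]
labels = [1] * 8 + [0] * 8

p_value = permutation_p_value(scores, labels, n_permutations=1000)
print(f"empirical p = {p_value:.4f}")
```

With 1000 permutations the smallest p-value this estimator can report is 1/1001, which is why such results are often stated as p < 0.001.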
Hierarchical clustering analysis of the volatilomic data was carried out
for the two comparisons, COVID-CTRL and COVID-RECOV, using a heat map
and dendrogram (Figure 6). The heat map was created using Spearman's
rank correlation as the distance measure to build a visual
representation of the dataset, focusing on the 15 metabolites most
relevant for discriminating between the two groups. The heat map
provides an intuitive description of the relationship between the
samples and the detected VOMs. The colour of each cell corresponds to
the concentration of the detected VOM in that sample (dark blue, less
concentrated; dark red, more concentrated).
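A Spearman-based distance of the kind used for this clustering can be sketched as below. The sketch assumes the distance is defined as one minus Spearman's rank correlation, which is a common convention but is not stated in the text; the linkage step that builds the dendrogram is not shown, and the sample profiles are made up.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_distance(x, y):
    """Assumed definition: 1 - Spearman rank correlation (range 0 to 2)."""
    return 1.0 - pearson(average_ranks(x), average_ranks(y))


# Hypothetical VOM profiles for three samples (made-up relative peak areas).
sample_1 = [0.1, 0.5, 0.9, 0.3]
sample_2 = [0.2, 0.6, 1.5, 0.4]  # same ordering of VOMs as sample_1
sample_3 = [0.9, 0.4, 0.1, 0.6]  # reversed ordering

d_12 = spearman_distance(sample_1, sample_2)
d_13 = spearman_distance(sample_1, sample_3)
print(d_12, d_13)
```

Because the distance depends only on ranks, two samples with the same ordering of VOM abundances are treated as identical even when their absolute peak areas differ.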