A data matrix of the relative peak areas of the 101 VOMs identified in the three groups under study (COVID-19, RECOV, and CTRL; Table S1, Supplementary material) was processed using the MetaboAnalyst software package [31]. Only VOMs with a frequency of occurrence (FO) higher than 80% in the volatile composition of urine were considered. To obtain a consistent distribution without redundant values, the variables were normalised, and univariate analysis was performed using a t-test (p < 0.05). Consequently, 17 VOMs with insignificant contributions to the statistical analysis were removed from the data matrix. The resulting data matrix was then subjected to multivariate pattern recognition procedures. In partial least squares discriminant analysis (PLS-DA), the information present in the VOM fingerprint was used as multiple variables to visualise group trends and clusters. This analysis revealed a clear separation between the COVID and CTRL samples (Figure 5a). The variable importance in projection (VIP) score plot of the top 10 variables (VIP > 1; Figure S2, Supplementary material) was used to assess the relative contributions of the metabolites to the separation between the COVID and CTRL groups. Accordingly, 1,1,6-trimethyl-dihydronaphthalene (TDN) and 2-heptanone contributed more to the COVID group, whereas D-carvone and 3-methoxy-5-(trifluoromethyl)aniline (MTA) contributed more to the CTRL group.
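The frequency-of-occurrence filter described above can be illustrated with a minimal Python sketch. It is not the MetaboAnalyst workflow itself: the function names, the toy peak areas, and the assumptions that a VOM counts as detected when its relative peak area is above zero and that FO is computed over all samples are introduced here for illustration only.

```python
def frequency_of_occurrence(peak_areas):
    """Return the fraction of samples in which a VOM was detected (area > 0)."""
    detected = sum(1 for area in peak_areas if area > 0)
    return detected / len(peak_areas)


def filter_by_fo(matrix, threshold=0.80):
    """Keep only VOMs whose frequency of occurrence is strictly above threshold.

    matrix maps a VOM name to its relative peak areas, one value per sample.
    """
    return {
        vom: areas
        for vom, areas in matrix.items()
        if frequency_of_occurrence(areas) > threshold
    }


# Hypothetical toy data: relative peak areas across ten urine samples.
toy_matrix = {
    "2-heptanone": [1.2, 0.9, 1.1, 1.4, 0.8, 1.0, 1.3, 0.7, 1.1, 0.9],  # detected in 10/10
    "D-carvone": [0.5, 0.0, 0.4, 0.6, 0.3, 0.5, 0.4, 0.6, 0.5, 0.4],    # detected in 9/10
    "rare_vom": [0.0, 0.0, 0.2, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0],     # detected in 2/10
}

kept = filter_by_fo(toy_matrix)
print(sorted(kept))  # → ['2-heptanone', 'D-carvone']
```

A VOM detected in exactly 80% of the samples is discarded in this sketch, because the text specifies an FO higher than 80%.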
Figure 5. Multivariate analysis of the COVID-19 and control group data. (a) Partial least squares discriminant analysis (PLS-DA) applied to the COVID and CTRL group data; (b) 10-fold CV performance of the PLS-DA classification using different numbers of components; (c) PLS-DA applied to the COVID (infected) and RECOV (recovered) group data; (d) 10-fold CV performance of the PLS-DA classification using different numbers of components (* marks the best Q2 value, i.e. the best classifier).
The robustness of the resulting model was then evaluated by 10-fold cross-validation (CV) to determine the goodness of fit (R2) and the predictive ability for distinguishing between the studied groups (Q2). As shown in Figure 5b, both R2 and Q2 were close to 1, the maximum attainable value, indicating a robust model. A random permutation test with 1000 permutations was also performed to assess the statistical significance of the class discrimination between the COVID and CTRL groups; it further supported the discriminatory ability of the statistical model obtained in this study (Figure S2b, Supplementary material).
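To make the two quality metrics concrete, the sketch below computes R2 and Q2 with one common definition, 1 - SS_residual / SS_total, applied to fitted predictions and to cross-validated predictions respectively. The class labels and prediction values are invented for illustration, and the exact formula used internally by MetaboAnalyst is not stated in the text, so this should be read as an assumption rather than the software's implementation.

```python
def explained_fraction(observed, predicted):
    """Return 1 - SS_residual / SS_total for paired observed/predicted values."""
    mean_obs = sum(observed) / len(observed)
    ss_total = sum((y - mean_obs) ** 2 for y in observed)
    ss_resid = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1.0 - ss_resid / ss_total


# Hypothetical class labels (1 = COVID, 0 = CTRL) and hypothetical model outputs.
y = [1, 1, 1, 0, 0, 0]
fitted = [0.9, 1.0, 0.8, 0.1, 0.0, 0.2]           # model fitted on all samples
cross_validated = [0.8, 0.9, 0.7, 0.3, 0.1, 0.2]  # each sample predicted while held out

r2 = explained_fraction(y, fitted)           # goodness of fit
q2 = explained_fraction(y, cross_validated)  # predictive ability
print(round(r2, 3), round(q2, 3))  # → 0.933 0.813
```

Q2 is normally lower than R2 because the held-out samples play no part in fitting; a Q2 close to R2, with both close to 1, is the pattern reported in Figure 5b.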
The same multivariate analysis was performed to compare urine samples from SARS-CoV-2-infected patients with those from patients who had recovered from COVID-19. Here too, PLS-DA segregated the COVID and RECOV samples into two well-separated clusters corresponding to the infected and recovered patients, respectively (Figure 5c). The 10-fold CV performance and the permutation test indicated good robustness of the PLS-DA model (Figure 5d). The VIP score plot revealed that β-damascenone and α-isophorone contributed most to the COVID group, whereas nonanoic acid and α-terpinene contributed most to the RECOV group (Figure S3a, Supplementary material). Similarly, a random permutation test with 1000 permutations was performed to assess the statistical significance of the class discrimination between the COVID and RECOV groups, further supporting the discriminatory ability of the statistical model (Figure S3b, Supplementary material).
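The logic of a random permutation test with 1000 permutations can be sketched as follows. The test statistic used here (the absolute gap between class means of a single score per sample) is a simple stand-in chosen for illustration; the statistic applied to the PLS-DA model in the study is model-based and is not specified in the text. The scores, labels, and function names are hypothetical.

```python
import random


def mean_gap(scores, labels):
    """Absolute difference between the mean scores of class 1 and class 0."""
    ones = [s for s, lab in zip(scores, labels) if lab == 1]
    zeros = [s for s, lab in zip(scores, labels) if lab == 0]
    return abs(sum(ones) / len(ones) - sum(zeros) / len(zeros))


def permutation_p_value(scores, labels, statistic, n_permutations=1000, seed=0):
    """Estimate how often shuffled labels separate the classes at least as well."""
    rng = random.Random(seed)
    observed = statistic(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # class sizes are preserved, only the assignment changes
        if statistic(scores, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical scores: six samples of class 1 followed by six of class 0.
scores = [2.1, 1.8, 2.4, 2.0, 1.9, 2.2, 0.3, 0.1, 0.4, 0.2, 0.0, 0.5]
labels = [1] * 6 + [0] * 6

p_value = permutation_p_value(scores, labels, mean_gap)
print(p_value < 0.05)
```

A small p-value means that randomly reassigned class labels rarely reproduce the observed separation, which is the sense in which the permutation test supports the discriminatory ability of the model.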
Hierarchical clustering analysis of the volatilomic data was carried out for the two comparisons, COVID-CTRL and COVID-RECOV, and visualised as heat maps with dendrograms (Figure 6). Each heat map was built using Spearman's rank correlation as the distance measure and focuses on the 15 most relevant metabolites for discriminating between the two groups. The heat map provides an intuitive description of the relationship between the samples and the detected VOMs. The colour of each cell corresponds to the concentration of the detected VOM in each sample (dark blue, less concentrated; dark red, more concentrated).
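As an illustration of a Spearman-based distance between two profiles, the sketch below ranks each profile (averaging the ranks of tied values), computes the Pearson correlation of the ranks, and converts it to a distance as 1 - rho. The conversion 1 - rho is an assumption made here; the text does not state how the correlation is turned into a distance, and the example profiles and function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_distance(x, y):
    """Distance in [0, 2]: 0 for identical rank order, 2 for reversed order."""
    return 1.0 - pearson(average_ranks(x), average_ranks(y))


# Hypothetical VOM profiles across four samples.
same_order = spearman_distance([1, 2, 3, 4], [10, 20, 30, 40])
reversed_order = spearman_distance([1, 2, 3, 4], [4, 3, 2, 1])
print(round(same_order, 6), round(reversed_order, 6))  # → 0.0 2.0
```

Because only rank order matters, two VOMs whose concentrations rise and fall together across samples are placed close to each other in the dendrogram even when their absolute peak areas differ by orders of magnitude.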