Feature Selection and Machine Learning Model Development
A variety of classification models were trained to identify the most suitable one for further development. Initially, all metabolites selected in the univariate analysis were used as the feature set for training. All models performed well on the first cohort of 254 cases, each with an accuracy of no less than 90% (Table S5), but their performance on the second cohort (the unseen cases) differed. The SVM achieved an overall accuracy of 86% and the highest area under the curve (AUC), 0.86 (95% CI: 0.82-0.90). On the receiver operating characteristic (ROC) curves, the SVM also showed the best diagnostic performance, with sensitivity and specificity both at 84% (Fig. 3A). Therefore, the SVM was selected as the optimal model for further tuning.
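The comparison metrics named above (AUC, sensitivity, specificity) can be illustrated with a minimal sketch. It is not the pipeline used in this study: the function names, the 0/1 label coding, and the decision threshold are assumptions made for illustration, and the AUC is computed with the standard pairwise-ranking definition.

```python
from typing import Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample receives the higher score; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(
    labels: Sequence[int], scores: Sequence[float], threshold: float
) -> Tuple[float, float]:
    """Sensitivity and specificity when scores >= threshold are called positive.

    The threshold is a hypothetical parameter; the text does not state
    which operating point was used.
    """
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    return tp / (tp + fn), tn / (tn + fp)
```

Applying both functions to each candidate model's scores on the held-out cohort would give the kind of side-by-side comparison summarized in Fig. 3A.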
Feature selection is a critical step for avoiding overfitting because it reduces model complexity. Because all metabolite ions that differed significantly between the two groups were candidates, many feature combinations were possible for model development. To obtain a more robust machine learning model, it is necessary to select the optimal set of metabolites as features. For this purpose, we chose a wrapper-type feature selection strategy, which evaluates the performance of the chosen machine learning model after training on different candidate feature subsets.33 Briefly, the 66 metabolite ions in the initial SVM model were ranked by absolute weight to evaluate their discriminating power. Feature sets consisting of the top 60, 50, 40, 30, 20, 15, 10, 5, and 2 metabolite ions were then constructed and used to train models on the first cohort. As shown in Fig. 3B, the SVM model’s performance on the training set remained stable across feature subsets, whereas accuracy on the test set dropped sharply when fewer than 15 features were used. The top 15 metabolite ions were therefore retained as the final feature set.
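The ranking-and-subset procedure described above can be sketched as follows. This is an illustrative outline only: `rank_features`, `wrapper_scan`, and the `evaluate` callback are hypothetical names, and the sketch assumes the per-feature weights are available as a name-to-weight mapping (for an SVM this would typically mean a linear kernel, which the text does not state explicitly).

```python
from typing import Callable, Dict, List, Sequence

# Subset sizes quoted in the text.
SUBSET_SIZES = [60, 50, 40, 30, 20, 15, 10, 5, 2]


def rank_features(weights: Dict[str, float]) -> List[str]:
    """Return feature names ordered by decreasing absolute weight."""
    return sorted(weights, key=lambda name: abs(weights[name]), reverse=True)


def wrapper_scan(
    weights: Dict[str, float],
    evaluate: Callable[[List[str]], float],
    sizes: Sequence[int] = SUBSET_SIZES,
) -> Dict[int, float]:
    """Score the top-k feature subset for each k in `sizes`.

    `evaluate` is a hypothetical callback that retrains the model on the
    given features and returns a score (for example, test-set accuracy).
    Sizes larger than the number of available features are skipped.
    """
    ranked = rank_features(weights)
    results: Dict[int, float] = {}
    for k in sizes:
        if k > len(ranked):
            continue
        results[k] = evaluate(ranked[:k])
    return results
```

In practice, `evaluate` would retrain the SVM on the first cohort using only the listed features and report accuracy on the second cohort, which yields a curve of the kind shown in Fig. 3B.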
The relative expression levels of these 15 metabolite ions in the test set are shown in Fig. 3C and Table S6. These 15 metabolite ions were also statistically significant in both the development and validation cohorts, supporting their feasibility as clinical markers. The classification result on the test set was visualized in the reduced-dimension space spanned by the first two principal components, in which the more than 500 HC and OSCC samples are well separated (Fig. 3D). The confusion matrix showed that the optimal SVM model achieved a true positive rate of 94% for OSCC detection (Fig. 3E). The final prediction accuracy on the test set reached 89.6% (Table S7).
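For readers unfamiliar with how a true positive rate and an overall accuracy are read off a confusion matrix, a minimal sketch follows. The function names are hypothetical, OSCC is assumed to be the positive class, and the labels in the usage example are toy values, not study data.

```python
from typing import Dict, Sequence


def confusion_counts(
    y_true: Sequence[str], y_pred: Sequence[str], positive: str = "OSCC"
) -> Dict[str, int]:
    """Count TP, FN, FP, TN, treating `positive` as the positive class."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            counts["TP" if pred == positive else "FN"] += 1
        else:
            counts["FP" if pred == positive else "TN"] += 1
    return counts


def summarize(counts: Dict[str, int]) -> Dict[str, float]:
    """True positive rate, true negative rate, and overall accuracy."""
    positives = counts["TP"] + counts["FN"]
    negatives = counts["TN"] + counts["FP"]
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    return {
        "tpr": counts["TP"] / positives,
        "tnr": counts["TN"] / negatives,
        "accuracy": (counts["TP"] + counts["TN"]) / (positives + negatives),
    }


if __name__ == "__main__":
    # Toy labels for illustration only.
    truth = ["OSCC", "OSCC", "OSCC", "HC", "HC"]
    pred = ["OSCC", "OSCC", "HC", "HC", "OSCC"]
    print(summarize(confusion_counts(truth, pred)))
```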