Feature Selection and Machine Learning Model Development
A variety of classification models were trained to determine the most
suitable one for further development. At the initial stage, all
metabolites selected in the univariate analysis were included as the
feature set for model training. Although all models achieved high
performance in the first cohort of 254 cases, with an accuracy of no
less than 90% (Table S5), their performance on the second cohort (the
unseen cases) differed from one model to another. The SVM
achieved an overall accuracy of 86% with the highest area under the
curve (AUC) value of 0.86 (95% CI: 0.82-0.90). According to the receiver
operating characteristic (ROC) curves, the SVM also showed the best
diagnostic performance, with sensitivity and specificity both at 84%
(Fig. 3A). Therefore, the SVM was selected as the optimal model for
further tuning.
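
The comparison described above can be sketched as follows. This is only a minimal illustration, assuming scikit-learn and a synthetic data set in place of the study's metabolite intensities. The candidate models other than the SVM, the linear kernel, the standardization step, and all variable names are assumptions made for the example, not details taken from the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the metabolite intensity matrix (NOT the study data).
X, y = make_classification(n_samples=800, n_features=66, n_informative=15,
                           random_state=0)
# The first 254 rows play the role of the first cohort; the rest are "unseen".
X_dev, y_dev = X[:254], y[:254]
X_val, y_val = X[254:], y[254:]

# Hypothetical candidate models; the study's actual model list is not assumed.
candidates = {
    "svm": make_pipeline(StandardScaler(), SVC(kernel="linear")),
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}


def compare_models(models, X_train, y_train, X_test, y_test):
    """Fit each model on the training cohort and score it on both cohorts."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        # The AUC needs a continuous score rather than hard class labels.
        if hasattr(model, "decision_function"):
            scores = model.decision_function(X_test)
        else:
            scores = model.predict_proba(X_test)[:, 1]
        results[name] = {
            "train_acc": accuracy_score(y_train, model.predict(X_train)),
            "test_acc": accuracy_score(y_test, model.predict(X_test)),
            "test_auc": roc_auc_score(y_test, scores),
        }
    return results


results = compare_models(candidates, X_dev, y_dev, X_val, y_val)
for name, metrics in results.items():
    print(name, metrics)
```

Reporting training and test accuracy side by side makes the gap between the two cohorts visible for each model, which is the quantity the selection above relies on.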
Feature selection is a critical step for avoiding overfitting by
reducing model complexity. Because all of the selected metabolite ions
differed significantly between the two groups, there were many possible
feature subsets and combinations for model development. To achieve a
more robust machine learning model, it is necessary to select the
optimal set of metabolites as features. For this purpose, we chose a
wrapper-type feature selection strategy that evaluates the chosen
machine learning model’s performance after training with different
candidate feature subsets [33]. Briefly, the absolute
weights of the 66 metabolite ions in the initial SVM model were ranked
to evaluate their discriminating power. Training sets consisting of the
top 60, 50, 40, 30, 20, 15, 10, 5, and 2 metabolite ions were then
composed and used to train models on the first cohort. As shown in
Fig. 3B, the SVM model’s performance remained stable on the training
set across the different feature subsets, whereas accuracy on the test
set dropped sharply when fewer than 15 features were used. The top 15
metabolite ions were therefore retained as the final feature set.
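
The ranking-and-retraining procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: scikit-learn, a linear-kernel SVM (so that per-feature weights exist in `coef_`), standardized features, and synthetic data instead of the study's measurements. The function names `rank_features_by_svm_weight` and `evaluate_subsets` are hypothetical. As in the text, the ranking is computed once from the full model rather than recomputed after each elimination.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 66 metabolite ions (NOT the study data).
X, y = make_classification(n_samples=800, n_features=66, n_informative=15,
                           random_state=0)
X_dev, y_dev = X[:254], y[:254]   # plays the role of the first cohort
X_val, y_val = X[254:], y[254:]   # plays the role of the unseen cohort

# Standardize with statistics from the development cohort only (an assumption),
# so that weight magnitudes are comparable across features.
scaler = StandardScaler().fit(X_dev)
X_dev_s, X_val_s = scaler.transform(X_dev), scaler.transform(X_val)


def rank_features_by_svm_weight(X_train, y_train):
    """Return feature indices sorted by decreasing absolute linear-SVM weight."""
    svm = SVC(kernel="linear").fit(X_train, y_train)
    weights = np.abs(svm.coef_).ravel()
    return np.argsort(weights)[::-1]


def evaluate_subsets(ranking, sizes, X_train, y_train, X_test, y_test):
    """Retrain on the top-k features for each k; return (train_acc, test_acc)."""
    scores = {}
    for k in sizes:
        cols = ranking[:k]
        svm = SVC(kernel="linear").fit(X_train[:, cols], y_train)
        scores[k] = (svm.score(X_train[:, cols], y_train),
                     svm.score(X_test[:, cols], y_test))
    return scores


ranking = rank_features_by_svm_weight(X_dev_s, y_dev)
sizes = [66, 60, 50, 40, 30, 20, 15, 10, 5, 2]
scores = evaluate_subsets(ranking, sizes, X_dev_s, y_dev, X_val_s, y_val)
for k in sizes:
    train_acc, test_acc = scores[k]
    print(f"top {k:2d} features: train={train_acc:.3f} test={test_acc:.3f}")
```

Plotting the two accuracy columns against the subset size gives a curve of the kind used to pick the smallest subset before test accuracy degrades.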
The relative expression levels of these 15 metabolite ions in the test
set are shown in Fig. 3C and Table S6. These 15 metabolite ions also
differed significantly in both the development cohort and the
validation cohort, supporting their feasibility as clinical markers.
The classification result on the test set was
visualized in the dimension-reduced space spanned by the first two
principal components, in which the more than 500 HC and OSCC samples
were well separated (Fig. 3D). The confusion matrix showed that the
optimal SVM model achieved a true positive rate of 94% for OSCC
detection (Fig. 3E). The final prediction accuracy reached 89.6% on the
test set (Table S7).
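
For reference, the metrics quoted from the confusion matrix relate to its four cells as sketched below. The labels are a small made-up example (1 standing for OSCC, 0 for HC), not the study's predictions, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def summarize(tp, fp, tn, fn):
    """Derive the usual diagnostic metrics from the four counts."""
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Made-up labels for illustration only: 1 = OSCC, 0 = HC.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
metrics = summarize(*counts)
print(counts, metrics)
```

The true positive rate depends only on the positive (OSCC) row of the matrix, so it can exceed the overall accuracy when the two classes are not classified equally well.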