Machine Learning Modeling
Two cohorts of OSCC and HC serum cases were recruited for machine
learning model development. For OSCC screening modeling, the first
cohort (100 HC and 154 OSCC) was used for classification model
comparison and training. Five-fold cross-validation was conducted on
the first cohort to assess training performance. The MATLAB built-in
app Classification Learner was used to select the optimal model for
training and validation. A variety of classification models were
investigated, including linear discriminant analysis (LDA), logistic
regression, decision tree (DT), naïve Bayes (NB), support vector
machine (SVM), k-nearest neighbor (KNN), and ensemble methods. A
confusion matrix was used to display the classification results and
calculate the overall accuracy, true positive rate (TPR), and positive
predictive value (PPV). The F1 score, the harmonic mean of TPR and PPV,
was used as the single metric to compare the models’ fitting
performance. Finally, the second cohort
(141 HC and 424 OSCC) was used as the validation set. The area under
the curve (AUC), specificity, and sensitivity were used to compare the
generalization ability of the different machine learning models, giving
a fair assessment of how the models trained on the first cohort perform
on unseen data. For the OSCC staging study, the two cohorts of OSCC cases were
combined to obtain sufficient samples for each stage (T1, n=139; T2,
n=167; T3, n=128; T4, n=144). Five-fold cross-validation was then used
to evaluate the prediction accuracy.
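
As an illustration of how the reported metrics relate to a confusion matrix, the following minimal Python sketch derives accuracy, TPR, PPV, specificity, and F1 from the four cells of a binary matrix. It is not the MATLAB workflow used in this study. The class name BinaryConfusion, the choice of OSCC as the positive class, and the example counts are assumptions made for illustration only and are not results from either cohort.

```python
from dataclasses import dataclass


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


@dataclass(frozen=True)
class BinaryConfusion:
    """Cells of a 2x2 confusion matrix (hypothetical helper).

    OSCC is assumed to be the positive class and HC the negative class.
    """
    tp: int  # OSCC samples predicted as OSCC
    fn: int  # OSCC samples predicted as HC
    fp: int  # HC samples predicted as OSCC
    tn: int  # HC samples predicted as HC

    @property
    def accuracy(self) -> float:
        """Fraction of all samples that were classified correctly."""
        return _ratio(self.tp + self.tn, self.tp + self.fn + self.fp + self.tn)

    @property
    def tpr(self) -> float:
        """True positive rate (sensitivity): TP / (TP + FN)."""
        return _ratio(self.tp, self.tp + self.fn)

    @property
    def ppv(self) -> float:
        """Positive predictive value (precision): TP / (TP + FP)."""
        return _ratio(self.tp, self.tp + self.fp)

    @property
    def specificity(self) -> float:
        """True negative rate: TN / (TN + FP)."""
        return _ratio(self.tn, self.tn + self.fp)

    @property
    def f1(self) -> float:
        """Harmonic mean of PPV and TPR, written as 2TP / (2TP + FP + FN)."""
        return _ratio(2 * self.tp, 2 * self.tp + self.fp + self.fn)


if __name__ == "__main__":
    # Made-up counts for demonstration; not data from this study.
    example = BinaryConfusion(tp=40, fn=10, fp=5, tn=45)
    for name in ("accuracy", "tpr", "ppv", "specificity", "f1"):
        print(f"{name}: {getattr(example, name):.3f}")
```

Writing F1 directly in terms of the counts avoids a division by zero when both PPV and TPR are zero, while remaining equal to the harmonic mean whenever both are defined.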