Machine Learning Modeling
Two cohorts of OSCC and HC serum cases were recruited for machine learning model development. For OSCC screening, the first cohort (100 HC and 154 OSCC) was used for classification model comparison and training, with 5-fold cross-validation used to assess training performance. The MATLAB built-in Classification Learner app was used to select the optimal model for training and validation. The classification models investigated included linear discriminant analysis (LDA), logistic regression, decision tree (DT), naïve Bayes (NB), support vector machine (SVM), k-nearest neighbor (KNN), and ensemble methods. A confusion matrix was used to display the classification results and to calculate the overall accuracy, true positive rate (TPR), and positive predictive value (PPV). The F1 score, the harmonic mean of TPR and PPV, was used as the single metric for comparing the fitting performance of the models. Finally, the second cohort (141 HC and 424 OSCC) was used as the validation set. The area under the curve (AUC), specificity, and sensitivity were used to compare the generalization ability of the models, giving a fair assessment of the pretrained models' performance on unseen data. For the OSCC staging study, the OSCC cases from the two cohorts were combined to obtain sufficient samples for each stage (T1, n=139; T2, n=167; T3, n=128; T4, n=144), and 5-fold cross-validation was used to evaluate the prediction accuracy.
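As an illustration of how the metrics named above relate to a two-class confusion matrix, the following minimal Python sketch computes accuracy, TPR (sensitivity), PPV, specificity, and the F1 score from the four cell counts. It is not the MATLAB workflow used in this study; the class name `BinaryConfusion` and the example counts are hypothetical and chosen only for illustration, with OSCC assumed to be the positive class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinaryConfusion:
    """Cell counts of a 2x2 confusion matrix (OSCC assumed positive)."""
    tp: int  # OSCC correctly predicted as OSCC
    fn: int  # OSCC wrongly predicted as HC
    fp: int  # HC wrongly predicted as OSCC
    tn: int  # HC correctly predicted as HC

    @staticmethod
    def _ratio(num: float, den: float) -> float:
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    def accuracy(self) -> float:
        total = self.tp + self.fn + self.fp + self.tn
        return self._ratio(self.tp + self.tn, total)

    def tpr(self) -> float:
        # True positive rate, also called sensitivity or recall.
        return self._ratio(self.tp, self.tp + self.fn)

    def ppv(self) -> float:
        # Positive predictive value, also called precision.
        return self._ratio(self.tp, self.tp + self.fp)

    def specificity(self) -> float:
        # True negative rate.
        return self._ratio(self.tn, self.tn + self.fp)

    def f1(self) -> float:
        # Harmonic mean of PPV and TPR.
        p, r = self.ppv(), self.tpr()
        return self._ratio(2 * p * r, p + r)


# Hypothetical counts for illustration only; not results from this study.
cm = BinaryConfusion(tp=45, fn=5, fp=10, tn=40)
for name in ("accuracy", "tpr", "ppv", "specificity", "f1"):
    print(f"{name}: {getattr(cm, name)():.3f}")
```

Because the F1 score combines TPR and PPV, it penalizes a model that achieves high sensitivity only by over-predicting the positive class, which is one reason it is convenient as a single ranking metric when class sizes are unequal.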