2.2 | Prediction systems
In this work, two cancer-related SAV prediction systems were built using machine learning. The first system, CanSavPrew, contained twenty individual prediction models, one for each of the twenty groups defined by the wild-type amino acid of the SAV. In the second system, CanSavPrewm, each of the twenty groups was further divided into smaller sub-groups according to the mutated amino acid type of the SAV. For example, an alanine should have different prediction models for an acidic (e.g., aspartate or glutamate) and a basic (e.g., arginine, lysine, or histidine) mutated amino acid type, because the essential factors of these SAVs should be distinct. In total, 100 prediction models were built in the second prediction system.
Each prediction model was a two-level Support Vector Machine (SVM) (Chang & Lin, 2011) classifier. The first level comprised twelve SVM classifiers based on three feature sets (sequence-based, structure-based, and micro-environment-based), which are described in the next section. For each feature set, four fitness functions (Equations 1-4) were used for feature selection and performance optimization with a genetic algorithm (Lu, Chen, Yu, & Hwang, 2007; Yu & Lu, 2011), giving one classifier per combination of feature set and fitness function.
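As an illustration of fitness-driven feature selection, the following is a minimal sketch of a generic genetic algorithm over feature bitmasks. It is not the implementation used in this work: the population size, truncation selection, one-point crossover, mutation rate, and the function names `genetic_feature_selection` and `toy_fitness` are all assumptions chosen for brevity. In practice, the fitness callable would return one of the four performance measures obtained from an SVM trained on the selected features; here a toy stand-in is used so that the example runs on its own.

```python
import random


def genetic_feature_selection(n_features, fitness, pop_size=20, generations=30,
                              mutation_rate=0.05, seed=0):
    """Search for a 0/1 feature mask that maximises fitness(mask).

    All hyperparameters are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    # Each individual is a tuple of 0/1 flags, one per candidate feature.
    population = [tuple(rng.randint(0, 1) for _ in range(n_features))
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Truncation selection: keep the fitter half as parents.
        parents = sorted(population, key=fitness, reverse=True)[:pop_size // 2]
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit independently with a small probability.
            child = tuple(bit ^ (rng.random() < mutation_rate) for bit in child)
            children.append(child)
        population = children
        # Elitism: never lose the best mask found so far.
        best = max(population + [best], key=fitness)
    return best


# Toy stand-in for a cross-validated score (hypothetical): reward three
# "informative" features and penalise every other selected feature.
INFORMATIVE = {0, 3, 5}


def toy_fitness(mask):
    hits = sum(mask[i] for i in INFORMATIVE)
    extras = sum(mask) - hits
    return hits - 0.5 * extras


best_mask = genetic_feature_selection(8, toy_fitness)
print(best_mask, toy_fitness(best_mask))
```

Running the search once per fitness function on the same feature set would yield one selected feature subset per measure.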
Four informative measures of predictive performance were used as the fitness functions: accuracy (Acc), Matthews correlation coefficient (MCC), F1 score (F1), and the sum of sensitivity and weighted specificity (Hybrid). They were calculated from the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) as follows:
\(Acc=\frac{TP+TN}{TP+TN+FP+FN}\), (1)
\(MCC=\frac{TP\times TN-FP\times FN}{\sqrt{\left(TP+FP\right)\left(TP+FN\right)\left(TN+FP\right)\left(TN+FN\right)}}\), (2)
\(F1=\frac{2\times Precision\times Sensitivity}{Precision+Sensitivity}\), (3)
\(Hybrid=Sensitivity+\delta\times Specificity\), (4)
where \(Precision=\frac{TP}{TP+FP}\), \(Sensitivity=\frac{TP}{TP+FN}\), \(Specificity=\frac{TN}{TN+FP}\), and \(\delta\) is the ratio of the number of cancer-related SAVs to the number of neutral SAVs, as listed in Table 1.
All SAV descriptors were fed into the SVM, and five-fold cross-validation was performed during model training and testing.
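Equations 1-4 can be computed directly from the four confusion-matrix counts. The sketch below is one possible transcription; the function name `fitness_scores`, the example counts, and the example value of δ are hypothetical, and returning 0 when a denominator is zero is an assumed convention that the equations themselves do not specify.

```python
import math


def fitness_scores(tp, tn, fp, fn, delta):
    """Return Acc, MCC, F1 and Hybrid (Equations 1-4) from confusion counts.

    delta is the ratio of cancer-related to neutral SAVs for the group.
    Zero denominators yield 0.0, which is an assumption of this sketch.
    """
    acc = (tp + tn) / (tp + tn + fp + fn)

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0

    pr_sum = precision + sensitivity
    f1 = 2 * precision * sensitivity / pr_sum if pr_sum else 0.0

    hybrid = sensitivity + delta * specificity
    return {"Acc": acc, "MCC": mcc, "F1": f1, "Hybrid": hybrid}


# Hypothetical counts for a single cross-validation fold.
scores = fitness_scores(tp=40, tn=45, fp=5, fn=10, delta=0.5)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Because Hybrid weights specificity by δ, its range depends on the class ratio of each group, so Hybrid values are not directly comparable between groups with different δ.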
The second level of SVM classifiers processed the prediction results generated by the twelve first-level classifiers (three feature sets multiplied by four fitness functions) to produce the final probability distribution over the two classes, cancer-related and neutral. The class with the largest probability was taken as the final prediction. The two-level SVM system is shown schematically in Figure 1.
FIGURE 1 The two-level SVM prediction system.
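The bookkeeping of the two-level scheme can be sketched as follows. This is only an illustration of how the twelve first-level outputs might be arranged and how the final label is chosen; it does not train any SVM. The names `FIRST_LEVEL`, `second_level_input`, and `final_prediction` are hypothetical, and representing each first-level output as a single numeric score is an assumption.

```python
from itertools import product

FEATURE_SETS = ("sequence", "structure", "micro-environment")
FITNESS_FUNCTIONS = ("Acc", "MCC", "F1", "Hybrid")

# One first-level classifier per (feature set, fitness function) pair.
FIRST_LEVEL = list(product(FEATURE_SETS, FITNESS_FUNCTIONS))


def second_level_input(first_level_outputs):
    """Arrange the first-level outputs in a fixed order as one input vector."""
    return [first_level_outputs[key] for key in FIRST_LEVEL]


def final_prediction(probabilities):
    """Return the class label with the largest probability."""
    return max(probabilities, key=probabilities.get)


# Hypothetical first-level scores for one SAV (assumed to be scalars).
outputs = {key: 0.5 for key in FIRST_LEVEL}
vector = second_level_input(outputs)

# Hypothetical probability distribution from the second-level classifier.
label = final_prediction({"cancer-related": 0.7, "neutral": 0.3})
print(len(vector), label)
```

Keeping the first-level outputs in a fixed order matters because the second-level classifier must see each combination of feature set and fitness function in the same position during training and prediction.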