2.2 | Prediction systems
In this work, two cancer-related SAV prediction systems were built
using machine learning. The first system,
CanSavPrew, contained twenty individual prediction
models, one for each group of SAVs sharing the same wild-type amino
acid. In the second prediction system,
CanSavPrewm, each of the twenty groups was further divided into
smaller sub-groups according to the mutated amino acid type of the SAV.
For example, an alanine mutated to an acidic residue (e.g., aspartate or
glutamate) should have a different prediction model from an alanine
mutated to a basic residue (e.g., arginine, lysine, or histidine),
because the essential factors of these SAVs should be distinct. In
total, 100 prediction models were built in the second prediction system.
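The grouping described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. Only the acidic and basic classes are named in the text; the remaining three classes, the `MUTANT_CLASS` table, the function names, and the `(wild, position, mutant)` tuple layout are assumptions chosen so that twenty wild-type groups times five mutated classes gives 100 sub-groups.

```python
from collections import defaultdict

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Mutated-residue classes. "acidic" and "basic" follow the text; the other
# three classes are ASSUMPTIONS made only to illustrate a 20 x 5 layout.
MUTANT_CLASS = {}
for cls, residues in {
    "acidic": "DE",
    "basic": "RKH",
    "polar": "STNQCY",         # assumed class
    "hydrophobic": "AVLIMFW",  # assumed class
    "special": "GP",           # assumed class
}.items():
    for aa in residues:
        MUTANT_CLASS[aa] = cls


def group_by_wild_type(savs):
    """First system: one group (and one model) per wild-type residue."""
    groups = defaultdict(list)
    for wild, position, mutant in savs:
        groups[wild].append((wild, position, mutant))
    return dict(groups)


def group_by_wild_and_mutant_class(savs):
    """Second system: split each wild-type group by mutated-residue class."""
    groups = defaultdict(list)
    for wild, position, mutant in savs:
        groups[(wild, MUTANT_CLASS[mutant])].append((wild, position, mutant))
    return dict(groups)


# Hypothetical SAVs given as (wild-type residue, position, mutated residue).
savs = [("A", 12, "D"), ("A", 40, "E"), ("A", 77, "R"), ("G", 5, "K")]
by_wild = group_by_wild_type(savs)
by_pair = group_by_wild_and_mutant_class(savs)
```

Under these assumptions, the two alanine-to-acidic SAVs share one sub-group, while the alanine-to-arginine SAV falls into a separate one.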
Each prediction model was a two-level Support Vector Machine (SVM)
(Chang & Lin, 2011) classifier.
The first level comprised twelve SVM classifiers based on three
specific feature sets: sequence-based, structure-based, and
micro-environment-based, which are described in the next
section. For each feature set, four fitness functions
(Equations 1-4) were used for feature selection and performance
optimization with a genetic algorithm
(Lu, Chen, Yu, & Hwang, 2007;
Yu & Lu, 2011).
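To make the role of a fitness function in feature selection concrete, the following is a generic genetic-algorithm sketch in Python. It is not the implementation of the cited works. The population size, number of generations, truncation selection, one-point crossover, mutation rate, and the `toy_fitness` function are all assumptions for illustration. In a real pipeline the fitness call would presumably train an SVM on the selected features and return one of the four measures.

```python
import random


def ga_select_features(n_features, fitness, pop_size=20, generations=30,
                       mutation_rate=0.05, seed=0):
    """Search for a boolean feature mask that maximises `fitness(mask)`.

    All hyperparameters here are illustrative assumptions.
    Requires n_features >= 2 and pop_size >= 4.
    """
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[: pop_size // 2]        # truncation selection
        children = [scored[0][:]]                # elitism: keep the best mask
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)   # one-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit with probability `mutation_rate`.
            child = [(not bit) if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
        best = max(population + [best], key=fitness)
    return best


# Hypothetical stand-in fitness: rewards three "informative" features and
# lightly penalises mask size. It replaces SVM training for this sketch only.
INFORMATIVE = {0, 3, 5}


def toy_fitness(mask):
    hits = sum(1 for i in INFORMATIVE if mask[i])
    return hits - 0.1 * sum(mask)


best_mask = ga_select_features(n_features=10, fitness=toy_fitness)
```

Running the search once per fitness function and per feature set would give the three-by-four arrangement of first-level classifiers.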
Four informative measures of predictive performance were used as the
fitness functions: accuracy (Acc), Matthews correlation
coefficient (MCC), F1 score (F1), and the sum of sensitivity and
weighted specificity (Hybrid). They were calculated from the numbers of
true positives (TP), true negatives (TN), false positives (FP), and
false negatives (FN) as follows:
\(Acc=\frac{TP+TN}{TP+TN+FP+FN}\), (1)
\(MCC=\frac{TP\times TN-FP\times FN}{\sqrt{\left(TP+FP\right)\left(TP+FN\right)\left(TN+FP\right)\left(TN+FN\right)}}\), (2)
\(F1=\frac{2\times Precision\times Sensitivity}{Precision+Sensitivity}\), (3)
\(Hybrid=Sensitivity+\delta\times Specificity\), (4)
where \(Precision=\frac{TP}{TP+FP}\), \(Sensitivity=\frac{TP}{TP+FN}\),
\(Specificity=\frac{TN}{TN+FP}\), and \(\delta\) is the ratio
of the number of cancer-related to neutral SAVs, as listed in Table 1.
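As a worked illustration of Equations 1-4, the four measures can be computed from a confusion matrix as in the Python sketch below. The function name, the example counts, and the value `delta=0.5` are hypothetical (the actual ratios are those listed in Table 1), and returning 0.0 when a denominator is zero is an assumption not stated in the text.

```python
import math


def fitness_scores(tp, tn, fp, fn, delta):
    """Return Acc, MCC, F1 and Hybrid (Equations 1-4) from confusion counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # assumed convention

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0

    pr_sum = precision + sensitivity
    f1 = 2 * precision * sensitivity / pr_sum if pr_sum else 0.0

    # delta weights specificity by the cancer-related to neutral ratio.
    hybrid = sensitivity + delta * specificity
    return {"Acc": acc, "MCC": mcc, "F1": f1, "Hybrid": hybrid}


# Hypothetical counts and delta, for illustration only.
scores = fitness_scores(tp=40, tn=45, fp=5, fn=10, delta=0.5)
```

For these counts, Acc is 0.85, F1 is 80/95 (about 0.842), and Hybrid is 0.8 + 0.5 × 0.9 = 1.25.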
All SAV descriptors were fed into the SVM, and five-fold
cross-validation was performed during model training and testing.
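A five-fold split of this kind can be sketched in Python as follows. The text does not say whether samples were shuffled or stratified by class; the shuffled, unstratified split, the function name, and the seed below are assumptions.

```python
import random


def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs; each sample is tested once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: shuffle before splitting
    # Deal the shuffled indices into n_folds nearly equal folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx


# Example with a hypothetical group of 23 SAVs.
splits = list(five_fold_splits(23))
```

Each fold serves once as the test set while the remaining four folds form the training set, so every SAV in a group contributes to exactly one test evaluation.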
The second-level SVM classifier processed the prediction
results generated by the twelve first-level classifiers (three feature
sets multiplied by four fitness functions) to produce the
final probability distribution over the two classes, cancer-related
and neutral. The class with the largest probability was used as
the final prediction. The two-level SVM system is shown schematically in
Figure 1.
FIGURE 1 The two-level SVM prediction system.
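The data flow between the two levels can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: each first-level output is assumed to be a single score for the cancer-related class, and `stand_in_second_level` is a hypothetical placeholder that averages the scores where the trained second-level SVM would be applied.

```python
from itertools import product

FEATURE_SETS = ("sequence", "structure", "micro-environment")
FITNESS_FUNCTIONS = ("Acc", "MCC", "F1", "Hybrid")
# Twelve first-level classifiers: three feature sets x four fitness functions.
FIRST_LEVEL_KEYS = list(product(FEATURE_SETS, FITNESS_FUNCTIONS))


def second_level_input(first_level_outputs):
    """Arrange the twelve first-level outputs into one fixed-order vector."""
    return [first_level_outputs[key] for key in FIRST_LEVEL_KEYS]


def final_prediction(probabilities):
    """Return the class with the largest second-level probability."""
    return max(probabilities, key=probabilities.get)


def stand_in_second_level(vector):
    # HYPOTHETICAL placeholder for the trained second-level SVM:
    # averages the twelve scores into a two-class distribution.
    p = sum(vector) / len(vector)
    return {"cancer-related": p, "neutral": 1.0 - p}


# Hypothetical first-level scores for one SAV.
outputs = {key: (0.4 if key[0] == "structure" else 0.9)
           for key in FIRST_LEVEL_KEYS}
vector = second_level_input(outputs)
label = final_prediction(stand_in_second_level(vector))
```

Keeping the twelve outputs in a fixed order matters because the second-level classifier treats each position as a distinct input feature.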