2.1 | Dataset of SAVs
All SAV data were collected from CanProVar 2.0 (J. Li et al., 2011; Zhang et al., 2017), a human Cancer Proteome Variation Database. It stores single amino acid alterations, including both germline and somatic variations in the human proteome, and in particular those reported in the published literature as related to the genesis or development of human cancer. At the time of writing, CanProVar 2.0 contains 156,671 cancer-related SAVs and 967,017 neutral SAVs. To identify an experimentally determined protein structure for each SAV sequence, protein BLAST (Altschul, Gish, Miller, Myers, & Lipman, 1990) was used to search against Protein Data Bank proteins. Five criteria were applied to the search results: (1) the e-value of the alignment should be smaller than 1e-50; (2) the alignment coverage of the query protein should be higher than 95%; (3) the organism of the aligned target protein should be Homo sapiens; (4) the experimental method of the aligned target protein structure should be X-ray diffraction; (5) the SAV position should be aligned to an identical residue in the wild-type SAV sequence and the aligned target protein. Then, CD-HIT Suite (Huang, Niu, Gao, Fu, & Li, 2010) was used to filter out homologous proteins with a sequence identity cut-off of 0.3. After that, 2,894 cancer-related SAVs and 7,668 neutral SAVs remained and were separated into twenty groups according to the wild-type amino acid of the SAV. For each wild-type amino acid, the numbers of cancer-related and neutral SAVs are listed in Table 1, and \(\delta\), the ratio of cancer-related to neutral SAVs, ranged from 22.49% to 65.89%.
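As an illustration only, the five search criteria and the ratio \(\delta\) can be written as a short Python sketch. This is not the pipeline used in this study: the `Hit` record, its field names, and the helper functions `passes_criteria` and `delta` are hypothetical, and the sketch assumes that the BLAST output and the PDB metadata (organism, experimental method) have already been parsed into such records. Criterion 5 is interpreted here as requiring the same residue at the SAV position in the query and in the aligned target.

```python
from dataclasses import dataclass

# Thresholds taken from the five criteria in the text.
E_VALUE_MAX = 1e-50          # criterion 1: e-value must be smaller than this
COVERAGE_MIN = 0.95          # criterion 2: query coverage must be higher than this
ORGANISM = "homo sapiens"    # criterion 3
METHOD = "x-ray diffraction" # criterion 4


@dataclass
class Hit:
    """Hypothetical record for one BLAST hit against a PDB protein."""
    evalue: float
    query_coverage: float   # fraction of the query covered, in [0, 1]
    organism: str           # source organism of the PDB target
    method: str             # experimental method of the PDB structure
    query_residue: str      # wild-type residue at the SAV position
    target_residue: str     # residue aligned to it in the target ("-" for a gap)


def passes_criteria(hit: Hit) -> bool:
    """Return True if the hit satisfies all five criteria."""
    return (
        hit.evalue < E_VALUE_MAX
        and hit.query_coverage > COVERAGE_MIN
        and hit.organism.strip().lower() == ORGANISM
        and hit.method.strip().lower() == METHOD
        # Criterion 5 (assumed reading): no gap, and identical residue.
        and hit.target_residue != "-"
        and hit.query_residue == hit.target_residue
    )


def delta(n_cancer: int, n_neutral: int) -> float:
    """Ratio of cancer-related to neutral SAVs, as a percentage."""
    return 100.0 * n_cancer / n_neutral
```

Applied to the pooled counts after filtering, `delta(2894, 7668)` gives roughly 37.7%, which lies within the per-residue range reported above. The CD-HIT redundancy removal is not modelled in this sketch.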