2.1 | Dataset of SAVs
All of the SAV data were collected from CanProVar 2.0
(J. Li et al., 2011;
Zhang et al., 2017), a human Cancer
Proteome Variation Database. It stores single amino acid alterations in
the human proteome, both germline and somatic, notably those reported in
the published literature as related to the genesis or development of
human cancer. At the time of writing, CanProVar 2.0 contains 156,671
cancer-related SAVs and 967,017 neutral SAVs. To find the protein
structure corresponding to each SAV sequence, protein
BLAST (Altschul, Gish, Miller, Myers, &
Lipman, 1990) was used to search the proteins in the Protein Data Bank.
Five criteria were applied to the search: 1. The E-value of the
alignment results should be smaller than 1e-50; 2. The alignment
coverage of the query protein should be higher than 95%; 3. The organism
of the aligned target protein should be Homo sapiens; 4. The experimental
method of the aligned target protein structure should be X-ray diffraction;
5. The SAV position should be identically aligned between the wild-type
SAV sequence and the aligned target protein. Then, CD-HIT Suite
(Huang, Niu, Gao, Fu, & Li, 2010) was
used to filter out homologous proteins at a sequence identity
cut-off of 0.3. After that, 2,894 cancer-related SAVs and 7,668 neutral
SAVs remained and were separated into twenty groups according to the
wild-type amino acid of the SAV. For each wild-type amino acid, the
numbers of cancer-related and neutral SAVs are listed in Table
1; \(\delta\), the ratio of cancer-related to neutral SAVs, ranges from
22.49% to 65.89%.
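
The selection steps above can be illustrated with a minimal Python sketch. It is not the pipeline used in this work: the `Hit` record, its field names, and the toy values are hypothetical, and criterion 5 is interpreted here as requiring that the residue aligned to the SAV position in the target equals the wild-type residue, which is an assumption. The sketch applies the five BLAST criteria to a list of hits and then computes \(\delta\) per wild-type amino acid as the count of cancer-related SAVs divided by the count of neutral SAVs.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one BLAST hit of a SAV protein against a PDB chain."""
    evalue: float
    query_coverage: float   # fraction of the query covered by the alignment (0..1)
    organism: str           # source organism of the PDB entry
    method: str             # experimental method of the PDB entry
    wild_residue: str       # wild-type residue at the SAV position in the query
    target_residue: str     # residue aligned to that position in the target ("-" = gap)


def passes_criteria(hit: Hit) -> bool:
    """Apply the five search criteria described in the text."""
    return (
        hit.evalue < 1e-50                              # criterion 1
        and hit.query_coverage > 0.95                   # criterion 2
        and hit.organism.lower() == "homo sapiens"      # criterion 3
        and hit.method.lower() == "x-ray diffraction"   # criterion 4
        # criterion 5 (assumed reading): same residue at the SAV position
        and hit.wild_residue == hit.target_residue
    )


def delta_by_wild_type(cancer, neutral):
    """Ratio of cancer-related to neutral SAV counts for each wild-type residue.

    `cancer` and `neutral` are iterables holding one wild-type residue per SAV.
    Residues with no neutral SAVs are skipped to avoid division by zero.
    """
    cancer_counts = Counter(cancer)
    neutral_counts = Counter(neutral)
    return {aa: cancer_counts[aa] / n for aa, n in neutral_counts.items() if n > 0}


# Toy data for illustration only; these are not values from CanProVar or the PDB.
hits = [
    Hit(1e-80, 0.99, "Homo sapiens", "X-RAY DIFFRACTION", "R", "R"),  # kept
    Hit(1e-80, 0.99, "Homo sapiens", "SOLUTION NMR", "R", "R"),       # fails 4
    Hit(1e-30, 0.99, "Homo sapiens", "X-RAY DIFFRACTION", "G", "G"),  # fails 1
    Hit(1e-80, 0.99, "Homo sapiens", "X-RAY DIFFRACTION", "G", "A"),  # fails 5
]
kept = [h for h in hits if passes_criteria(h)]

deltas = delta_by_wild_type(
    cancer=["R", "R", "G"],
    neutral=["R", "R", "R", "R", "G", "G", "G", "G"],
)
```

The redundancy reduction with CD-HIT is left out of the sketch because it is performed by an external tool. In the sketch, `delta_by_wild_type` would be applied only to the SAVs that remain after that step.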