2.3 | Feature vector sets
The descriptors of SAVs used for machine learning were grouped into three feature sets: sequence-based, structure-based, and micro-environment-based. For the sequence-based feature set, 44 descriptors were extracted from the protein sequence and partitioned into the three groups listed in Table 2. The first group comprised widely used substitution scores for the change from the wild-type residue to the mutant residue at the SAV position. Three substitution indices were used: BLOSUM62 (Choi, Sims, Murphy, Miller, & Chan, 2012; Henikoff & Henikoff, 1992), PAM250 (D. T. Jones, Taylor, & Thornton, 1992), and the position-specific scoring matrix (PSSM) derived from PSI-BLAST (Altschul et al., 1997). The second group represented the conservation of each residue relative to its homologs. Fifteen evolutionary entropy values derived from PSI-BLAST were used, one for each position in a sliding window of length 15 centered on the SAV. The average entropy values over the windows of length 15 and length 5 centered on the SAV were also calculated. The third group was the amino acid composition (AAC) (Chou, 2001) of the 15-residue peptide centered on the SAV, which represents the composition of the residues neighboring the SAV. According to the physicochemical properties of the residues, we used the following classification schemes (Yu, Chen, Lu, & Hwang, 2006) for the amino acid compositions: H for polar (RKEDQN), neutral (GASTPHY), and hydrophobic (CVLIMFW); V for small (GASCTPD), medium (NVEQIL), and large (MHKFRYW); Z for low polarizability (GASDT), medium (CPNVEQIL), and high (KMHFRYW); P for low polarity (LIFWCMVY), neutral (PATGS), and high polarity (HQRKNED); F for acidic (DE), basic (HKR), polar (CGNQSTY), and nonpolar (AFILMPVW); E for acidic (DE), basic (HKR), aromatic (FWY), amide (NQ), small hydroxyl (ST), sulfur-containing (CM), aliphatic 1 (AGP), and aliphatic 2 (ILV). For clarity, these sequence-based descriptors are summarized in Table S1.
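As an illustration of the grouped amino acid composition described above, the following minimal Python sketch computes the class fractions of scheme H for the window centered on a SAV. The function names `window` and `grouped_composition` are hypothetical, and truncating the window at the sequence termini is an assumption, since the text does not state how positions near the ends were handled.

```python
from collections import Counter

# Scheme H as given in the text: polar, neutral, hydrophobic.
SCHEME_H = {
    "polar": "RKEDQN",
    "neutral": "GASTPHY",
    "hydrophobic": "CVLIMFW",
}


def window(sequence, center, length=15):
    """Return up to `length` residues centered on the 0-based index `center`.

    Assumption: the window is simply truncated at the sequence termini.
    """
    half = length // 2
    start = max(0, center - half)
    return sequence[start:center + half + 1]


def grouped_composition(peptide, scheme):
    """Fraction of residues in `peptide` that belong to each class of `scheme`."""
    counts = Counter(peptide)
    total = len(peptide)
    return {
        name: sum(counts[aa] for aa in members) / total
        for name, members in scheme.items()
    }


if __name__ == "__main__":
    # Toy sequence, not taken from the paper's dataset.
    seq = "RKEDQGASTPCVLIM"
    print(grouped_composition(window(seq, center=7), SCHEME_H))
```

The other schemes (V, Z, P, F, and E) can be passed in the same way as dictionaries that map a class name to its residue letters.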
The structure-based feature set comprised thirteen descriptors extracted from PDB and DSSP (Cheng, Randall, Sweredoski, & Baldi, 2005; Kabsch & Sander, 1983). The B-factor of the Cα atom of the SAV residue was used as the first structure-based descriptor. The B-factor reflects the displacement of atoms from their mean positions in a crystal structure, which diminishes the scattered X-ray intensity. This displacement may result from temperature-dependent atomic vibrations or from static disorder in the crystal lattice. Additionally, the following information gathered from DSSP was used: the relative solvent accessibility, the eight DSSP-defined secondary structure elements (H, B, E, G, I, T, S, and others), the energies of the backbone hydrogen bonds for acceptor and donor, and whether the residue forms a disulfide bond. These structure-based descriptors are summarized in Table S2.
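The text does not state how the eight secondary structure classes were encoded. One reading that is consistent with the total of thirteen descriptors (one B-factor, one solvent accessibility value, eight secondary structure indicators, two hydrogen bond energies, and one disulfide flag) is a one-hot encoding. The sketch below assumes that encoding; the function name `one_hot_secondary_structure` is hypothetical.

```python
# The seven named DSSP codes from the text; any other code falls into "others".
DSSP_CLASSES = ["H", "B", "E", "G", "I", "T", "S"]


def one_hot_secondary_structure(code):
    """Encode a DSSP secondary structure code as eight 0/1 indicators.

    Assumption: one-hot encoding with a final "others" slot for any code
    that is not one of the seven named classes.
    """
    vector = [1 if code == c else 0 for c in DSSP_CLASSES]
    vector.append(0 if any(vector) else 1)  # "others"
    return vector


if __name__ == "__main__":
    print(one_hot_secondary_structure("E"))
    print(one_hot_secondary_structure("-"))
```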
In the third feature set, the weighted contact number (WCN) model (Lin et al., 2008) was used to describe the micro-environment properties of SAVs. The WCN is a measure of local packing density, and the WCN profile has been reported to correlate highly with the sequence conservation profile (Shih, Chang, Lin, Lo, & Hwang, 2012). The WCN value of atom \(i\) was calculated by \(\text{WCN}_{i}=\sum_{j\neq i}^{N}\frac{1}{r_{ij}^{2}}\), where \(r_{ij}\) is the distance between atom \(i\) and another atom \(j\), and \(N\) is the number of atoms included in the calculation. In this work, atom \(i\) was defined as the \(C_{\alpha}\) atom of the SAV, and different micro-environment properties were represented by varying the atom type or the source of atom \(j\). The atom type of \(j\) could be the \(C_{\alpha}\) atoms, nitrogen atoms, or oxygen atoms of the amino acids. Atom \(j\) could be drawn from the same protein chain as the SAV or from the whole protein, to represent the packing density around the SAV. Moreover, atom \(j\) could be drawn from other protein chains or from molecules such as DNA, RNA, ligands, or metal ions, to represent protein-protein or protein-molecule interactions. The packing density of the SAV could be further divided into classes that represent the micro-environment in which the SAV is located (e.g., polar, hydrophobic, acidic, or basic), according to the physicochemical properties of the residue to which \(C_{\alpha}\) atom \(j\) belongs. The same classification schemes were used as described for the sequence-based feature set, and the micro-environment-based descriptors are listed in Table S3.
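The WCN formula and its class-specific variants can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `wcn` and `wcn_by_class` are hypothetical, coordinates are plain (x, y, z) tuples in ångströms, and an atom at zero distance is taken to be atom \(i\) itself and is skipped.

```python
def wcn(center, others):
    """Weighted contact number: the sum of 1 / r^2 from `center` to each atom.

    `center` and every item of `others` are (x, y, z) tuples.
    Assumption: an atom at zero distance is atom i itself and is skipped,
    which enforces the j != i condition of the formula.
    """
    total = 0.0
    for xyz in others:
        r2 = sum((a - b) ** 2 for a, b in zip(center, xyz))
        if r2 > 0.0:
            total += 1.0 / r2
    return total


def wcn_by_class(center, ca_atoms, scheme):
    """WCN restricted to C-alpha atoms whose residue falls in each class.

    `ca_atoms` is a list of (one_letter_residue, (x, y, z)) pairs and
    `scheme` maps a class name to its residue letters.
    """
    return {
        name: wcn(center, [xyz for aa, xyz in ca_atoms if aa in members])
        for name, members in scheme.items()
    }


if __name__ == "__main__":
    # Toy coordinates, not taken from any real structure.
    sav_ca = (0.0, 0.0, 0.0)
    neighbors = [("D", (1.0, 0.0, 0.0)), ("K", (0.0, 2.0, 0.0)), ("E", (0.0, 0.0, 2.0))]
    print(wcn(sav_ca, [xyz for _, xyz in neighbors]))
    print(wcn_by_class(sav_ca, neighbors, {"acidic": "DE", "basic": "HKR"}))
```

Restricting `others` to a given atom type, chain, or non-protein molecule before calling `wcn` yields the corresponding descriptor variants described in the text.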