2.3 | Feature vector sets
The descriptors of SAVs used for machine learning were grouped into
three feature sets: sequence-based, structure-based, and
micro-environment-based. For the sequence-based feature set, 44
descriptors were extracted from the protein sequence and partitioned
into the three groups listed in Table 2. The first group was
derived from the most widely used substitution indices for the change
from the wild-type residue to the mutant residue at the SAV position.
Three kinds of substitution index were used: BLOSUM62 (Choi, Sims,
Murphy, Miller, & Chan, 2012; Henikoff & Henikoff, 1992), PAM250
(D. T. Jones, Taylor, & Thornton, 1992), and the position-specific
scoring matrix (PSSM), which was derived from PSI-BLAST
(Altschul et al., 1997). The second group
represented the conservation of each residue relative to its homologs.
Fifteen evolutionary entropy values derived from PSI-BLAST, one for each
position of a sliding window of length 15 centered on the SAV, were
used. The average entropy values over the windows of length 15 and 5
centered on the SAV were also calculated. The third group was the amino
acid composition (AAC) (Chou, 2001) of the fifteen-residue peptide,
which was used to represent the composition of the residues neighboring
the central SAV. According to the physicochemical
properties of residues, we used the following classification schemes
(Yu, Chen, Lu, & Hwang, 2006) for the amino acid compositions: scheme H,
with polar (RKEDQN), neutral (GASTPHY), and hydrophobic (CVLIMFW)
groups; scheme V, with small (GASCTPD), medium (NVEQIL), and large
(MHKFRYW) groups; scheme Z, with low (GASDT), medium (CPNVEQIL), and
high (KMHFRYW) polarizability; scheme P, with low (LIFWCMVY), neutral
(PATGS), and high (HQRKNED) polarity; scheme F, with acidic (DE), basic
(HKR), polar (CGNQSTY), and nonpolar (AFILMPVW) groups; and scheme E,
with acidic (DE), basic (HKR), aromatic (FWY), amide (NQ), small
hydroxyl (ST), sulfur-containing (CM), aliphatic 1 (AGP), and
aliphatic 2 (ILV) groups. For clarity, these sequence-based
descriptors were summarized in Table S1.
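As an illustration of the window-based descriptors described above, the following Python sketch extracts a window centered on the SAV, computes a grouped amino acid composition under scheme H, and averages per-residue values (such as entropies) over a window. It is a minimal sketch and not the implementation used in this work: the function names `window`, `grouped_composition`, and `window_mean`, the toy sequence, and the choice to truncate windows at the sequence termini are assumptions made here for illustration.

```python
from collections import Counter

# Scheme H exactly as listed in the text: polar, neutral, hydrophobic.
SCHEME_H = {
    "polar": "RKEDQN",
    "neutral": "GASTPHY",
    "hydrophobic": "CVLIMFW",
}


def window(items, center, size=15):
    """Return up to `size` items centered on the 0-based index `center`.

    Assumption: the window is simply truncated near the termini; the
    text does not say how edge positions were handled.
    """
    half = size // 2
    start = max(0, center - half)
    end = min(len(items), center + half + 1)
    return items[start:end]


def grouped_composition(peptide, scheme):
    """Fraction of residues of `peptide` that fall into each group.

    Residues not covered by the scheme (e.g. 'X') still count in the
    denominator, so the fractions may sum to less than 1.
    """
    lookup = {aa: group for group, members in scheme.items() for aa in members}
    counts = Counter(lookup[aa] for aa in peptide if aa in lookup)
    total = len(peptide)
    return {g: (counts[g] / total if total else 0.0) for g in scheme}


def window_mean(values, center, size):
    """Mean of per-residue values over a window centered on `center`."""
    selected = window(values, center, size)
    return sum(selected) / len(selected)


# Toy example (hypothetical sequence, SAV at 0-based index 10).
sequence = "ACDEFGHIKLMNPQRSTVWY"
peptide = window(sequence, 10, size=15)
composition_h = grouped_composition(peptide, SCHEME_H)
```

The same `grouped_composition` call can be repeated with dictionaries for schemes V, Z, P, F, and E to obtain the remaining composition descriptors.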
In the structure-based feature set, thirteen descriptors were extracted
from PDB and DSSP (Cheng, Randall, Sweredoski, & Baldi, 2005;
Kabsch & Sander, 1983). The B-factor of the Cα atom of the SAV was used
as the first structure-based descriptor. The B-factor reflects the
displacement of atoms from their mean positions in a crystal structure,
which diminishes the scattered X-ray intensity. This displacement may
result from temperature-dependent atomic vibrations or from static
disorder in the crystal lattice. Additionally, the
corresponding solvent accessibility, the eight DSSP-defined secondary
structure elements (H, B, E, G, I, T, S, and others), the energies of
the backbone hydrogen bonds for the acceptor and donor, and the presence
or absence of a disulfide bond, all gathered from DSSP, were also used.
These structure-based descriptors were summarized in Table S2.
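One way to read the count of thirteen is 1 (B-factor) + 1 (solvent accessibility) + 8 (secondary structure classes) + 2 (hydrogen bond energies) + 1 (disulfide flag). The Python sketch below assembles such a vector with a one-hot encoding of the secondary structure. The one-hot encoding, the ordering of the entries, and the names `one_hot_secondary_structure` and `structure_features` are assumptions made for illustration; the actual layout is the one given in Table S2, and the input values shown are made up.

```python
# The seven explicit DSSP codes named in the text; any other code
# (for example a blank or '-') is mapped to the eighth class, "other".
DSSP_CODES = ["H", "B", "E", "G", "I", "T", "S"]
SS_CLASSES = DSSP_CODES + ["other"]


def one_hot_secondary_structure(code):
    """Encode a DSSP secondary structure code as an 8-element one-hot list."""
    label = code if code in DSSP_CODES else "other"
    return [1.0 if cls == label else 0.0 for cls in SS_CLASSES]


def structure_features(b_factor, accessibility, ss_code,
                       hbond_acceptor_energy, hbond_donor_energy,
                       in_disulfide):
    """Concatenate the structure-based descriptors into one 13-element list.

    Assumed order: B-factor, solvent accessibility, 8 secondary structure
    indicators, acceptor and donor hydrogen bond energies, disulfide flag.
    """
    return (
        [float(b_factor), float(accessibility)]
        + one_hot_secondary_structure(ss_code)
        + [float(hbond_acceptor_energy), float(hbond_donor_energy)]
        + [1.0 if in_disulfide else 0.0]
    )


# Hypothetical values for a single SAV position in a beta strand.
example_vector = structure_features(23.5, 0.42, "E", -2.1, -1.8, False)
```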
In the third feature set, the weighted contact number (WCN) model
(Lin et al., 2008) was used to describe the micro-environment
properties of SAVs. The WCN model gives a local packing density
profile, and the WCN profile has been reported to correlate highly with
the sequence conservation profile (Shih, Chang, Lin, Lo, & Hwang,
2012). The WCN value of atom \(i\) was calculated by \(\text{WCN}_{i}=\sum_{j\neq i}^{N}\frac{1}{r_{ij}^{2}}\),
where \(r_{ij}\) was the distance between atom \(i\) and another atom
\(j\), and \(N\) was the number of atoms included in the calculation. In this
work, atom \(i\) was defined as the \(C_{\alpha}\) atom of the SAV, and
different micro-environment properties were represented by varying the
atom type or the source of atom \(j\). The atom type of \(j\) could be
the \(C_{\alpha}\), nitrogen, or oxygen atoms of an amino acid. Atom
\(j\) could be drawn from the same protein chain as the SAV or from the
whole protein to represent the packing density of the SAV. Moreover, it
could be drawn from another protein chain or from molecules such as DNA,
RNA, ligands, or metal ions to represent protein-protein or
protein-molecule interactions. The packing density of
the SAV could be further divided into classes representing the
micro-environment in which the SAV is located (e.g., polar,
hydrophobic, acidic, or basic) according to the physicochemical
properties of the residue to which the \(C_{\alpha}\) atom \(j\)
belongs. The same classification schemes as described for the
sequence-based feature set were used, and the micro-environment-based
descriptors were listed in Table S3.
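The WCN formula and the selection of atom \(j\) by type, source, and residue class can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the code used in this work: the dictionary-based atom records, the `keep` filter argument, the function names, and the toy coordinates are all hypothetical, and real coordinates would be read from a PDB file.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def wcn(center, atoms, keep=None):
    """WCN of `center`: sum of 1 / r_ij^2 over the selected atoms j != i.

    `keep` is an optional predicate that selects which atoms j contribute
    (by atom type, chain, residue class, and so on).
    """
    total = 0.0
    for atom in atoms:
        if atom is center:
            continue  # enforce j != i
        if keep is not None and not keep(atom):
            continue
        total += 1.0 / squared_distance(center["xyz"], atom["xyz"])
    return total


# Toy system: atom i is the C-alpha atom of the SAV in chain A.
sav = {"name": "CA", "chain": "A", "res": "G", "xyz": (0.0, 0.0, 0.0)}
atoms = [
    sav,
    {"name": "CA", "chain": "A", "res": "L", "xyz": (1.0, 0.0, 0.0)},
    {"name": "CA", "chain": "A", "res": "K", "xyz": (0.0, 2.0, 0.0)},
    {"name": "N", "chain": "A", "res": "K", "xyz": (0.0, 0.0, 2.0)},
    {"name": "CA", "chain": "B", "res": "V", "xyz": (2.0, 0.0, 0.0)},
]

HYDROPHOBIC = set("CVLIMFW")  # hydrophobic group of scheme H in the text

wcn_all = wcn(sav, atoms)
wcn_same_chain_ca = wcn(
    sav, atoms, keep=lambda a: a["name"] == "CA" and a["chain"] == "A")
wcn_other_chain = wcn(sav, atoms, keep=lambda a: a["chain"] != "A")
wcn_hydrophobic_ca = wcn(
    sav, atoms, keep=lambda a: a["name"] == "CA" and a["res"] in HYDROPHOBIC)
```

Passing the selection as a predicate keeps the summation in one place, so each micro-environment descriptor differs only in the filter it applies to atom \(j\).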