*Correspondence:
Hari Mohan Pandey
Department of Computer Science, Edge Hill University, Ormskirk,
Lancashire, UK.
Pandeyh@edgehill.ac.uk
Contact: +447414981569
Investigation of protein sequence similarity based on
physio-chemical properties of amino
acids
Abstract: Comparing protein sequences for similarity is an important
task because it can reveal the evolutionary relationships among the
sequences. Effective computational algorithms are therefore required to
compare the very large number of available sequences. The aim of this
research is to develop efficient tools for protein sequence comparison
and phylogenetic study. The proposed method performs a feature
generation process based on the physio-chemical properties of amino
acids, chosen to describe the evolutionary relationship among the
species in a protein family. Each protein sequence is transformed into
an 80-dimensional feature vector defined over groups of amino acids.
Four different datasets were used to validate the accuracy of the
proposal, and a correlation coefficient of \(0.94417\) with ClustalW was
found on the ND5 dataset, which is higher than that of some existing
methods. The results demonstrate the effectiveness of the method for
similarity analysis among sequences.
Keywords: Sequence similarity, amino acids, Physio-chemical
property, Markov Chain transition matrix.
Introduction
Research in the field of Computational Biology has seen significant
growth in the last decades. This has produced rich data concerning
protein sequences, structures, and gene expression, and has aided the
prediction and analysis of DNA. With this massive generation of protein
sequences, it is prudent to develop efficient tools for phylogenetic
study, and in particular effective computational algorithms that can
compare the similarities among a very large number of sequences. Studies
show that some protein sequences do not possess noteworthy sequence
alignment similarities, which is a great impediment to sequence
comparison and analysis [14]. This may be because of unequal sequence
lengths, inversion, transposition and translocation at the sub-string
level [45]. Alignment-free methods are therefore a more practical and
cost-effective approach, as they rely on feature vectors for identifying
attributes. These methods are realized in two steps. First, fixed-length
feature vectors are derived from protein sequences; second, these
vectors are provided as input to the similarity comparison algorithms.
There are various approaches for creating useful datasets, such as
predicting transcriptional activity of multiple site p53 mutants [31],
predicting drug-target interaction networks [30], HIV cleavage sites in
proteins [15], body fluids [33], antimicrobial peptides [54], colorectal
cancer related genes [39], S-nitrosylation modification sites, protein
subcellular locations [40], and many more that prove to be very useful
in phylogenetic study.
Sequence alignment is a complex process, and numerous methods have been
proposed to reduce its complexity and provide reliable results. One such
method for sequence comparison is graphical representation [20], which
depicts protein sequences with the aid of mathematical descriptors that
help in recognizing the similarities between them. Numerical
characterization [44] of protein sequences captures the essentials of
their amino acid compositions. Each sequence is mapped to a distance
frequency matrix and a similarity score is computed using a suitable
distance measure. With the k-string dictionary [60], protein sequences
can be represented by comparatively low-dimensional frequency vectors at
low cost. This is then followed by singular value decomposition, which
provides a more precise vector representation of the protein sequences
with the help of a tree. Fuzzy integral [49] methods for similarity
comparison assign similarity scores within the closed interval [0,1] to
two selected sequences. A protein sequence is built from 20 amino acids.
Protein sequences can be described by employing a transition probability
matrix, fuzzy measures and fuzzy integrals. A distance matrix can be
derived from the fuzzy integral similarities, and a phylogenetic tree
can be constructed from it. Chou's pseudo amino acid composition [25]
can also be utilized for alignment-free similarity comparison. On the
basis of the proportion of amino acids, the distance between the first
and every other amino acid, and the arrangement of the amino acids, a
60-dimensional feature vector is derived. The phylogenetic tree is
constructed from this representation. This proved to be economical in
terms of space and time complexity compared to other alignment-free
methods. Pseudo-Markov transition probabilities [43] among the 20 amino
acids can also be used to derive similarities among protein sequences.
This method encodes the protein sequence into a \(440\)-D (dimensional)
feature vector. This vector comprises a \(400\)-D Pseudo-Markov
transition probability vector among the \(20\) amino acids, a \(20\)-D
content ratio vector and a \(20\)-D position ratio vector of amino acids
in the sequence. The protein sequences are compared by calculating the
Euclidean distance between these vectors.
Deploying Markov chain parameters [52] is also a very effective method
for similarity comparison. Markov chain parameters are evaluated based
on the frequencies of occurrence of all the possible pairs of amino
acids for every gene sequence, without alignment. These features can be
used to derive the similarity between two gene sequences utilizing a
fuzzy integral algorithm. This algorithm has the advantage of better
clustering performance for gene sequence comparison. The \(H\)-curve
[26] is very effective in the analysis of local as well as global
features of long protein sequences. The information derived from the
nucleotide sequences is mapped from a four-letter language into a 3D
space function called the H-curve. This curve also provides integral
information about the DNA. Alternating word frequency and normalized
Lempel-Ziv complexity [65] also promise a cost-effective sequence
similarity analysis. Protein sequence alignments can also be
approximated to analyze the similarities through the maximal segment
pair score [1]. This method is robust and well suited to the analysis of
long protein sequence databases, motif searches and gene identification
searches. Protein fold prediction methods [56] use classifiers to derive
robust evolutionary knowledge from PSI-BLAST and PSI-PRED profiles and
give a comprehensive feature set. This information is crucial for
protein function analysis and structure prediction. A random forest (RF)
classifier [55] applied to a feature set can be used for both sequence
and structure prediction using three large datasets. Position-based
features [16] are also considered an efficient and reliable means of
describing the distribution of amino acids. Studies show that
evolutionary information provides an efficient way to analyze protein
sequences [19, 38, 58, 64]. Methods of phylogenetic analysis contribute
to function prediction and can be used in identifying similarities
between life forms by analyzing their medicinal qualities [35].
Methods of alignment-free sequence analysis are very effective in
phylogenetic classification of protein sequences, horizontal gene
transfer recognition and discovery of recombined sequences. These
methods are also economical in terms of computation, as they are usually
of linear complexity in the length of the query sequence [8]. Compared
to alignment-based methods, they do not depend on assumptions about the
evolutionary trajectories of sequence changes. These methods are
mathematically justifiable through linear algebra and information
theory. Most of them can easily be applied using standard tree-building
software [22, 37]. Alignment-free algorithms are expanding their
applications in phylogenomics and horizontal gene transfer [6],
population genetics [27] and the relations between genome and epigenome
[48]. These methods have evolved and improved their performance in the
last decades [7], yet challenges remain regarding effective
benchmarking approaches for alignment-free similarity analysis [68]. The
volume of sample datasets available [18] is outpacing the storage and
processing capacities of the computers used today for research.
Alignment-free methods are widely used in primary next-generation
sequencing applications [5, 53, 67, 50] and can efficiently derive
biological information from raw next-generation data.
These alignment-free techniques also have some limitations; for example,
it is difficult to assign a protein sequence to a specific cluster on
the basis of the properties concerned. Still, they perform well for
pattern recognition with known protein clusters, and they have
considerable potential compared to alignment-based techniques across a
number of applications in bioinformatics. In this paper, we attempt to
address these problems by introducing three different quantitative
methods based on the physio-chemical properties of amino acids. Based on
these methods, a feature vector consisting of \(80\) features is
generated for each species. The Euclidean distance is used to measure
the distance between two feature vectors \(P\) and \(S\), from which the
proximity among the species is observed and reported. Our proposed
technique is more precise than a few existing techniques for comparison
analysis on the ND5 and ND6 datasets, and the phylogenetic trees
obtained using this method are found to be accurate on the F10 and G11
datasets.
In brief, the contributions of this research work are summarized as
follows:
- Characterization of amino acids based on their chemical
properties: Among the large range of physio-chemical properties of
amino acids, the side chain effect plays an important role in the
formation of the tertiary structure of proteins. According to the side
chain effect, the amino acids are classified into eight different
groups, so each primary protein sequence is rendered into another
representation for further analysis.
- Procedure for obtaining a feature vector based on the Markov
chain transition matrix: Feature vectors can be identified from the
transition probabilities among the amino acid groups. A procedure is
devised to generate a feature vector of \(64\) features based on the
physio-chemical properties of amino acids. The ND5 and ND6 protein
families and ten species each from the F10 and G11 protein families
have been studied in this research.
- Procedure for obtaining feature vectors based on the content
ratio and distribution ratio of amino acid groups: Based on the
physio-chemical properties of the amino acid groups, an
eight-dimensional content ratio vector describing the frequency of each
group and another eight-dimensional vector describing the position
distribution of each group have been devised. Further, we study the
content ratio and distribution ratio among the eight groups of amino
acids of the ND5, ND6, F10 and G11 protein families. These points are
elaborated in the following sections.
The rest of this article is arranged as follows. Section 2 defines the
fundamental parameters and describes the employed method. Section 3
presents the experimental results and discussion, and establishes the
usefulness of the proposed method. Section 4 concludes the paper by
highlighting the key findings of the analysis.
Proposed Methods and Materials
In this section, three methods are proposed to analyze primary protein
sequences in detail on the basis of the Markov chain transition matrix,
the content ratio and the distribution of chemical groups of amino
acids. For the sake of clarity, the methods are explained on primary
sequences obtained from different families.
Characterization of amino acids based on physio-chemical
properties
The structure of a protein depends largely on the physio-chemical
properties of its amino acids. These properties provide information
about the coding region of the gene sequence [32] and about the function
of the gene coded by the region [2]. Methods of similarity analysis of
DNA and protein sequences serve as effective tools in phylogenetics
research [46]. The primary protein sequence consists of twenty amino
acids represented by characters, as shown in Table 1. They play an
important role in determining the three-dimensional structure of
proteins, and hence biological processes depend upon the physio-chemical
properties of amino acids [3, 10, 12, 51]. According to the side chain
effect, the twenty amino acids can be classified into eight different
groups, as shown in Table 1 and discussed in [17]. The distribution of
the eight types of amino acids describes protein primary structures. To
make the feature vector of the protein primary structure precise, the
classification of amino acids is defined in Equation 1.
\(P(S(i))=\begin{cases}
D & \text{if } S(i)\in\{D,E\}\\
R & \text{if } S(i)\in\{R,H,K\}\\
Y & \text{if } S(i)\in\{F,Y,W\}\\
A & \text{if } S(i)\in\{I,L,V,A,G\}\\
P & \text{if } S(i)\in\{P\}\\
C & \text{if } S(i)\in\{M,C\}\\
S & \text{if } S(i)\in\{S,T\}\\
N & \text{if } S(i)\in\{Q,N\}
\end{cases}\) (1)
Here, \(S(i)\) represents the \(i^{th}\) character in the given protein
sequence and \(P(S(i))\) represents the corresponding group
representative of amino acid \(S(i)\). For example, for the protein
sequence \(S=\text{AGMEQQTMPHERCSNPTTGHIRTF}\), the feature sequence is
\(P(S)=\text{AACDNNSCPRDRCSNPSSARARSY}\). Composition and distribution
of amino acids are two key factors of protein sequences. They have been
used in different areas such as protein similarity, classification on
the basis of structure [63, 66, 36] or chemical composition, and
identification of patterns among protein sequences. There are mainly two
ways of representing protein sequences [41]: discrete and sequential.
Both have their own shortcomings. The sequential representation fails
when a protein does not have much sequence similarity to known protein
sequences, whereas the main drawback of the discrete representation is
the loss of ordering information. Thus a new representation is proposed
here: an \(80\)-D vector that takes both kinds of features into
consideration.
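The grouping of Equation 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `GROUPS`, `AA_TO_GROUP` and `reduce_sequence` are hypothetical and not part of the original method description, and the handling of non-standard residue symbols (here, an error) is our own choice.

```python
# Group representative -> member amino acids, as listed in Equation 1.
GROUPS = {
    "D": "DE",
    "R": "RHK",
    "Y": "FYW",
    "A": "ILVAG",
    "P": "P",
    "C": "MC",
    "S": "ST",
    "N": "QN",
}

# Invert the table so that each of the 20 amino acids maps to its group letter.
AA_TO_GROUP = {aa: rep for rep, members in GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Return the feature sequence P(S) over the 8-letter group alphabet.

    Assumption: symbols outside the 20 standard amino acids (e.g. X, B, Z)
    are not handled and raise a KeyError.
    """
    return "".join(AA_TO_GROUP[aa] for aa in seq.upper())


# The worked example from the text.
print(reduce_sequence("AGMEQQTMPHERCSNPTTGHIRTF"))
```

Applied to the example sequence above, the function reproduces the feature sequence given in the text.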
Construction of 64-D vector using Markov chain transition
matrix
Let \(S=S_{1},S_{2},S_{3},\ldots,S_{N}\) be a protein sequence of length
\(N\) (so that \(S_{k}=S(k)\) in the notation of Equation 1) written
over \(A=\{A_{1},A_{2},\ldots,A_{8}\}\), a set of \(8\) letters
representing the amino acid groups defined in Equation 1. For
\(1\leq i\leq 8\), a group \(A_{i}\) is said to appear at position \(k\)
in the sequence \(S\) if \(S_{k}=A_{i}\), and for \(1\leq j\leq 8\), a
pair \(A_{i}A_{j}\) is said to occur at adjacent positions \(k,k+1\) if
\(S_{k}S_{k+1}=A_{i}A_{j}\). Correspondingly, the \(64\)-dimensional
vector is defined as \((P_{1,1},P_{1,2},\ldots,P_{8,8})\), where \(P\)
is an \((8\times 8)\) matrix with elements
\(\{P_{i,j}:i,j=1,2,\ldots,8\}\). A random process
\((X_{0},X_{1},\ldots)\) with finite state space
\(\{A_{1},\ldots,A_{8}\}\) is said to be a Markov chain with transition
matrix \(P\) [24] if, for all \(n\), all \(i,j\in\{1,\ldots,8\}\) and
all \(i_{0},\ldots,i_{n-1}\in\{1,\ldots,8\}\),
\(P(X_{n+1}=A_{j}\mid X_{0}=A_{i_{0}},\ldots,X_{n-1}=A_{i_{n-1}},X_{n}=A_{i})=P(X_{n+1}=A_{j}\mid X_{n}=A_{i})=P_{i,j}\).
The elements of matrix \(P\) are called transition probabilities. The
element \(P_{i,j}\) is the conditional probability of moving to state
\(A_{j}\) given that the current state is \(A_{i}\); for the eight
groups of amino acids it is estimated as in Equation 2.
\(P_{i,j}=\begin{cases}
\dfrac{n_{ij}}{n_{i}} & \text{if } A_{i}\neq S_{N}\\
\dfrac{n_{ij}}{n_{i}-1} & \text{if } A_{i}=S_{N}
\end{cases}\) (2)
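Here, \(n_{i}\) is the number of occurrences of group \(A_{i}\) in the
sequence, \(n_{ij}\) is the number of times \(A_{i}\) is immediately
followed by \(A_{j}\), and \(S_{N}\) is the last character of the
sequence. We set \(P_{i,j}=0\) if \(n_{i}=0\), or if \(A_{i}\) occurs
exactly once and that occurrence is at the end of the sequence. The pair
counts satisfy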
\(\sum_{j=1}^{8}n_{ij}=\begin{cases}
n_{i} & \text{if } A_{i}\neq S_{N}\\
n_{i}-1 & \text{if } A_{i}=S_{N}
\end{cases}\)
In the context of an amino acid sequence, the Markov chain transition
probability matrix can be expressed as:
\(P=\begin{bmatrix}
P_{1,1} & P_{1,2} & \cdots & P_{1,8}\\
P_{2,1} & P_{2,2} & \cdots & P_{2,8}\\
P_{3,1} & P_{3,2} & \cdots & P_{3,8}\\
\vdots & \vdots & \ddots & \vdots\\
P_{8,1} & P_{8,2} & \cdots & P_{8,8}
\end{bmatrix}\)
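A minimal sketch of how the \(64\) transition features could be computed from a group-reduced sequence is given below. It follows Equation 2 as we read it; the function name `transition_vector` and the ordering of the groups in `ALPHABET` are assumptions made for illustration, not details taken from the original work.

```python
from collections import Counter

# Assumed ordering of the eight group letters; the paper does not fix one.
ALPHABET = "DRYAPCSN"


def transition_vector(reduced):
    """Return (P_11, ..., P_88) for a sequence over the 8-letter alphabet."""
    n = Counter(reduced)                        # n_i: occurrences of each group
    pairs = Counter(zip(reduced, reduced[1:]))  # n_ij: adjacent-pair counts
    last = reduced[-1] if reduced else None     # S_N: last character
    vec = []
    for a in ALPHABET:
        # The last character has no successor, so it contributes one
        # occurrence fewer to the denominator (second case of Equation 2).
        denom = n[a] - 1 if a == last else n[a]
        for b in ALPHABET:
            vec.append(pairs[(a, b)] / denom if denom > 0 else 0.0)
    return vec


v = transition_vector("AACDNNSCPRDRCSNPSSARARSY")
rows = [v[8 * i: 8 * i + 8] for i in range(8)]
# Each row is a probability distribution, or all zeros for a group
# that never has a successor.
print([round(sum(row), 6) for row in rows])
```

With this normalization every non-empty row of the matrix sums to one, which is exactly the property expressed by the row-sum identity for \(n_{ij}\).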
Construction of 8-D Content Ratio Vector based on the
physiochemical properties of amino
acids
The primary structure of a protein consists of \(20\) amino acids.
Considering a protein sequence as composed of the \(8\) amino acid
groups defined in Equation 1, the content ratio \(C_{i}\),
\(1\leq i\leq 8\), of each group present in the sequence is defined as
in Equation 3, where \(n_{i}\) is the number of occurrences of the
\(i^{th}\) group and \(N\) is the sequence length.
\(C_{i}=\frac{n_{i}}{N}\quad\text{where}\quad\sum_{i=1}^{8}n_{i}=N\) (3)
Clearly, the entries of the vector \(C=(C_{1},C_{2},C_{3},\ldots,C_{8})\)
sum to \(1\). On its own, this quantity is not adequate for sequence
comparison, because sequences having the same numbers of amino acids
placed at different positions are not necessarily similar. It therefore
contributes \(8\) feature values to the 80-D feature vector alongside
the other components.
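The content ratio of Equation 3 is simple enough to sketch directly. The snippet below is illustrative only; `content_ratio` and the group ordering in `ALPHABET` are hypothetical names and choices, and the toy sequences are not taken from any dataset in this paper.

```python
# Assumed ordering of the eight group letters.
ALPHABET = "DRYAPCSN"


def content_ratio(reduced):
    """Return (C_1, ..., C_8): the fraction of the sequence in each group."""
    total = len(reduced)
    if total == 0:
        return [0.0] * len(ALPHABET)
    return [reduced.count(a) / total for a in ALPHABET]


# Two toy sequences with the same composition but a different ordering
# have identical content ratio vectors.
print(content_ratio("AADD") == content_ratio("ADAD"))
```

The toy comparison illustrates the limitation noted in the text: composition alone cannot separate sequences that differ only in the positions of their residues.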
Construction of 8-D Distribution Ratio Vector based on
the physio-chemical properties of amino
acids
The two parameters discussed above cannot define a protein sequence
uniquely. Thus a distribution parameter is introduced for each amino
acid group \(i\), \(1\leq i\leq 8\), which differentiates between
sequences even if they have the same content ratio and the same
probabilities of adjacent amino acids. This vector forms the third part
of the \(80\)-D vector. The distribution vector is defined as follows:
\(\delta_{i}=\sum_{j=1}^{\alpha_{i}}\frac{(\tau_{j}-\eta_{i})^{2}}{\alpha_{i}},\quad\text{where}\quad\eta_{i}=\frac{\beta_{i}}{\alpha_{i}}\quad\text{and}\quad\beta_{i}=\sum_{j=1}^{\alpha_{i}}\tau_{j}\) (4)
Here, \(\tau_{j}\) represents the distance of the \(j^{th}\) occurrence
of amino acid group \(i\) from the first position in the sequence. The
variables \(\alpha_{i}\) and \(\beta_{i}\) are the count and the sum of
the positions of the \(i^{th}\) amino acid group in the protein
sequence, respectively, and \(\eta_{i}\) is the ratio of the two, that
is, the mean position of the group. The sum \(\beta_{i}\) alone can be
the same for dissimilar protein sequences. For example, suppose an amino
acid group occurs at the \(4^{th}\) and \(6^{th}\) positions in one
sequence and at the \(3^{rd}\) and \(7^{th}\) positions in another. In
both cases \(\beta_{i}=10\) and \(\eta_{i}=5\), although the group
occupies different places; Equation 4 gives \(\delta_{i}=1\) for the
first sequence and \(\delta_{i}=4\) for the second, so the distribution
parameter distinguishes them. We carry out several experiments to
validate the accuracy of the proposed method in the following sections.
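The distribution ratio of Equation 4 can be sketched as the variance of the positions at which each group occurs. In the snippet below, `distribution_ratio` and `ALPHABET` are hypothetical names, and the use of 1-based positions for \(\tau_{j}\) is an assumption; because a variance is unchanged by a constant shift, measuring \(\tau_{j}\) from position 0 instead would give the same \(\delta_{i}\).

```python
# Assumed ordering of the eight group letters.
ALPHABET = "DRYAPCSN"


def distribution_ratio(reduced):
    """Return (delta_1, ..., delta_8) following Equation 4."""
    out = []
    for a in ALPHABET:
        # tau: 1-based positions of every occurrence of group a (assumption).
        tau = [k for k, ch in enumerate(reduced, start=1) if ch == a]
        alpha = len(tau)                # count of the group
        if alpha == 0:
            out.append(0.0)             # group absent: no spread to measure
            continue
        eta = sum(tau) / alpha          # beta / alpha, the mean position
        out.append(sum((t - eta) ** 2 for t in tau) / alpha)
    return out


d_index = ALPHABET.index("D")
# D at positions 4 and 6 versus positions 3 and 7: same sum, different spread.
print(distribution_ratio("AAADAD")[d_index])
print(distribution_ratio("AADAAAD")[d_index])
```

The two toy sequences place group D at positions 4 and 6, and at positions 3 and 7, respectively, mirroring the example in the text.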
Analyzing Protein
Sequences
In the proposed method, we apply the Euclidean distance to compute the
distance between the feature vectors of protein sequences. The Euclidean
distance is one of the simplest and most effective measures, and has
been used in many fields, such as gene identification [23] and tertiary
protein structure comparison and construction [47]. However, there are
many other methods used for protein sequence comparison [9]. Let \(A\)
and \(B\) be two protein sequences, and let \(V(A)\) and \(V(B)\) be
their corresponding \(80\)-D vectors. The Euclidean distance between
these two vectors is defined as:
\(d(A,B)=\sqrt{\sum_{i=1}^{80}(V(A)[i]-V(B)[i])^{2}}\) (5)
where \(V(A)[i]\) and \(V(B)[i]\) represent the \(i^{th}\) entries of
the vectors \(V(A)\) and \(V(B)\). A smaller distance indicates that the
two sequences are closer.
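Equation 5 and the pairwise distance matrix built from it can be sketched as follows. The helper names `euclidean` and `distance_matrix` are hypothetical, and the short vectors in the usage lines are toy values standing in for real \(80\)-D feature vectors.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def distance_matrix(vectors):
    """Pairwise distances for a dict mapping a label to its feature vector."""
    names = list(vectors)
    return {a: {b: euclidean(vectors[a], vectors[b]) for b in names}
            for a in names}


# Toy 2-D vectors used purely for illustration.
toy = {"X": [0.0, 0.0], "Y": [3.0, 4.0], "Z": [0.0, 1.0]}
dm = distance_matrix(toy)
print(dm["X"]["Y"], dm["X"]["Z"])
```

The resulting matrix is symmetric with a zero diagonal, which is the form expected by standard tree-building software.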
Experimental result and
discussions
Data set used with
specification
In this article, the proposed methodology is tested on four datasets.
The first consists of the \(\text{ND}5\) proteins of nine different
species: Human \((\mathcal{H})\), P-Chimpanzee \((\mathcal{\text{PC}})\),
C-Chimpanzee \((\mathcal{\text{CC}})\), Gorilla \((\mathcal{G})\), Fin
Whale \((\mathcal{F}\mathcal{W})\), Blue Whale
\((\mathcal{B}\mathcal{W})\), Rat \((\mathcal{R})\), Mouse
\((\mathcal{M})\), and Opossum \((\mathcal{O})\), which are listed in
Table 2. These sequences have lengths between \(602\) and \(610\) amino
acid residues. The second is the NADH dehydrogenase subunit 6
(\(\text{ND}6\)) protein family, including Human \((\mathcal{H})\),
Chimpanzee \((\mathcal{C})\), Gorilla \((\mathcal{G})\), Wallaroo
\((\mathcal{W})\), Harbor Seal \((\mathcal{H}\mathcal{S})\), Gray Seal
\((\mathcal{\text{GS}})\), Rat \((\mathcal{R})\) and Mouse
\((\mathcal{M})\), taken from GenBank (www.ncbi.nlm.nih.gov) for
similarity analysis. These datasets are standard benchmark data used for
validating computational procedures for sequence similarity analysis,
and have been used before in other approaches [59, 21, 28, 61]. Further,
two more datasets are considered to validate the proposed method: the
\(F10\) glycoside hydrolase family with NCBI accession IDs O59859,
P56588, P33559, Q00177, P07986, P07528, P40943, P23556, P45703, and
Q60041, and the \(G11\) glycoside hydrolase family with NCBI IDs P33557,
P55328, P55331, P45705, P26220, P55334, Q06562, P55332, P55333, and
P17137.
Analysis of similarity between nine different proteins of
ND5
To illustrate the proposed method, we analyze the similarity among all
the species of the \(\text{ND}5\) dataset. We calculated the Euclidean
distance between all nine \(\text{ND}5\) protein sequences, as shown in
Table 3. The data were collected from GenBank (www.ncbi.nlm.nih.gov),
namely: Human (Homo sapiens, AP_000649), Gorilla (Gorilla gorilla,
NP_008222), Common Chimpanzee (Pan troglodytes, NP_008196), Pygmy
Chimpanzee (Pan paniscus, NP_008209), Fin Whale (Balaenoptera physalus,
NP_006899), Blue Whale (Balaenoptera musculus, NP_007066), Rat (Rattus
norvegicus, AP_004902), Mouse (Mus musculus, NP_904338), and Opossum
(Didelphis virginiana, NP_007105), as shown in Table 2. From Table 3, we
observe that the Euclidean distances among \(\mathcal{\text{CC}}\),
\(\mathcal{\text{PC}}\), \(\mathcal{H}\) and \(\mathcal{G}\) are quite
small in comparison to other species in the same family, so these four
species are more similar to each other. The distance between
\(\mathcal{F}\mathcal{W}\) and \(\mathcal{B}\mathcal{W}\) is also small,
indicating that they are similar to each other. The small distance
between \(\mathcal{R}\) and \(\mathcal{M}\) likewise indicates the
evolutionary closeness between them, whereas the Opossum has a large
distance from the other species, which indicates a comparatively large
evolutionary distance between them. With these distances as input, a
phylogenetic tree has been constructed, as shown in Figure 1. The branch
lengths of the tree represent the lineages, but our focus is on the
close relatedness among different species. A comparison with other
approaches shows that our result is consistent with the known
evolutionary and biological history.
Analysis of
similarity between eight different proteins of ND6
To examine the proposed method further, the protein sequences of
\(\text{ND}6\) (NADH dehydrogenase subunit 6) have been considered. The
accession numbers of the species are: Human (YP_003024037.1), Chimpanzee
(NP_008197), Wallaroo (NP_007405), Gorilla (NP_008223), Harbor Seal
(NP_006939), Rat (AP_004903), Mouse (NP_904339), and Gray Seal
(NP_007080); the naming convention is shown in Table 4. We then
calculated the distance matrix of this set of sequences, as shown in
Table 5. From this distance matrix we observe that \(\mathcal{H}\),
\(\mathcal{C}\) and \(\mathcal{G}\) are closely related evolutionarily.
The distance between \(\mathcal{H}\mathcal{S}\) and
\(\mathcal{\text{GS}}\) is also very small; that is, they are very
similar to each other compared to the other species in this family. The
corresponding phylogenetic tree has been constructed and is shown in
Figure 2. The phylogenetic tree obtained using this distance matrix is
found to be accurate with respect to the biological and evolutionary
relationships. However, it is not quite sensible to say that
\(\mathcal{W}\) is evolutionarily close to \(\mathcal{H}\),
\(\mathcal{C}\) and \(\mathcal{G}\). This may be because of the loss of
some physio-chemical properties as well as biological information.
In addition to the above two families, we included two other protein
families in our experiment, \(F10\) and \(G11\) of the xylanases,
belonging to glycoside hydrolase families \(10\) and \(11\)
respectively, to examine the usefulness of our method. Specifically, the
\(F10\) dataset contains ten xylanases with NCBI accession IDs O59859,
P56588, P33559, Q00177, P07986, P07528, P40943, P23556, P45703 and
Q60041. The \(G11\) dataset also consists of ten xylanases, with NCBI
IDs P33557, P55328, P55331, P45705, P26220, P55334, Q06562, P55332,
P55333 and P17137. As for \(\text{ND}5\) and \(\text{ND}6\), the
Euclidean distances of the \(G11\) and \(F10\) sequences are computed,
as shown in Table 6 and Table 7. From Table 6, the sequences with NCBI
IDs P55332, P55333, P45705 and P17137 are more similar to each other
than to the others in the same family, indicating that they are
evolutionarily closer. The corresponding phylogenetic trees, generated
from the distance matrices, are shown in Figure 3 and Figure 4 for
\(G11\) and \(F10\) respectively. The phylogenetic trees show a
consistent biological and evolutionary relationship among all the
species of the \(G11\) and \(F10\) families.
Comparison of the proposed method with other existing methods
ClustalW is considered to be one of the most useful sequence alignment
tools for protein and DNA sequence analysis [57]. We compared the
ClustalW multiple sequence alignment results with the results of our
proposed method, both in the form of distance matrices. To examine the
linear correlation between the proposed method and ClustalW, parametric
correlation analysis has been used. The greater the correlation
coefficient between two sets of distances, the stronger the linear
correlation. For the \(\text{ND}5\) dataset, the ClustalW results are
listed in Table 8. On comparing these results with those in Table 3, it
has been found that the biological and evolutionary relationships listed
above are in accordance with the known phylogeny.
The correlation coefficient measures the strength of the linear
relationship between two vectors. It is defined as the ratio of the
covariance of the two variables to the product of their standard
deviations. To quantify the relationship between our method and
ClustalW, the correlation coefficient has been calculated between
corresponding rows of Table 3 and Table 8. For the first row of the
distance matrix in Table 3 and of the ClustalW matrix (Table 8), the
correlation coefficient is found to be \(0.91367\). The correlation
coefficients for all other rows have been computed similarly and are
listed in Table 9 and shown in Figure 5.
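The row-wise correlation described above can be sketched with the standard Pearson formula. The function name `pearson` is hypothetical, and the two rows in the usage lines are made-up toy values, not entries from Table 3 or Table 8.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance over the product of the deviations."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length vectors with at least 2 entries")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # The common 1/n factors of covariance and deviations cancel out.
    return cov / (sx * sy)


# Toy rows standing in for one row of each distance matrix.
row_ours = [0.0, 0.12, 0.15, 0.40]
row_clustalw = [0.0, 0.10, 0.18, 0.35]
print(round(pearson(row_ours, row_clustalw), 4))
```

Applying the function to each pair of corresponding rows yields one coefficient per species, which is the quantity tabulated for the comparison with ClustalW.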
Let \(P\) and \(Q\) be two variables. \(P\) and \(Q\) are said to be
linearly correlated if the correlation coefficient \(r\) satisfies
\(r_{0.05}(n-2)<|r|\leq r_{0.01}(n-2)\), and strongly linearly
correlated if \(|r|>r_{0.01}(n-2)\). For the \(\text{ND}5\) dataset,
\(n=9\), so \(P\) and \(Q\) are linearly correlated when
\(0.666<|r|\leq 0.798\) and strongly correlated when \(|r|>0.798\). As a
result, all species except Fin Whale are linearly correlated. Because
our sample size \(n=9\) is small, high correlation coefficients could
arise by chance. To validate our results, we therefore carried out a
significance analysis to check the strength of the correlation between
the two sets. This analysis has been conducted through a \(t\)-test for
correlation coefficients greater than \(0.7\). The significance level
\(\alpha\) considered here is \(0.05\), and the corresponding critical
\(t\)-value is \(2.365\). In Table 2, we have considered only those
\(t\)-values whose corresponding \(r\) values are greater than \(0.7\).
On the basis of our computed results, the \(r\) values do not occur by
chance, as all \(t\)-values are greater than \(2.365\). All nine
\(t\)-values satisfy \(t>2.365\) in our method, while only \(3\), \(4\),
\(7\), \(6\) and \(5\) \(t\)-values do so in the other methods [29, 62,
42, 4, 11] respectively.
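The significance check can be sketched with the usual \(t\)-statistic for a correlation coefficient, \(t=r\sqrt{n-2}/\sqrt{1-r^{2}}\). The text does not state the formula explicitly, so its use here is an assumption; the function names `t_statistic` and `critical_r` are hypothetical, and the critical \(t\)-value is passed in by hand rather than looked up from a distribution.

```python
import math


def t_statistic(r, n):
    """t-value of a correlation coefficient r from n paired observations."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


def critical_r(t_crit, df):
    """Correlation threshold implied by a critical t-value with df = n - 2."""
    return t_crit / math.sqrt(t_crit ** 2 + df)


# First-row coefficient reported for ND5 (n = 9), compared with t = 2.365.
t = t_statistic(0.91367, 9)
print(t > 2.365)

# The critical t-value at alpha = 0.05 corresponds to the threshold r_0.05(7).
print(round(critical_r(2.365, 7), 3))
```

Under this assumption the critical \(t\)-value of \(2.365\) with \(n-2=7\) degrees of freedom corresponds to the threshold \(0.666\) quoted above, so the two criteria are consistent with each other.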
Similarly, for the \(\text{ND}6\) dataset, the ClustalW results are
listed in Table 11. On comparing these data with those of Table 5, it
has been found that the biological and evolutionary relationships found
by the method above are in accordance with the known phylogeny. To
quantify the relationship between our method and ClustalW, the
correlation coefficient has been calculated between corresponding rows
of the distance matrix in Table 5 and of the ClustalW matrix (Table 11);
the results are listed in Table 12 and shown in Figure 6.
Two variables \(P\) and \(Q\) are said to be linearly correlated if the
correlation coefficient \(r\) satisfies
\(r_{0.05}(n-2)<|r|\leq r_{0.01}(n-2)\), and strongly linearly
correlated if \(|r|>r_{0.01}(n-2)\). For the ND6 dataset, \(n=8\), so
\(P\) and \(Q\) are linearly correlated when \(0.707<|r|\leq 0.834\) and
strongly correlated when \(|r|>0.834\). On the basis of our results, all
species of \(\text{ND}6\) are linearly correlated except Harbor Seal,
Gray Seal and \(\mathcal{R}\). Comparing our results with ClustalW, we
found a strong correlation between two species. Because the sample size
\(n=8\) is small, high correlation coefficients could arise by chance,
so the significance of the correlation coefficients was examined through
a \(t\)-test. The significance level \(\alpha\) considered here is
\(0.05\), and the corresponding critical \(t\)-value is \(2.45\). In
Table 12, we have considered only those \(t\)-values whose corresponding
\(r\) values are greater than \(0.707\). On the basis of our computed
results, the \(r\) values do not occur by chance, as all \(t\)-values
are greater than \(2.45\). Seven \(t\)-values satisfy \(t>2.45\) in our
method, while only two do so in the other methods [34], [13].
The time complexity of the proposed method is \(O(n^{2})\), whereas
multiple sequence alignment is known to be an NP-hard problem. The space
required by our method is also reduced, as it does not store \(x\) and
\(y\) coordinates for the amino acids, the number of which would equal
the sequence length.
Conclusions
The results presented in the previous sections show that the
characterization of amino acids based on their physio-chemical
properties can be considered a significant scheme for similarity
analysis of protein sequences. An \(80\)-dimensional feature vector has
been devised on the basis of the distribution of the physio-chemical
properties of amino acids. This combination is a great advantage for
understanding the similarity among sequences of different families. The
results obtained for nine species of \(\text{ND}5\) and eight species of
\(\text{ND}6\) show that our method is simple, convenient, intuitive and
computationally less intensive. We also observed that the phylogenetic
trees obtained by this method reflect the biological and evolutionary
relationships among the species. Further, our method was tested on the
\(F10\) and \(G11\) datasets of ten species each, for which it produces
an appropriate phylogeny. We believe that the novel features and the
results reported in this article will be useful to biologists working on
similar problems related to DNA and RNA sequences.