*Correspondence:
Hari Mohan Pandey
Department of Computer Science, Edge Hill University, Ormskirk, Lancashire, UK.
Pandeyh@edgehill.ac.uk
Contact: +447414981569

Investigation of protein sequence similarity based on physio-chemical properties of amino acids

Abstract: Comparing protein sequences for similarity is an important task because it exposes the evolutionary relationships among them. Effective computational algorithms are therefore required to compare the very large number of available sequences. The aim of this research is to develop efficient tools for protein sequence comparison and phylogenetic study. The proposed method performs feature generation based on the physio-chemical properties of amino acids, which describe the evolutionary relationship among the species in a protein family. Each protein sequence is transformed into an 80-dimensional feature vector defined over groups of amino acids. Four different datasets were used to validate the accuracy of the proposal, and a correlation coefficient of \(0.94417\) with ClustalW was obtained on the ND5 dataset, which is higher than that of some existing methods. The results demonstrate the effectiveness of the method for similarity analysis among genome sequences.
Keywords: Sequence similarity, amino acids, Physio-chemical property, Markov Chain transition matrix.

Introduction

Research in computational biology has grown significantly over the last decades. It has produced rich data on protein sequences, structures and gene expression, and has aided the prediction and analysis of DNA. With this massive generation of protein sequences, it is prudent to develop efficient tools for phylogenetic study, and in particular effective computational algorithms that can compare the very large number of sequences. Studies show that some protein sequences do not possess noteworthy sequence alignment similarities, which is a great impediment to sequence comparison and analysis [14]. This may be caused by unequal sequence lengths, inversion, transposition and translocation at the sub-string level [45]. Alignment-free methods are therefore a more practical and cost-effective approach, as they rely on feature vectors to identify attributes. These methods are realized in two steps. First, fixed-length feature vectors are derived from the protein sequences; second, these vectors are provided as input to similarity comparison algorithms. There are various approaches for creating useful datasets, such as predicting the transcriptional activity of multiple-site p53 mutants [31], drug-target interaction networks [30], HIV cleavage sites in proteins [15], body fluids [33], antimicrobial peptides [54], colorectal cancer related genes [39], S-nitrosylation modification sites and protein subcellular locations [40], and many more that prove very useful in phylogenetic study.
Sequence alignment is a complex process, and numerous methods have been proposed to reduce its complexity and provide reliable results. One such method for sequence comparison is graphical representation [20], which depicts protein sequences together with mathematical descriptors that help in recognizing the similarities between them. Numerical characterization [44] of protein sequences captures the essentials of their amino acid composition: each sequence is mapped to a distance frequency matrix, and a similarity score is computed using a suitable distance measure. With the k-string dictionary [60], protein sequences can be represented by comparatively low-dimensional frequency vectors at low cost. This representation is then refined by singular value decomposition, which provides a more precise vector representation of the protein sequences with the help of a tree. Fuzzy integral [49] methods for similarity comparison assign similarity scores within the closed interval [0,1] to two selected sequences. A protein sequence is composed of 20 amino acids and can be described using a transition probability matrix, fuzzy measures and fuzzy integrals. A distance matrix is derived from the fuzzy integral similarities, and a phylogenetic tree is constructed from it. Chou's pseudo amino acid composition [25] can also be utilized for alignment-free similarity comparison. A 60-dimensional feature vector is derived from the proportion of amino acids, the distance between the first and every other amino acid, and the arrangement of the amino acids, and the phylogenetic tree is constructed from these vectors. This proved economical in terms of space and time complexity compared with other alignment-free methods. Pseudo-Markov transition probabilities [43] among the 20 amino acids can also be used to derive similarities among protein sequences.
This method encodes the protein sequence into a \(440\)-D (dimensional) feature vector, which comprises a \(400\)-D pseudo-Markov transition probability vector among the \(20\) amino acids, a \(20\)-D content ratio vector and a \(20\)-D position ratio vector of amino acids in the sequence. The protein sequences are compared by calculating the Euclidean distance between these vectors. Deploying Markov chain parameters [52] is also an effective method for similarity comparison. The Markov chain parameters are evaluated from the frequencies of occurrence of all possible pairs of amino acids in each gene sequence, without alignment. These features are used to derive the similarity between two gene sequences with a fuzzy integral algorithm, which has the advantage of better clustering performance for gene sequence comparison. The \(H\)-curve [26] is effective for analyzing local as well as global features of long sequences. The information in a nucleotide sequence is mapped from the four-letter alphabet into a 3D space curve called the H-curve, which also provides integral information about the DNA. Alternating word frequency and normalized Lempel-Ziv complexity [65] also promise cost-effective sequence similarity analysis. Sequence alignments can also be approximated to analyze similarities through the maximal segment pair score [1]. This method is robust and useful for searching long protein sequence databases, motif searches and gene identification searches. Protein fold prediction methods [56] use classifiers to derive evolutionary knowledge from PSI-BLAST and PSI-PRED profiles and give a comprehensive feature set. This information is crucial for protein function analysis and structure prediction.
A random forest (RF) classifier [55] applied to a feature set can be used for both sequence and structure prediction using three large datasets. Position-based features [16] are also considered an efficient and reliable means of describing the distribution of amino acids. Studies show that evolutionary information provides an efficient basis for protein sequence analysis [19, 38, 58, 64]. Methods of phylogenetic analysis contribute to function prediction and can be used to identify similarities between life forms by analyzing their medicinal qualities [35].
Alignment-free sequence analysis methods are very effective in the phylogenetic classification of protein sequences, in recognizing horizontal gene transfer and in discovering recombined sequences. They are also computationally economical, as they usually have linear complexity that depends on the length of the query sequence [8]. In contrast to alignment-based methods, they do not rely on assumptions about the evolutionary trajectories of sequence changes, and they are mathematically justified through linear algebra and information theory. Most of these methods can easily be applied using standard tree-building software [22, 37]. Alignment-free algorithms are expanding their applications in phylogenomics and horizontal gene transfer [6], population genetics [27] and the relations between genome and epigenome [48]. Although these methods have evolved and improved their performance over the last decades [7], there is still a shortage of effective benchmarking approaches for alignment-free similarity analysis [68]. The volume of available sample datasets [18] is outpacing the storage and processing capacities of the computers used for research today. Alignment-free methods are increasingly used in primary next-generation sequencing applications [5, 53, 67, 50] and can efficiently derive biological information from raw next-generation data.
Alignment-free techniques also have limitations; for example, it is difficult to assign a protein sequence to a specific cluster on the basis of its properties. Nevertheless, they are well suited to pattern recognition against known protein clusters, and compared with alignment-based techniques they have greater potential in a number of bioinformatics applications. In this paper, we address these problems by introducing three quantitative methods based on the physio-chemical properties of amino acids. Based on these methods, a feature vector consisting of \(80\) features is generated for each species. The Euclidean distance is used to measure the distance between two feature vectors \(P\) and \(S\), from which the proximity among the species is observed and reported. Our proposed technique is more precise than several existing techniques for comparison analysis on the ND5 and ND6 datasets, and the phylogenetic trees obtained using this method are found to be accurate on the F10 and G11 datasets.
In brief, the contributions of this research work are summarized as follows:
  1. Characterization of amino acids based on their chemical properties: Among the wide range of physio-chemical properties of amino acids, the side chain effect plays an important role in the formation of the tertiary structure of proteins. According to the side chain effect, the amino acids are classified into eight different groups, so that each primary protein sequence is rendered into a group sequence for further analysis.
  2. Procedure for obtaining a feature vector based on the Markov chain transition matrix: Feature vectors are identified from the transition probabilities among the amino acid groups. A procedure is devised to generate a feature vector of \(64\) features based on the physio-chemical properties of amino acids. The ND5 and ND6 protein families and ten species each from the G11 and F10 protein families have been studied in this research.
  3. Procedure for obtaining feature vectors based on the content ratio and distribution ratio of amino acid groups: Based on the physio-chemical properties of the amino acid groups, an eight-dimensional content ratio vector giving the frequency of each group and another eight-dimensional vector describing the position distribution of each group have been devised. Further, we study the content ratio and distribution ratio among the eight groups of amino acids of the ND5, ND6, G11 and F10 protein families. These points are elaborated in the following section.
The rest of this article is arranged as follows. Section 2 defines the fundamental parameters and describes the employed method. Section 3 presents the experimental results and discusses the usefulness of the proposed method. Section 4 concludes the paper by highlighting the key findings of the analysis.

Proposed Methods and Materials

In this section, three methods are proposed to analyze primary protein sequences in detail on the basis of the Markov chain transition matrix, the content ratio and the distribution of chemical groups of amino acids. For the sake of clarity, these methods are explained on primary sequences obtained from different gene families.

Characterization of amino acids based on physio-chemical properties

The structure of a protein depends largely on the physio-chemical properties of its amino acids. These properties provide information about the coding region of the gene sequence [32] and about the function of the gene coded by that region [2]. Methods of similarity analysis of DNA and protein sequences serve as effective tools in phylogenetics research [46]. The primary protein sequence consists of twenty amino acids represented by the characters shown in Table 1. They play an important role in determining the three-dimensional structure of proteins, and hence biological processes depend on the physio-chemical properties of amino acids [3, 10, 12, 51]. According to the side chain effect, the twenty amino acids can be classified into eight different groups, as shown in Table 1 and discussed in [17]. The distribution of these eight types of amino acids describes the protein primary structure. To construct the feature vector of the protein primary structure, the classification of amino acids is defined in Equation 1.
\(P(S(i))=\begin{cases}D&\text{if }S(i)\in\{D,E\}\\ R&\text{if }S(i)\in\{R,H,K\}\\ Y&\text{if }S(i)\in\{F,Y,W\}\\ A&\text{if }S(i)\in\{I,L,V,A,G\}\\ P&\text{if }S(i)\in\{P\}\\ C&\text{if }S(i)\in\{M,C\}\\ S&\text{if }S(i)\in\{S,T\}\\ N&\text{if }S(i)\in\{Q,N\}\end{cases}\) (1)
where \(S(i)\) represents the \(i^{th}\) character in the given protein sequence and \(P(S(i))\) represents the corresponding substitute for amino acid \(S(i)\). For example, for the protein sequence \(S=\text{AGMEQQTMPHERCSNPTTGHIRTF}\), the feature sequence is \(P(S)=\text{AACDNNSCPRDRCSNPSSARARSY}\). The composition and distribution of amino acids are two key features of protein sequences. They have been used in different areas such as protein similarity, classification on the basis of structure [63, 66, 36] or chemical composition, and identification of patterns among protein sequences. There are mainly two ways of representing protein sequences, namely discrete and sequential [41], and each has its own shortcomings. The sequential representation fails when a protein does not have much sequence similarity to known protein sequences, while the main drawback of the discrete representation is the loss of ordering. Thus, a new representation is proposed: an \(80\)-D vector that takes both kinds of feature into account.
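As an illustration, the substitution of Equation 1 can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation: the names `GROUPS`, `RESIDUE_TO_GROUP` and `to_group_sequence` are hypothetical, and the input is assumed to contain only the twenty standard one-letter amino acid codes.

```python
# Grouping taken from Equation 1: representative letter -> members of the group.
# All names below are hypothetical and chosen only for this illustration.
GROUPS = {
    "D": "DE",
    "R": "RHK",
    "Y": "FYW",
    "A": "ILVAG",
    "P": "P",
    "C": "MC",
    "S": "ST",
    "N": "QN",
}

# Invert the table so that each residue maps to its group representative.
RESIDUE_TO_GROUP = {
    residue: rep for rep, members in GROUPS.items() for residue in members
}


def to_group_sequence(sequence):
    """Replace every residue by the representative letter of its group."""
    return "".join(RESIDUE_TO_GROUP[residue] for residue in sequence.upper())


if __name__ == "__main__":
    example = "AGMEQQTMPHERCSNPTTGHIRTF"
    print(to_group_sequence(example))  # AACDNNSCPRDRCSNPSSARARSY
```

Ambiguous codes such as X or B are not covered by Equation 1, so the sketch raises a `KeyError` for them instead of guessing a group.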

Construction of 64-D vector using Markov chain transition matrix

Let \(S\) be a sequence of length \(N\), with \(S(k)\) denoting its \(k^{th}\) character, over \(A=\{A_{1},A_{2},\ldots,A_{8}\}\), the set of \(8\) letters representing the amino acid groups defined in Equation 1. For \(1\leq i\leq 8\), group \(A_{i}\) is said to appear at position \(k\) in the sequence \(S\) if \(S(k)=A_{i}\), and for \(1\leq j\leq 8\), the pair \(A_{i}A_{j}\) is said to occur at adjacent positions \(k,k+1\) in \(S\) if \(S(k)S(k+1)=A_{i}A_{j}\). Correspondingly, the \(64\)-dimensional vector is defined as \((P_{1,1},P_{1,2},\ldots,P_{8,8})\), where \(P\) is an \((8\times 8)\) matrix with elements \(\{P_{i,j}:i,j=1,2,\ldots,8\}\). A random process \((X_{0},X_{1},\ldots)\) with finite state space \(\{s_{1},\ldots,s_{8}\}\) is said to be a Markov chain with transition matrix \(P\) [24] if, for all \(n\), all \(i,j\in\{1,\ldots,8\}\) and all \(i_{0},\ldots,i_{n-1}\in\{1,\ldots,8\}\), \(P(X_{n+1}=s_{j}\mid X_{0}=s_{i_{0}},\ldots,X_{n-1}=s_{i_{n-1}},X_{n}=s_{i})=P(X_{n+1}=s_{j}\mid X_{n}=s_{i})=P_{i,j}\).
The elements of the matrix \(P\) are called transition probabilities. The element \(P_{i,j}\) is the conditional probability of moving to state \(s_{j}\) given that the current state is \(s_{i}\), where \(i,j\in\{1,\ldots,8\}\). For the eight groups of amino acids it is defined in Equation 2.
\(P_{ij}=\begin{cases}\frac{n_{ij}}{n_{i}}&\text{if }A_{i}\neq S_{N}\\ \frac{n_{ij}}{n_{i}-1}&\text{if }A_{i}=S_{N}\end{cases}\) (2)
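Here, \(n_{i}\) denotes the number of occurrences of group \(A_{i}\) in the sequence, \(n_{ij}\) the number of times \(A_{i}\) is immediately followed by \(A_{j}\), and \(S_{N}\) the last character of the sequence. We set \(P_{ij}=0\) if \(n_{i}=0\), or if \(A_{i}\) occurs only once and appears at the end of the sequence. The pair counts satisfy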
\(\sum_{j=1}^{8}n_{ij}=\begin{cases}n_{i}&\text{if }A_{i}\neq S_{N}\\ n_{i}-1&\text{if }A_{i}=S_{N}\end{cases}\)
In the context of an amino acid sequence, the Markov chain transition probability matrix can be expressed as:
\(P=\begin{bmatrix}P_{1,1}&P_{1,2}&\cdots&P_{1,8}\\ P_{2,1}&P_{2,2}&\cdots&P_{2,8}\\ P_{3,1}&P_{3,2}&\cdots&P_{3,8}\\ \vdots&\vdots&\ddots&\vdots\\ P_{8,1}&P_{8,2}&\cdots&P_{8,8}\end{bmatrix}\)
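The estimation of the transition matrix from a group sequence can be sketched as below. This is only a sketch under stated assumptions: the sequence is assumed to be already reduced to the eight group letters of Equation 1, the row and column order `DRYAPCSN` is an arbitrary choice (the text does not fix one), and the function names are hypothetical. Dividing each row of pair counts by its row total covers both cases of Equation 2, because the row total equals \(n_{i}\) in general and \(n_{i}-1\) for the group that ends the sequence.

```python
# Assumed ordering of the eight group letters (not fixed by the text).
ALPHABET = "DRYAPCSN"


def transition_matrix(group_sequence):
    """Estimate the 8x8 transition probabilities from adjacent group pairs."""
    index = {letter: k for k, letter in enumerate(ALPHABET)}
    counts = [[0] * len(ALPHABET) for _ in ALPHABET]
    # n_ij: number of times group i is immediately followed by group j.
    for current, following in zip(group_sequence, group_sequence[1:]):
        counts[index[current]][index[following]] += 1
    matrix = []
    for row in counts:
        total = sum(row)  # n_i, or n_i - 1 if this group ends the sequence
        # Rows without any outgoing pair are set to zero, as in the text.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def markov_features(group_sequence):
    """Flatten the matrix row by row into the 64-dimensional feature vector."""
    return [p for row in transition_matrix(group_sequence) for p in row]


if __name__ == "__main__":
    features = markov_features("AACDNNSCPRDRCSNPSSARARSY")
    print(len(features))  # 64
```

In the example sequence the group Y occurs once, at the last position, so its row consists of zeros; every other row sums to one.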

Construction of 8-D Content Ratio Vector based on the physio-chemical properties of amino acids

The primary structure of a protein sequence consists of \(20\) amino acids. Considering the sequence as composed of the \(8\) amino acid groups defined in Equation 1, the content ratio \(C_{i}\) of each group, \(1\leq i\leq 8\), is defined in Equation 3.
\(C_{i}=\frac{n_{i}}{N},\quad\text{where}\quad\sum_{i=1}^{8}n_{i}=N\) (3)
The entries of the vector \(C=(C_{1},C_{2},C_{3},\ldots,C_{8})\) clearly sum to \(1\). On its own, the content ratio is not an adequate parameter for sequence comparison, because sequences containing the same numbers of amino acids placed at different positions are not necessarily similar. It is therefore combined with the other features and contributes \(8\) feature values to the 80-D feature vector.
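A short sketch of the content ratio, and of why it cannot separate sequences on its own, is given below. The function name and the group order `DRYAPCSN` are assumptions made for illustration, and the input is assumed to be a non-empty sequence of group letters.

```python
from collections import Counter

# Assumed ordering of the eight group letters (not fixed by the text).
ALPHABET = "DRYAPCSN"


def content_ratio(group_sequence):
    """Return the 8-D vector of group frequencies (count divided by length)."""
    counts = Counter(group_sequence)
    length = len(group_sequence)  # assumed to be non-zero
    return [counts[letter] / length for letter in ALPHABET]


if __name__ == "__main__":
    # Two different arrangements of the same letters give the same vector.
    print(content_ratio("AADD") == content_ratio("ADAD"))  # True
```

The two toy sequences differ in order but share one content ratio vector, which is the reason the position-dependent features are added.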

Construction of 8-D Distribution Ratio Vector based on the physio-chemical properties of amino acids

The two parameters discussed above cannot define a protein sequence uniquely. A distribution parameter is therefore introduced for each amino acid group \(i\), \(1\leq i\leq 8\), which differentiates between sequences even if they have the same content ratio and the same transition probabilities between adjacent amino acids. This vector forms the third part of the \(80\)-D vector. The distribution vector is defined as follows:
\(\delta_{i}=\sum_{j=1}^{\alpha_{i}}\frac{(\tau_{j}-\eta_{i})^{2}}{\alpha_{i}},\quad\text{where}\quad\eta_{i}=\frac{\beta_{i}}{\alpha_{i}}\quad\text{and}\quad\beta_{i}=\sum_{j=1}^{\alpha_{i}}\tau_{j}\) (4)
Here, \(\tau_{j}\) represents the distance of the \(j^{th}\) occurrence of amino acid group \(i\) from the first position in the sequence. The variables \(\alpha_{i}\) and \(\beta_{i}\) are the count and the sum of the positions of the \(i^{th}\) amino acid group in the protein sequence, respectively, and \(\eta_{i}\) is the ratio of the sum of positions to the count, i.e. the mean position of the group. The sum \(\beta_{i}\) alone can be the same for dissimilar protein sequences. For example, an amino acid group may occur at the \(4^{th}\) and \(6^{th}\) positions in one sequence and at the \(3^{rd}\) and \(7^{th}\) positions in another. In both cases the total distance from the first position is \(10\), although the group occupies different places. The distribution parameter \(\delta_{i}\), which measures the spread of the positions around their mean, is used to distinguish such sequences. We carry out several experiments to validate the accuracy of the proposed method in the following sections.
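The example above can be checked with a small sketch of Equation 4. The sketch assumes that \(\tau_{j}\) is the 1-based position of an occurrence (measuring from the first position instead only shifts all values by one and leaves \(\delta_{i}\) unchanged) and that a group absent from the sequence contributes \(0\); the text does not state either convention, and the function name is hypothetical.

```python
# Assumed ordering of the eight group letters (not fixed by the text).
ALPHABET = "DRYAPCSN"


def distribution_ratio(group_sequence):
    """Return the 8-D vector of position variances, one entry per group."""
    result = []
    for letter in ALPHABET:
        # tau_j: 1-based positions at which this group occurs (assumption).
        positions = [
            k for k, ch in enumerate(group_sequence, start=1) if ch == letter
        ]
        alpha = len(positions)
        if alpha == 0:
            result.append(0.0)  # assumption: absent groups contribute zero
            continue
        eta = sum(positions) / alpha  # mean position, beta_i / alpha_i
        result.append(sum((tau - eta) ** 2 for tau in positions) / alpha)
    return result


if __name__ == "__main__":
    # Group D at positions 4 and 6, then at positions 3 and 7.
    print(distribution_ratio("AAADADA")[0])  # 1.0
    print(distribution_ratio("AADAAAD")[0])  # 4.0
```

Both toy sequences give a position sum of \(10\) for group D, yet the distribution values differ (\(1\) versus \(4\)), which is the behaviour the parameter is meant to provide.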

Analyzing Protein Sequences

In the proposed method, we apply the Euclidean distance to compute the distance between the feature vectors of protein sequences. The Euclidean distance is one of the simplest and most effective measures and has been used in many fields, such as gene identification [23] and tertiary protein structure comparison and construction [47]. However, many other methods are used for protein sequence comparison [9]. Let \(A\) and \(B\) be two protein sequences, and let \(V(A)\) and \(V(B)\) be their corresponding \(80\)-D feature vectors. The Euclidean distance between these two vectors is defined as:
\(d(A,B)=\sqrt{\sum_{i=1}^{80}(V(A)[i]-V(B)[i])^{2}}\) (5)
where \(V(A)[i]\) and \(V(B)[i]\) represent the \(i^{th}\) entries of the vectors \(V(A)\) and \(V(B)\). A smaller distance indicates that the two sequences are more similar.
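Equation 5 and the pairwise distance matrix built from it can be sketched as follows. The function names are hypothetical, and the short vectors in the usage example are toy values, not features of any sequence in this study.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(vectors):
    """Symmetric matrix of pairwise distances between feature vectors."""
    n = len(vectors)
    return [
        [euclidean_distance(vectors[i], vectors[j]) for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    toy_vectors = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]  # illustrative only
    for row in distance_matrix(toy_vectors):
        print(row)
```

A matrix of this form, computed on the 80-D vectors, is the kind of input from which the phylogenetic trees in the following sections are built.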
Experimental results and discussion

Data sets used with specifications

In this article, the proposed methodology is tested on four datasets. The first consists of the \(\text{ND}5\) proteins of nine different species: Human \((\mathcal{H})\), P-Chimpanzee \((\mathcal{PC})\), C-Chimpanzee \((\mathcal{CC})\), Gorilla \((\mathcal{G})\), Fin Whale \((\mathcal{FW})\), Blue Whale \((\mathcal{BW})\), Rat \((\mathcal{R})\), Mouse \((\mathcal{M})\) and Opossum \((\mathcal{O})\), which are listed in Table 2. These sequences have lengths between \(602\) and \(610\) amino acid residues. The second is the NADH dehydrogenase subunit 6 (\(\text{ND}6\)) protein family, including Human \((\mathcal{H})\), Chimpanzee \((\mathcal{C})\), Gorilla \((\mathcal{G})\), Wallaroo \((\mathcal{W})\), Harbor Seal \((\mathcal{HS})\), Gray Seal \((\mathcal{GS})\), Rat \((\mathcal{R})\) and Mouse \((\mathcal{M})\), taken from GenBank (www.ncbi.nlm.nih.gov). These datasets are standard benchmark data for validating computational procedures for sequence similarity analysis and have been used in earlier approaches [59, 21, 28, 61]. Two further datasets are also considered to validate the proposed method: the \(F10\) glycoside hydrolase family, with NCBI accession IDs O59859, P56588, P33559, Q00177, P07986, P07528, P40943, P23556, P45703 and Q60041, and the \(G11\) glycoside hydrolase family, with NCBI IDs P33557, P55328, P55331, P45705, P26220, P55334, Q06562, P55332, P55333 and P17137.

Analysis of similarity between nine different proteins of ND5

To illustrate the proposed method, we analyze the similarity among all the species of the \(\text{ND}5\) dataset. We calculated the Euclidean distance between all nine \(\text{ND}5\) protein sequences, as shown in Table 3. The data were collected from GenBank (www.ncbi.nlm.nih.gov): Human (Homo sapiens, AP_000649), Gorilla (Gorilla gorilla, NP_008222), Common Chimpanzee (Pan troglodytes, NP_008196), Pygmy Chimpanzee (Pan paniscus, NP_008209), Fin Whale (Balaenoptera physalus, NP_006899), Blue Whale (Balaenoptera musculus, NP_007066), Rat (Rattus norvegicus, AP_004902), Mouse (Mus musculus, NP_904338) and Opossum (Didelphis virginiana, NP_007105), as shown in Table 2. From Table 3, we observe that the Euclidean distances among \(\mathcal{CC}\), \(\mathcal{PC}\), \(\mathcal{H}\) and \(\mathcal{G}\) are quite small in comparison with the other species in the same family, so these four species are more similar to each other. The distance between \(\mathcal{FW}\) and \(\mathcal{BW}\) is also small, indicating that they are similar to each other, and the small distance between \(\mathcal{R}\) and \(\mathcal{M}\) indicates the evolutionary closeness between them. The Opossum, in contrast, has a large distance from the other species, which indicates a comparatively large evolutionary distance. With these distances as input, a phylogenetic tree has been constructed, as shown in Figure 1. The branch lengths of the tree represent the lineages, but our focus is on the relatedness among the species. Comparing our approach with others, we find that the result is consistent with known evolution and biological history.
Analysis of similarity between eight different proteins of ND6
To examine the proposed method further, a set of sequences from \(\text{ND}6\) (NADH dehydrogenase subunit 6 proteins) has been considered. The accession numbers of the species are: Human (YP_003024037.1), Chimpanzee (NP_008197), Wallaroo (NP_007405), Gorilla (NP_008223), Harbor Seal (NP_006939), Rat (AP_004903), Mouse (NP_904339) and Gray Seal (NP_007080); the naming convention for these sequences is shown in Table 4. We then calculated the distance matrix for this set of sequences, as shown in Table 5. According to this distance matrix, \(\mathcal{H}\), \(\mathcal{C}\) and \(\mathcal{G}\) are closely related in evolutionary terms. The distance between \(\mathcal{HS}\) and \(\mathcal{GS}\) is also very small; that is, they are very similar to each other compared with the other species in this family. The corresponding phylogenetic tree has been constructed and is shown in Figure 2. The tree obtained from this distance matrix is found to be accurate with respect to the biological and evolutionary relationships. However, it is not sensible to conclude that \(\mathcal{W}\) is evolutionarily very close to \(\mathcal{H}\), \(\mathcal{C}\) and \(\mathcal{G}\). This may be due to the loss of some physio-chemical properties as well as biological information.
In addition to the above two families, we included two other protein families, \(G11\) and \(F10\), the xylanase-containing glycoside hydrolase families \(11\) and \(10\) respectively, in our experiment to examine the usefulness of our method. Specifically, the \(F10\) dataset contains ten xylanases with NCBI accession IDs O59859, P56588, P33559, Q00177, P07986, P07528, P40943, P23556, P45703 and Q60041. The \(G11\) dataset also consists of ten xylanases, with NCBI IDs P33557, P55328, P55331, P45705, P26220, P55334, Q06562, P55332, P55333 and P17137. As for \(\text{ND}5\) and \(\text{ND}6\), the Euclidean distances of the \(G11\) and \(F10\) sequences are computed, as shown in Table 6 and Table 7. From Table 6, the sequences with NCBI IDs P55332, P55333, P45705 and P17137 are more similar to each other than to the others in the same family, as they are evolutionarily closer. The corresponding phylogenetic trees, generated from the distance matrices, are shown in Figure 3 and Figure 4 for \(G11\) and \(F10\) respectively. The phylogenetic trees show consistent biological and evolutionary relationships among all the species of the \(G11\) and \(F10\) families.

Comparison of the proposed method with other existing methods

ClustalW is considered one of the most useful sequence alignment tools for protein and DNA sequence analysis [57]. We compare the ClustalW multiple sequence alignment results with the results of our proposed method, both in the form of distance matrices. To examine the linear correlation between the proposed method and ClustalW, a parametric correlation analysis is used: the greater the correlation coefficient between two sets of distances, the stronger the linear correlation. For the \(\text{ND}5\) dataset, the ClustalW results are listed in Table 8. Comparing them with those in Table 3, we find that the biological and evolutionary relationships described above are in accordance with the known phylogeny.
The correlation coefficient measures the strength of the linear relationship between two vectors. It is defined as the ratio of the covariance of the two variables to the product of their standard deviations. To quantify the relationship between our method and ClustalW, the correlation coefficient is calculated between corresponding rows of Table 3 and Table 8. For the first row of Table 3 and the first row of the ClustalW similarity matrix (Table 8), the correlation coefficient is \(0.91367\). The coefficients for all other rows are computed in the same way and are listed in Table 9 and shown in Figure 5.
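The row-wise comparison described above can be sketched with a plain implementation of the Pearson correlation coefficient. The function name is hypothetical and the two rows are toy values, not entries of Table 3 or Table 8; whether the zero diagonal entry of each row is included in the comparison is not stated in the text and is left to the caller here.

```python
import math


def pearson_correlation(x, y):
    """Covariance of x and y divided by the product of their deviations."""
    if len(x) != len(y):
        raise ValueError("both rows must have the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # The common 1/n factors cancel, so they are omitted.
    return covariance / (spread_x * spread_y)


if __name__ == "__main__":
    # Toy rows standing in for one row of each distance matrix.
    row_proposed = [0.10, 0.20, 0.30, 0.45]
    row_clustalw = [1.0, 2.1, 2.9, 4.4]
    print(round(pearson_correlation(row_proposed, row_clustalw), 3))
```

Applying the function to each pair of corresponding rows gives one coefficient per species, which is the form in which the results are reported.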
Let \(P\) and \(Q\) be two variables, each observed \(n\) times, where \(n\) is a positive integer. \(P\) and \(Q\) are said to be linearly correlated if the correlation coefficient \(r\) satisfies \(r_{0.05}(n-2)<|r|\leq r_{0.01}(n-2)\), and strongly linearly correlated if \(|r|>r_{0.01}(n-2)\). For the \(\text{ND}5\) dataset, \(n=9\), so \(P\) and \(Q\) are linearly correlated when \(0.666<|r|\leq 0.798\) and strongly correlated when \(|r|>0.798\). By this criterion, all species are linearly correlated except Fin Whale. Because the sample size \(n=9\) is small, high correlation coefficients may arise by chance. To validate our results, we therefore carried out a significance analysis of the strength of the correlation between the two sets, using a \(t\)-test for correlation coefficients greater than \(0.7\). The significance level considered is \(\alpha=0.05\), and the corresponding critical \(t\)-value is \(2.365\). In Table 2, we have considered only those \(t\)-values whose corresponding \(r\) values are greater than \(0.7\). On the basis of our computed results, the \(r\) values are unlikely to have occurred by chance, as all \(t\)-values are greater than \(2.365\). All nine \(t\)-values satisfy \(t>2.365\) in our method, while only \(3\), \(4\), \(7\), \(6\) and \(5\) \(t\)-values do so in the other methods [29, 62, 42, 4, 11] respectively.
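The text does not spell out the test statistic. A common choice for testing a Pearson correlation, assumed in the sketch below, is \(t=r\sqrt{n-2}/\sqrt{1-r^{2}}\) with \(n-2\) degrees of freedom; the function names are hypothetical, and the critical value \(2.365\) is the one quoted above for \(n=9\).

```python
import math


def t_statistic(r, n):
    """t value for a correlation r from n paired observations (assumed form)."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


def is_significant(r, n, critical_value):
    """Compare |t| with a critical value taken from a t-table."""
    return abs(t_statistic(r, n)) > critical_value


if __name__ == "__main__":
    # Coefficient reported in the text for the first ND5 row, with n = 9.
    print(is_significant(0.91367, 9, 2.365))  # True
```

Under this assumed form, the threshold \(r=0.666\) with \(n=9\) gives a \(t\)-value close to the critical value \(2.365\), so the two criteria quoted in the text agree with each other.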
Similarly, for the \(\text{ND}6\) dataset, the ClustalW results are listed in Table 11. Comparing them with the data of Table 5, we find that the biological and evolutionary relationships obtained by the proposed method are in accordance with the known phylogeny. To quantify the relationship between our method and ClustalW, the correlation coefficient is calculated between corresponding rows of Table 5 and the ClustalW similarity/dissimilarity matrix (Table 11). The resulting coefficients are listed in Table 12 and shown in Figure 6.
As before, two variables \(P\) and \(Q\) are said to be linearly correlated if the correlation coefficient \(r\) satisfies \(r_{0.05}(n-2)<|r|\leq r_{0.01}(n-2)\), and strongly linearly correlated if \(|r|>r_{0.01}(n-2)\). For the ND6 dataset, \(n=8\), so \(P\) and \(Q\) are linearly correlated when \(0.707<|r|\leq 0.834\) and strongly correlated when \(|r|>0.834\). On the basis of our results, all species of \(\text{ND}6\) are linearly correlated except Harbor Seal \((\mathcal{HS})\), Gray Seal \((\mathcal{GS})\) and \(\mathcal{R}\). Comparing our results with ClustalW, we found a strong correlation for two species. Because the sample size \(n=8\) is small, high correlation coefficients may arise by chance, so a \(t\)-test is applied to correlation coefficients greater than \(0.707\). The significance level considered is \(\alpha=0.05\), and the corresponding critical \(t\)-value is \(2.45\). In Table 12, we have considered only those \(t\)-values whose corresponding \(r\) values are greater than \(0.707\). On the basis of our computed results, the \(r\) values are unlikely to have occurred by chance, as all \(t\)-values are greater than \(2.45\). All seven \(t\)-values satisfy \(t>2.45\) in our method, while only two \(t\)-values do so in the other methods [34], [13].
The time complexity of the proposed method is \(O(n^{2})\), whereas multiple sequence alignment is known to be an NP-hard problem. The space required by our method is also reduced, as it does not store x and y coordinates for the amino acids, whose number equals the sequence length.

Conclusions

The results reported in the previous sections show that the characterization of amino acids based on their physio-chemical properties is a useful scheme for similarity analysis of protein sequences. An \(80\)-dimensional feature vector has been devised on the basis of the distribution of the physio-chemical properties of amino acids, which is of great advantage for understanding the similarity among sequences of different families. The results obtained for the nine species of \(\text{ND}5\) and the eight species of \(\text{ND}6\) show that our method is simple, convenient, intuitive and computationally inexpensive. We also observed that the phylogenetic trees obtained by this method reflect the biological and evolutionary relationships among the species. Further, our method was tested on the \(G11\) and \(F10\) datasets of ten species each, for which it yields appropriate phylogenies. We believe that the features and results reported in this article will be useful to biologists working on similar problems related to DNA and RNA sequences.