3. Results
3.1 Mapping amino acid replacements in human missense mutations
Structural bioinformatics procedures aimed at finding signals that help explain the mechanisms of benign and pathogenic mutagenesis, and the role of single amino acids in this process, require suitable data sets. We therefore selected ClinVar [24] as the basis for our investigations, because of the large variety of weekly updated information on clinically relevant mutations that this databank offers. As of June 10, 2020, ClinVar reported 789,266 mutations, which can be directly filtered to 308,326 missense mutation items by applying the Molecular consequence options offered on the ClinVar web home page. All amino acid replacements found in this missense mutation dataset are reported in Table 1, apart from entries that did not give either the natural amino acid or the replacing one. Several of the reported replacements are not expected from a single-nucleotide change: 597 and 14 missense mutations would imply changes between amino acids whose codons differ by two or three nucleotides, respectively. Genome sequencing errors are probably mostly responsible for these findings, which will not be considered further. Furthermore, the presence of 13 self-mutations in Table 1 suggests that some additional control is needed, such as the At least one star option among the Review status filters of ClinVar.
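The distinction between replacements reachable by one, two or three nucleotide changes can be illustrated with a minimal sketch. It assumes the standard genetic code and takes, for each amino acid pair, the smallest number of differing positions over all codon pairs; the function name min_codon_distance is hypothetical and is not part of the procedure described here.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def min_codon_distance(aa_from: str, aa_to: str) -> int:
    """Smallest number of nucleotide differences between any codon of
    aa_from and any codon of aa_to (one-letter codes)."""
    codons_from = [c for c, aa in CODON_TABLE.items() if aa == aa_from]
    codons_to = [c for c, aa in CODON_TABLE.items() if aa == aa_to]
    return min(
        sum(a != b for a, b in zip(c1, c2))
        for c1 in codons_from
        for c2 in codons_to
    )


# Illustrative pairs only; these are not counts taken from Table 1.
single_step = min_codon_distance("R", "C")  # CGT -> TGT differs at one position
double_step = min_codon_distance("M", "W")  # ATG -> TGG differs at two positions
triple_step = min_codon_distance("W", "I")  # TGG vs ATT/ATC/ATA differs at all three
self_step = min_codon_distance("A", "A")    # a self-mutation has distance zero
```

Under this scheme, a replacement with distance 2 or 3 cannot arise from a single-nucleotide variant, and a distance of 0 corresponds to the self-mutations mentioned above.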
We thus obtained final sets of 25,579 BMMs and 21,595 PMMs which, despite their similar sizes, we normalized to allow a direct comparison of amino acid replacements in the two datasets. Afterward, to correlate expected and experimentally obtained mutation distributions, we compared the values of a scoring matrix for amino acid substitutions with matrix elements obtained from the difference between BMM and PMM, see Table 2. Among the large variety of PAM and BLOSUM substitution matrices, we chose BLOSUM62 [25], which is the default option for several sequence analysis procedures such as BLAST [26].
Three interesting features emerge from Table 2: a) only replacements among amino acids whose codons differ by one nucleotide are observed; b) in both datasets the most frequent missense mutations involve arginine; c) as expected, the normalized BMMs comprise 3,829 (38.29%) favorable, 3,925 (39.25%) less favorable and 2,246 (22.46%) unfavorable variants, whereas the PMMs show the opposite distribution, with 2,015 (20.15%) favorable, 3,362 (33.62%) less favorable and 4,623 (46.23%) unfavorable variants.
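The arithmetic behind these figures can be checked with a short sketch. The three category totals reported for each dataset sum to 10,000, so the sketch assumes that both datasets were rescaled to that common total; this target is inferred from the reported numbers and is not stated explicitly. The function names and the raw counts in the last example are hypothetical.

```python
# Assumed normalization target, inferred from the reported category sums.
NORMALIZED_TOTAL = 10_000


def normalize(counts: dict, total: int = NORMALIZED_TOTAL) -> dict:
    """Rescale raw counts so that they sum to a common total."""
    raw_total = sum(counts.values())
    return {key: value * total / raw_total for key, value in counts.items()}


def percentages(counts: dict) -> dict:
    """Express each count as a percentage of the sum, to two decimals."""
    total = sum(counts.values())
    return {key: round(100 * value / total, 2) for key, value in counts.items()}


# Normalized category totals as reported in the text.
bmm = {"favorable": 3829, "less favorable": 3925, "unfavorable": 2246}
pmm = {"favorable": 2015, "less favorable": 3362, "unfavorable": 4623}

bmm_percent = percentages(bmm)
pmm_percent = percentages(pmm)

# Category-wise BMM minus PMM difference; it sums to zero because both
# datasets share the same normalized total.
difference = {key: bmm[key] - pmm[key] for key in bmm}

# Hypothetical raw counts, only to show how the rescaling behaves.
example = normalize({"R>C": 30, "R>H": 10})
```

With the reported values, the difference is positive for favorable and less favorable variants and negative for unfavorable ones, which is the opposite trend described in point c).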
3.2 Mapping amino acid replacements in benign and pathogenic missense mutations
Different missense mutation profiles were obtained for BMMs and PMMs. Apart from the large predominance of arginine in both sets of data, alanine is the second most frequently mutated amino acid in the BMM dataset, and glycine holds the same position among PMMs, as summarized in Fig. 1.
The prevalence of arginine among all the missense mutations of our two data sets has already been observed [27] and ascribed mainly to the frequent presence of the 5’-CpG dinucleotide along the genomic DNA sequence. The CG moiety in genomic DNA is prone to TG or CA mutations, due to the deamination of 5-methylcytosine [28]. Indeed, four of the six arginine codons include the CG dinucleotide. Thus, CG/TG and CG/CA mutations yield the replacements of R with C, H, Q and W, which are the most abundant ones, see Tables 1 and 2. Alanine, proline, serine and threonine each also have one codon containing the CG dinucleotide, which accounts, at least in part, for the amino acid occurrence profile in the BMM dataset. The pathological effects of glycine replacements cannot be explained on the basis of DNA sequences, and a structural analysis of the collected data is needed, vide infra.
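The codon-level reasoning above can be reproduced with a minimal sketch. It assumes the standard genetic code and only considers CG dinucleotides lying entirely within one codon (CG pairs spanning two adjacent codons are ignored, which is a simplification); stop codons are discarded because they are not missense changes. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def cpg_mutants(codon: str) -> list:
    """Codons obtained by CG->TG or CG->CA at a CG found inside the codon."""
    mutants = []
    for i in range(2):  # a dinucleotide can start at position 0 or 1
        if codon[i:i + 2] == "CG":
            mutants.append(codon[:i] + "TG" + codon[i + 2:])
            mutants.append(codon[:i] + "CA" + codon[i + 2:])
    return mutants


def cpg_codons(amino_acid: str) -> list:
    """Codons of the given amino acid that contain a CG dinucleotide."""
    return [c for c, aa in CODON_TABLE.items() if aa == amino_acid and "CG" in c]


def cpg_replacements(amino_acid: str) -> set:
    """Missense replacements reachable through CG->TG or CG->CA changes.
    Synonymous changes and stop codons ('*') are left out."""
    replacements = set()
    for codon in cpg_codons(amino_acid):
        for mutant in cpg_mutants(codon):
            new_aa = CODON_TABLE[mutant]
            if new_aa not in (amino_acid, "*"):
                replacements.add(new_aa)
    return replacements


arginine_cpg_codons = cpg_codons("R")      # CGT, CGC, CGA, CGG
arginine_targets = cpg_replacements("R")   # C, H, Q and W
alanine_targets = cpg_replacements("A")    # GCG -> GTG gives V; GCG -> GCA is synonymous
```

Under these assumptions the enumeration returns exactly C, H, Q and W for arginine, while alanine, proline, serine and threonine each have a single CG-containing codon with one missense outcome.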
3.3 Structural analysis of benign and pathogenic missense mutations
From the PISA database [18] we derived the topology of 3,018 BMMs and 5,641 PMMs. The structurally characterized PMMs outnumber the BMMs even though BMMs are more numerous than PMMs in ClinVar, underlining that structural biologists are predominantly concerned with proteins involved in diseases. For all the structurally characterized missense mutations, we also analyzed the solvent accessible surface areas by using POPS [20], labeling as surface-exposed those amino acids having an exposed area higher than 20%. Table 3 summarizes our results for arginine and glycine, the two amino acids most involved in pathological mutations. We also report the topological distributions obtained for the other eighteen amino acids taken together, which can be considered as average reference values. From a preliminary overview of Table 3, it is apparent that, among PMMs, arginine is more abundant than glycine and all the other amino acids both in PISA-defined interfaces and on protein surfaces. Furthermore, arginine is frequently located in protein-DNA interfaces, well above the PMM average.