3. Results
3.1 Mapping amino acid replacements in human missense mutations.
Structural bioinformatics procedures aimed at finding signals that help
explain the mechanisms of benign and pathogenic mutagenesis, and the
role of single amino acids in this process, require suitable data sets.
Hence, we have selected ClinVar [24] as the basis for our
investigations, due to the massive variety of weekly updated information
on clinically relevant mutations that this databank offers. As of June
10, 2020, ClinVar reports 789,266 mutations, which can be directly
filtered to obtain 308,326 missense mutation items by applying the
"Molecular consequence" options offered by the ClinVar web home
page. All amino acid replacements found in the latter missense mutation
dataset are reported in Table 1, apart from those entries which did not
give either the natural amino acid or the replacing one. It is apparent
that several reported replacements are not expected from a
single-nucleotide change: 597 and 14 missense mutations would imply
changes between amino acids whose codons differ by two or three
nucleotides, respectively. Genome sequencing errors are probably
responsible for most of these findings, which will not be considered
further. Furthermore, the fact that 13 self-mutations are also included
in Table 1 suggests that some additional control is needed, such as the
"At least one star" option of the "Review status" filter among the
ClinVar filtering options.
Thus, we obtained a final number of 25,579 BMMs and 21,595 PMMs, which
we normalized, despite their similar sizes, to allow a direct comparison
of amino acid replacements in the two datasets. Afterward, to correlate
expected and experimentally obtained mutation distributions, we
compared the values of a scoring matrix for amino acid substitutions
with matrix elements obtained from the difference between BMM and PMM.
Among the large variety of PAM and BLOSUM substitution matrices, we
chose BLOSUM62 [25], which is the default option for several sequence
analysis procedures such as BLAST [26], see Table 2.
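
The codon-distance criterion used above to flag unexpected replacements can be illustrated with a short Python sketch. It assumes the standard genetic code and defines the distance between two amino acids as the smallest number of nucleotide differences between any pair of their codons. The function names are illustrative and are not taken from the procedures used in this work.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(amino_acid):
    """Return all codons encoding the given one-letter amino acid."""
    return [codon for codon, aa in CODON_TABLE.items() if aa == amino_acid]


def min_nucleotide_changes(aa_from, aa_to):
    """Smallest number of nucleotide differences between any codon of
    aa_from and any codon of aa_to (0 for a self-replacement)."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in codons_for(aa_from)
        for c2 in codons_for(aa_to)
    )


# Replacements with a distance of 2 or 3 cannot arise from one nucleotide change.
for pair in [("R", "C"), ("M", "W"), ("M", "Y")]:
    print(pair, min_nucleotide_changes(*pair))
```

Under this definition, a replacement with distance 0 corresponds to a self-mutation, and a distance above 1 marks an entry that a single-nucleotide variant cannot explain.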
Three interesting features emerge from Table 2: a) only replacements
among amino acids having codons with one nucleotide change are observed;
b) in both datasets the most frequent missense mutations involve
arginine; c) as expected, the normalized BMMs comprise 3,829 (38.29%)
favorable, 3,925 (39.25%) less favorable and 2,246 (22.46%) unfavorable
variants, whereas the PMMs show the opposite distribution, with 2,015
(20.15%) favorable, 3,362 (33.62%) less favorable and 4,623 (46.23%)
unfavorable variants.
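
The arithmetic behind point c) can be checked directly from the figures quoted above. The following Python sketch takes the three class counts as given (the criteria separating the classes are not restated here), confirms that both datasets sum to the same total, which from the quoted figures appears to be 10,000, and computes the BMM minus PMM difference for each class. Variable and function names are illustrative.

```python
# Normalized class counts as quoted in the text.
bmm = {"favorable": 3829, "less favorable": 3925, "unfavorable": 2246}
pmm = {"favorable": 2015, "less favorable": 3362, "unfavorable": 4623}


def percentages(counts):
    """Convert class counts into percentages of the dataset total."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


# Both datasets should share one total if they were normalized to a common size.
bmm_total = sum(bmm.values())
pmm_total = sum(pmm.values())

# Positive values mark classes enriched in BMMs, negative ones in PMMs.
difference = {label: bmm[label] - pmm[label] for label in bmm}

print(bmm_total, pmm_total)
print(percentages(bmm))
print(percentages(pmm))
print(difference)
```

Because the two totals coincide, the per-class differences sum to zero, so an excess of favorable variants among BMMs is necessarily balanced by an excess of the other classes among PMMs.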
3.2 Mapping amino acid replacements in benign and pathogenic missense mutations.
Different missense mutation profiles were obtained for BMMs and PMMs.
Apart from the large predominance of arginine in both sets of data,
alanine is the second most frequently mutated amino acid in the BMM
dataset, and glycine holds the same rank among PMMs, as summarized in
Fig. 1.
The prevalence of arginine among all the missense mutations of our two
data sets has already been observed [27] and ascribed mainly to the
frequent presence of the 5’CpG dinucleotide along the genomic DNA
sequence. The CG moiety in genomic DNA has been observed to be prone to
TG or CA mutations, due to the deamination of 5-methylcytosine [28].
Indeed, four of the six arginine codons include the CG dinucleotide.
Thus, CG/TG and CG/CA mutations yield the R/C, R/H, R/Q and R/W
replacements, which are the most abundant ones, see Tables 1 and 2.
Alanine, proline, serine and threonine each have one codon containing
the CG dinucleotide, which accounts, at least in part, for the amino
acid occurrence profile in the BMM dataset. The pathological effects of
glycine replacements cannot be discussed on the basis of DNA sequences,
and structural analysis of the collected data is needed, vide infra.
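
The link between CG-containing codons and the observed replacements can be reproduced with a minimal Python sketch. It assumes the standard genetic code and considers only CG dinucleotides that lie entirely within one codon; CpG sites spanning two adjacent codons are not modeled. Function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def cg_codons(amino_acid):
    """Codons of the given amino acid that contain a CG dinucleotide."""
    return [c for c, aa in CODON_TABLE.items() if aa == amino_acid and "CG" in c]


def cpg_mutated_codons(codon):
    """Codons obtained by replacing an in-codon CG with TG or CA."""
    mutated = set()
    for i in range(2):  # a dinucleotide can start at codon position 0 or 1
        if codon[i:i + 2] == "CG":
            for replacement in ("TG", "CA"):
                mutated.add(codon[:i] + replacement + codon[i + 2:])
    return mutated


def cpg_missense_products(amino_acid):
    """Amino acids reachable from amino_acid through CG/TG or CG/CA changes,
    excluding synonymous changes and stop codons."""
    products = set()
    for codon in cg_codons(amino_acid):
        for mutated in cpg_mutated_codons(codon):
            new_aa = CODON_TABLE[mutated]
            if new_aa not in (amino_acid, "*"):
                products.add(new_aa)
    return products


for aa in "RAPST":
    print(aa, cg_codons(aa), sorted(cpg_missense_products(aa)))
```

For arginine the sketch returns C, H, Q and W, in agreement with the replacements listed above. The CGA to TGA change produces a stop codon and is therefore excluded as a nonsense rather than a missense mutation.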
3.3 Structural analysis of benign and pathogenic missense mutations.
From the PISA database [18] we derived the topology of 3,018 BMMs and
5,641 PMMs. Although ClinVar contains more BMMs than PMMs, more PMMs
are structurally characterized, underlining that structural biologists
are predominantly concerned with proteins involved in diseases. For all
the structurally characterized missense mutations, we also analyzed the
solvent accessible surface areas by using POPS [20], labeling as
surface-exposed those amino acids having an exposed area higher than
20%. Table 3 summarizes our results for arginine and glycine, the two
amino acids that are most involved in pathological mutations. We also
report the topological distributions obtained for all the other
eighteen amino acids, which can be considered as average reference
values. From a preliminary overview of Table 3, it is apparent that,
among PMMs, arginine is more abundant than glycine and all the other
amino acids both in PISA-defined interfaces and on protein surfaces.
Furthermore, arginine is frequently located in protein-DNA interfaces,
well above the PMM average.
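
The surface-exposure labeling described above amounts to a simple threshold rule, sketched below in Python. The sketch assumes that the exposed area is expressed as a percentage of the residue surface and that the cutoff is strict (higher than 20%). The residue records are hypothetical values chosen only to illustrate the rule; they are not POPS output and do not come from our datasets.

```python
EXPOSURE_THRESHOLD_PERCENT = 20.0  # cutoff stated in the text


def is_surface_exposed(exposed_percent, threshold=EXPOSURE_THRESHOLD_PERCENT):
    """Label a residue as surface-exposed when its exposed area is
    strictly higher than the threshold (both in percent)."""
    return exposed_percent > threshold


def exposed_fraction(residues):
    """Fraction of residues labeled surface-exposed (0.0 for an empty list)."""
    if not residues:
        return 0.0
    exposed = sum(is_surface_exposed(r["exposed_percent"]) for r in residues)
    return exposed / len(residues)


# Hypothetical records, for illustration only.
example_residues = [
    {"residue": "ARG", "exposed_percent": 45.0},
    {"residue": "GLY", "exposed_percent": 5.0},
    {"residue": "ARG", "exposed_percent": 20.0},  # at the cutoff: not exposed
    {"residue": "ALA", "exposed_percent": 62.5},
]

print(exposed_fraction(example_residues))
```

With a strict inequality, a residue lying exactly at 20% is counted as buried; a different convention at the boundary would change only such borderline cases.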