Discussion

Added value of integrating data across different sources

To our knowledge, this is the first study that integrates genetic variation data on MECP2 from multiple databases. Despite the best efforts of individual sources to reach the largest possible coverage, our results demonstrate that the number of usefully annotated variants increases when databases are combined. The greatest advantage of the integrated approach is therefore that more variants become available for further research and diagnosis. This is especially valuable for rare diseases, which have relatively small study populations. Mapping to a common reference sequence makes the information from different sources comparable and brings us closer to the “true” number of known variants. In this study, we were able to increase the previously estimated number of unique RTT-causing sequence variations from a few hundred to 863. However, databases, at least the active ones, are updated regularly and continue to receive new data. Over the course of this study, the number of variants in RettBase, for example, increased from 4738 (March 2018; Townend et al., 2018) to 4757 (November 2018) and to 4806 (NM_004992.3, April 2020). Consequently, the figure of 863 known RTT-causing variants is likely to be outdated by the time this study is published. We argue that it is unrealistic to assume that any single database will ever be completely comprehensive unless it automatically pulls in updates from other databases. One possible contribution to solving this problem would be to create the combined list of pathogenic variants with automated workflows that find and summarize data from across databases, on demand or continuously. To make that possible, the way databases provide data for machine processing needs to be standardized. The role of the FAIR data principles in achieving this is discussed in more detail later.
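The idea of combining sources after mapping to a common reference can be illustrated with a minimal sketch. It is not the workflow used in this study: the source names, the toy records, and the helper functions `normalize` and `merge_variants` are hypothetical, and the sketch assumes that all records have already been mapped to the same reference transcript.

```python
from collections import defaultdict


def normalize(hgvs: str) -> str:
    """Remove all whitespace so trivially different spellings collapse.

    Assumption: records are already expressed on one reference transcript;
    real normalization (e.g. shifting indels) is out of scope for this sketch.
    """
    return "".join(hgvs.split())


def merge_variants(sources: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each normalized variant to the set of sources that report it."""
    merged: dict[str, set[str]] = defaultdict(set)
    for source_name, variants in sources.items():
        for variant in variants:
            merged[normalize(variant)].add(source_name)
    return dict(merged)


# Toy records for illustration only; they are not taken from any database.
toy_sources = {
    "source_a": ["NM_004992.3:c.473C>T", "NM_004992.3:c.502C>T"],
    "source_b": ["NM_004992.3:c.502C>T", "NM_004992.3: c.763C>T"],
    "source_c": ["NM_004992.3:c.763C>T", "NM_004992.3:c.808C>T"],
}

merged = merge_variants(toy_sources)

# Size of the largest single source versus the size of the combined list.
largest_single = max(
    len({normalize(v) for v in variants}) for variants in toy_sources.values()
)
print(f"largest single source: {largest_single}, combined: {len(merged)}")
```

In this toy example every source holds two variants, while the combined list holds four, which mirrors the observation that no single source is comprehensive. Keeping the set of reporting sources per variant also preserves provenance for later curation.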
This integrated dataset makes it possible to study the abundance and prevalence of certain variations in a larger population than any of the study populations published before. Several studies on relatively small (Das, Raha, Sanghavi, Maitra, & Udani, 2013; Inui et al., 2001) or large populations (e.g. Bienvenu et al., 2002; Percy et al., 2010) have published their data in previous years. Bienvenu et al. (2002) analysed 301 different MECP2 alleles in a French population and found 69 different variations, which account for 64% of RTT cases. They identified NP_004983.1:p.R168*, R255*, R270*, T158M, and R306C (Table 5) as the most abundant variations, while 59 variations were found in only one or two patients. In the US natural history study (819 participants; Percy et al., 2010), the variations R106W, R133C, T158M, R168*, R255*, R270*, R294*, and R306C were responsible for more than 60% of RTT cases. The MECP2 variation content of RettBase was recently analyzed by Krishnaraj et al. (2017), who found that the following eight hotspot variations are responsible for 47% of RTT cases (the total number of MECP2 entries, disease-causing and benign, was 4668 at that time): R106W, R133C, T158M, R168*, R255*, R270*, R294*, and R306C. Percy et al. (2007) provide information about eleven more datasets from different countries.
Although our study resulted in a different ranking of the eight hotspots, we could confirm them as the most abundant variations; together they account for 54.6% of all RTT-causing database entries in our dataset. All eight hotspot mutations are C>T transitions; in seven of the eight cases they change an arginine to a stop codon, cysteine, or tryptophan, changes that are highly likely to alter the 3D structure of the protein. The particular vulnerability of certain cytosine positions to errors in base excision repair has been described before (Wang, Tang, Lai, & Zhang, 2014).
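How a hotspot share and ranking of this kind can be derived from a list of database entries is sketched below. The eight hotspot labels are those named in the text; the function `hotspot_share`, the toy entry list, and the placeholder labels `other_a` to `other_c` are hypothetical and do not reproduce the counts of this study.

```python
from collections import Counter

# The eight hotspot variations named in the text (protein-level short form).
HOTSPOTS = {"R106W", "R133C", "T158M", "R168*", "R255*", "R270*", "R294*", "R306C"}


def hotspot_share(entries: list[str], hotspots: set[str] = HOTSPOTS):
    """Return (fraction of entries that are hotspots, hotspots ranked by count)."""
    counts = Counter(entries)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("entry list is empty")
    hotspot_total = sum(n for variant, n in counts.items() if variant in hotspots)
    # most_common() sorts by descending count; ties keep first-seen order.
    ranking = [variant for variant, _ in counts.most_common() if variant in hotspots]
    return hotspot_total / total, ranking


# Toy entries for illustration only: one list item per database entry.
toy_entries = (
    ["T158M"] * 3
    + ["R168*"] * 2
    + ["R306C"]
    + ["other_a"] * 2
    + ["other_b", "other_c"]
)

share, ranking = hotspot_share(toy_entries)
print(f"hotspot share: {share:.1%}, ranking: {ranking}")
```

Counting per database entry rather than per unique variant is the choice that makes the share comparable to figures such as the 54.6% reported above, since a frequently reported variant contributes once per entry.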