Introduction

Rett syndrome (RTT) is a rare neurological disorder that occurs predominantly in females and was first described in 1966 by Andreas Rett (Rett, 1966). In most cases, the disorder is caused by a loss-of-function variation in the X-linked gene MECP2 (methyl-CpG-binding protein 2) (Amir et al., 1999; Percy et al., 2007). Function-affecting variations in several other genes can also cause an RTT phenotype; several of these genes act directly upstream or downstream of MECP2 or are involved in the same biological processes (Ehrhart, Sangani, & Curfs, 2018; Lopes et al., 2016; Lucariello et al., 2016; Sajan et al., 2017; Vidal et al., 2017).
The disorder usually progresses through four stages. In the first stage, pre- and postnatal development are typically almost normal. In stage two, at the age of about 6 to 18 months, a deceleration and eventual arrest of motor and communication development become apparent. In the third stage, patients are usually stable, and typical phenotypes include moderate to severe intellectual disability, lack of motor and (oral) communication skills, abnormal breathing patterns, sleep problems, and stereotypic movements (hand wringing). Further, due to dystonia, patients often develop scoliosis (Neul et al., 2010). During development, patients often appear to have autistic features due to the lack of communication skills, which is why the disorder is often misclassified within the autism spectrum. In stage four, motor abilities continue to decline slowly while social and communication skills improve. The spectrum and development of the RTT phenotype have been investigated in several large natural history studies (Percy et al., 2010; Weaving et al., 2003). Phenotype severity is thought to vary due to X-inactivation, mosaicism, severity of the variation (loss of function vs. impaired function), genetic background (Pizzo et al., 2018, and literature cited therein), and environmental factors.
On the molecular level, the MECP2 protein recognizes and binds to specific methylated and hydroxymethylated DNA regions and attracts several other proteins to form a transcription repression block. This block makes the DNA sequence accessible to histone deacetylases, which increase the packing density of these regions and thereby reduce their transcriptional activity (Nan et al., 1998). Thus, MECP2 represses transcription on the level of chromatin organization. MECP2 has several phosphorylation sites; when these are phosphorylated, e.g., after an incoming electrical signal in a neuron, MECP2 releases the DNA and allows gene transcription (Ebert et al., 2013; Tao et al., 2009). As MECP2 regulates the expression of many genes, the molecular downstream effects are very broad (Ehrhart et al., 2016; Liyanage & Rastegar, 2014). Several meta-studies on omics data revealed that MECP2 predominantly affects dendritic connectivity, synapse function, glial cell differentiation, mitochondrial function, mRNA processing and translation, inflammation, and the cytoskeleton (Bedogni et al., 2014; Ehrhart et al., 2018; Shovlin & Tropea, 2018).
The MECP2 protein has five domains: the N-terminal domain (NTD), the methyl-DNA binding domain (MBD), the transcription repressor binding domain (TRD), the intermediate domain between the MBD and the TRD, also called the interdomain (ID), and the C-terminal domain (CTD) (Adams, McBryant, Wade, Woodcock, & Hansen, 2007). Ballestar and coworkers found that MECP2 variations that slightly decrease the specific recognition of the binding site on DNA are able to cause RTT (Ballestar et al., 2005). The majority of RTT-causing missense variations are found in the methyl-DNA binding domain, but RTT-causing variations have been found in all parts of the protein (Christodoulou, Grimm, Maher, & Bennetts, 2003). Some studies have found a distinct correlation between phenotype severity and variation type (Neul et al., 2008), while others found only a small or insignificant correlation (Amir et al., 2000; Auranen et al., 2001; Huppke, Laccone, Kramer, Engel, & Hanefeld, 2000; Nielsen et al., 2001).
Due to the rarity of RTT (prevalence about 1:10,000 (Laurvick et al., 2006)), it is important to share and communicate information about disease-causing variations to increase the success of identifying genetic causes. In a previous study, we investigated the status of genotype-phenotype databases and the methods that different resources use to share newly identified genetic variants, using RTT as an example (Townend et al., 2018). We identified thirteen genotype-phenotype databases that are used to collect and share genetic variants annotated with observed or predicted effects. Our main conclusion was that databases store and provide information in very different ways, such that it is currently technically infeasible to query multiple databases and combine the results in an efficient and automated way. In line with the IRDiRC aims for rare diseases (http://www.irdirc.org/about-us/vision-goals/), the bioinformatics infrastructure should help to store, curate, and make available data about known disease-causing and benign variations. Therefore, the interoperability of these databases needs to improve so that their contents can be used efficiently in combination.
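To illustrate the kind of harmonization that combining such resources requires, the following minimal Python sketch maps records from two differently structured sources onto a shared layout and merges them by variant. It is an illustration only, not the procedure used in this study: the sources "A" and "B", their field names, the classification labels, and the helper functions are all hypothetical, and the variants are included purely as examples.

```python
# Two hypothetical sources with different field names and label vocabularies.
# Neither layout reflects the schema of any real database.
source_a = [
    {"hgvs_c": "c.473C>T", "class": "pathogenic"},
    {"hgvs_c": "c.502C>T", "class": "pathogenic"},
]
source_b = [
    {"cDNA": "NM_004992.3:c.473C>T", "effect": "Disease causing"},
    {"cDNA": "NM_004992.3:c.763C>T", "effect": "Disease causing"},
]

# Assumed mapping from source-specific labels to one shared vocabulary.
LABELS = {
    "pathogenic": "pathogenic",
    "disease causing": "pathogenic",
    "benign": "benign",
}


def normalise_hgvs(value):
    """Drop an optional 'transcript:' prefix so identical variants compare equal.

    This is a simplification: it assumes all records refer to the same
    reference transcript, which must be checked for real data.
    """
    return value.split(":", 1)[-1].strip()


def harmonise(record, variant_key, label_key, source):
    """Map one source-specific record onto the shared layout."""
    return {
        "variant": normalise_hgvs(record[variant_key]),
        "classification": LABELS[record[label_key].lower()],
        "source": source,
    }


def merge(*harmonised_lists):
    """Group harmonised records by variant, keeping track of provenance."""
    merged = {}
    for records in harmonised_lists:
        for rec in records:
            entry = merged.setdefault(
                rec["variant"], {"classifications": set(), "sources": set()}
            )
            entry["classifications"].add(rec["classification"])
            entry["sources"].add(rec["source"])
    return merged


records_a = [harmonise(r, "hgvs_c", "class", "A") for r in source_a]
records_b = [harmonise(r, "cDNA", "effect", "B") for r in source_b]
combined = merge(records_a, records_b)

for variant in sorted(combined):
    print(variant, sorted(combined[variant]["sources"]))
```

Keeping the set of sources and classifications per variant, rather than overwriting them, makes conflicting annotations between resources visible instead of silently resolving them.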
In this study, we show how to integrate the available RTT genetic and phenotypic data across multiple databases and how to use the integrated data for further analysis of RTT, namely to investigate variant abundance and distribution and to test variant effect prediction algorithms. We followed the FAIRification workflow (Jacobsen et al., 2020) to make the data more findable, accessible, interoperable, and reusable for computer processing. In line with the FAIR data point specification, a combination of the DCAT and Re3Data vocabularies was used to describe the data set [https://github.com/FAIRDataTeam/FAIRDataPoint-Spec/blob/v0.1.0/spec.md]. The resulting ‘FAIR data point’ refers to two distribution formats: one in RDF and one in CSV. RDF was used to create a self-describing, machine-interpretable version of the data using existing global ontologies. The CSV distribution is also shared on Figshare (see DOI in results). To our knowledge, the combined data created and used in this study are the largest collection of disease-causing and benign MECP2 variations available at this moment.
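As a rough illustration of what a DCAT description of one data set with two distributions can look like, the following Python sketch emits a handful of N-Triples using only the standard library. It is a simplified sketch under stated assumptions, not the metadata published with this study: the base IRI, the identifiers, the title, and the choice of `text/turtle` as the media type of the RDF distribution are hypothetical, and the Re3Data terms as well as most fields required by the FAIR data point specification are omitted.

```python
DCAT = "http://www.w3.org/ns/dcat#"
DCT = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
IANA = "https://www.iana.org/assignments/media-types/"
BASE = "https://example.org/fdp/"  # hypothetical base IRI, for illustration only


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value):
    """Quote a plain string literal, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def describe_dataset(dataset_id, title, distributions):
    """Return N-Triples describing a dcat:Dataset and its dcat:Distributions.

    `distributions` is a list of (name, media_type) pairs.
    """
    dataset = iri(BASE + "dataset/" + dataset_id)
    triples = [
        (dataset, iri(RDF_TYPE), iri(DCAT + "Dataset")),
        (dataset, iri(DCT + "title"), literal(title)),
    ]
    for name, media_type in distributions:
        dist = iri(BASE + "distribution/" + dataset_id + "-" + name)
        triples += [
            (dataset, iri(DCAT + "distribution"), dist),
            (dist, iri(RDF_TYPE), iri(DCAT + "Distribution")),
            (dist, iri(DCAT + "mediaType"), iri(IANA + media_type)),
        ]
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Hypothetical identifiers and title; the media types are assumptions.
ntriples = describe_dataset(
    "mecp2-variants",
    "Example MECP2 variant collection",
    [("rdf", "text/turtle"), ("csv", "text/csv")],
)
print(ntriples)
```

Describing both distributions under a single data set lets a client discover from the same metadata record that the RDF and CSV files carry the same content in different formats.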