Introduction
Rett syndrome (RTT) is a rare neurological disorder, first described in
1966 by Andreas Rett, that occurs predominantly in females (Rett, 1966). In
most cases, the disorder is caused by a loss-of-function variation in
the X-linked gene MECP2 (methyl-CpG-binding protein 2) (Amir et al.,
1999; Percy et al., 2007). Function-affecting variations in several
other genes can also cause a RTT phenotype; several of these genes are
directly involved in the upstream or downstream pathways of MECP2 or in
the same biological processes (Ehrhart, Sangani, & Curfs, 2018; Lopes et
al., 2016; Lucariello et al., 2016; Sajan et al., 2017; Vidal et al.,
2017).
The disorder usually develops in four stages. In the first stage, pre-
and postnatal development typically appear almost normal. In stage two,
however, at the age of about 6-18 months, a deceleration and arrest of
motor and communication learning become apparent. In the third stage,
patients are usually stable, and typical phenotypes include moderate to
severe intellectual disability, lack of motor and (oral) communication
skills, abnormal breathing patterns, sleep problems, and stereotypic
movements (e.g., hand wringing). Further, due to dystonia, they often
develop scoliosis (Neul et al., 2010). During development, the patients
often appear to have autistic features due to the lack of communication
skills, which is why the disorder is often misclassified within the
autism spectrum. In stage four, motor abilities continue to decline
slowly while social and communication skills improve. The spectrum and
development of RTT
patients’ phenotypes were investigated in several large natural history
studies (Percy et al., 2010; Weaving et al., 2003). Phenotype
severity is thought to vary generally due to X-inactivation, mosaicism,
severity of the variation (loss of function vs. impaired function),
genetic background (Pizzo et al., 2018, and literature cited therein),
and environmental factors.
On the molecular level, the MECP2 protein recognizes and binds to
specific methylated and hydroxymethylated DNA regions, and attracts
several other proteins to form a transcription repression block. This
block makes the DNA sequence accessible to histone deacetylases, which
increase the packing density of these regions and thereby reduce their
transcriptional activity (Nan et al., 1998). Thus, MECP2 represses
transcription at the level of chromatin organization. MECP2 has several
phosphorylation sites; when these are phosphorylated, e.g., after an
incoming electrical signal in a neuron, MECP2 releases the DNA and allows
gene transcription (Ebert et al., 2013; Tao et al., 2009). As MECP2 regulates
the expression of many genes, the molecular downstream effects are very
broad (Ehrhart et al., 2016; Liyanage & Rastegar, 2014). Several
meta-studies on omics data revealed that MECP2 predominantly affects
dendritic connectivity, synapse function, glial cell
differentiation, mitochondrial function, mRNA processing and
translation, inflammation, and the cytoskeleton (Bedogni et al., 2014;
Ehrhart et al., 2018; Shovlin & Tropea, 2018).
The MECP2 protein has five domains: the N-terminal domain (NTD), the
methyl-DNA binding domain (MBD), the transcription repressor binding
domain (TRD), the intermediate domain between the MBD and the TRD, also
called the interdomain (ID), and the C-terminal domain
(CTD) (Adams, McBryant, Wade, Woodcock, & Hansen, 2007). Ballestar and
coworkers found that MECP2 variations that slightly decrease the
specific recognition of the binding site on DNA can cause RTT
(Ballestar et al., 2005). The majority of RTT-causing missense
variations are found in the methyl-DNA binding domain, but RTT-causing
variations have been found in all parts of the protein (Christodoulou,
Grimm, Maher, & Bennetts, 2003). Some studies have found a distinct
correlation between phenotype severity and variation type (Neul et al.,
2008), while others found only a small or insignificant correlation
(Amir et al., 2000; Auranen et al., 2001; Huppke, Laccone, Kramer,
Engel, & Hanefeld, 2000; Nielsen et al., 2001).
Due to the rarity of RTT (prevalence about 1:10,000 (Laurvick et al.,
2006)), it is important to share and communicate information about
disease-causing variations to increase the success of identifying
genetic causes. In a previous study, we investigated the status of
genotype-phenotype databases and the methods that different resources
use to share newly identified genetic variants, using RTT as an example
(Townend et al., 2018). Thirteen different genotype-phenotype databases
were identified that are used to collect and share genetic variants
annotated with observed or predicted effects. Our main conclusion was
that databases store and provide information in very different ways,
so that it is currently technically infeasible to query multiple
databases and combine the results in an efficient and automated way. In line with
the IRDiRC aims for rare diseases
(http://www.irdirc.org/about-us/vision-goals/),
the bioinformatics infrastructure should help to store, curate, and
make available data about known disease-causing and benign variations.
Therefore, the interoperability of these databases needs to improve so
that their contents can be used efficiently in combination.
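To make the interoperability problem concrete, the following minimal Python sketch shows what combining variant lists from several databases involves once their notations have been harmonized. It is not the method used in this study. The database names, the input strings, and the helper functions `normalize` and `merge` are hypothetical, and the normalization (trimming whitespace and dropping a reference-sequence prefix) is a deliberate simplification that assumes all sources describe variants on the same transcript.

```python
from collections import defaultdict
from typing import Dict, List, Set


def normalize(raw: str) -> str:
    """Reduce a variant string to its bare cDNA description, e.g. 'c.473C>T'.

    Assumption: every source refers to the same reference transcript, so the
    optional 'NM_...:' prefix can be dropped without losing information.
    """
    return raw.strip().split(":")[-1]


def merge(sources: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each normalized variant to the set of databases that list it."""
    merged: Dict[str, Set[str]] = defaultdict(set)
    for database, variants in sources.items():
        for raw in variants:
            merged[normalize(raw)].add(database)
    return dict(merged)


if __name__ == "__main__":
    # Hypothetical exports; which database lists which variant is invented.
    example = {
        "db_a": ["NM_004992.3:c.473C>T", "c.502C>T"],
        "db_b": [" c.473C>T ", "c.916C>T"],
    }
    for variant, databases in sorted(merge(example).items()):
        print(variant, sorted(databases))
```

In practice, each database would additionally need its own parser and a mapping of its effect annotations to a shared vocabulary, which is the part that currently prevents automated querying.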
In this study, we show how to integrate the available RTT genetic and
phenotypic data across multiple databases and how to use the integrated
data for further analysis of RTT, namely to investigate variant
abundance and distribution and to test variant effect prediction
algorithms. We followed the FAIRification workflow (Jacobsen et al.,
2020) to make the data more findable, accessible, interoperable, and
reusable for computer processing. In line with the FAIR data point
specification, a combination of the DCAT and Re3Data vocabularies was
used to describe the data set
[https://github.com/FAIRDataTeam/FAIRDataPoint-Spec/blob/v0.1.0/spec.md].
The resulting ‘FAIR data point’ refers to two distribution formats: one
in RDF and one in CSV. RDF was used to create a self-describing, machine
interpretable version of the data using existing global ontologies. The
CSV distribution is also shared on Figshare (see DOI in results). To our
knowledge, the combined data created and used in this study is the
largest collection of disease-causing and benign MECP2 variations
available at this moment.
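As an illustration of how a data set with two distributions can be described with DCAT, the sketch below generates a small Turtle document using only the Python standard library. It is a simplified assumption of what such metadata might look like, not the actual FAIR data point record of this study. All `example.org` URIs, the title, the choice of Turtle as the RDF serialization, and the helper `dataset_turtle` are placeholders, and the Re3Data terms mentioned above are omitted.

```python
from typing import List, Tuple

IANA = "https://www.iana.org/assignments/media-types/"


def dataset_turtle(dataset_uri: str, title: str,
                   distributions: List[Tuple[str, str]]) -> str:
    """Return a Turtle description of one dcat:Dataset and its distributions.

    `distributions` holds (distribution URI, media type) pairs. The title is
    inserted verbatim, so it must not contain double quotes or backslashes.
    """
    lines = [
        "@prefix dcat: <http://www.w3.org/ns/dcat#> .",
        "@prefix dcterms: <http://purl.org/dc/terms/> .",
        "",
        f"<{dataset_uri}> a dcat:Dataset ;",
        f'    dcterms:title "{title}" ;',
    ]
    refs = " ,\n        ".join(f"<{uri}>" for uri, _ in distributions)
    lines.append(f"    dcat:distribution {refs} .")
    for uri, media_type in distributions:
        lines += [
            "",
            f"<{uri}> a dcat:Distribution ;",
            f"    dcat:mediaType <{IANA}{media_type}> .",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder URIs and title, for illustration only.
    print(dataset_turtle(
        "https://example.org/dataset/mecp2-variants",
        "Example MECP2 variant collection",
        [
            ("https://example.org/distribution/rdf", "text/turtle"),
            ("https://example.org/distribution/csv", "text/csv"),
        ],
    ))
```

Describing both distributions under one data set record lets a client discover the machine-interpretable RDF version and the CSV version from the same entry point.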