As appearing on eLife Labs
Authors: Lilly Winfree, Julie McMurry, Melissa Haendel
The Monarch Initiative is an international consortium composed of an
eclectic mix of biologists, ontologists, programmers, geneticists, and
clinicians; we take disparate genetic and phenotypic data and integrate
it for disease discovery and diagnosis. Towards this end, we have
developed a web portal (https://monarchinitiative.org) and numerous
tools and new APIs that we want to share with scientists across
disciplines.
At Monarch, the data we are unifying is diverse and describes
fundamentally different kinds of observations. Our integrated data
corpus currently contains data from 35+ sources and 100+ genera,
spanning data types such as gene expression and variation, disease
associations, and many kinds of phenotypes, each represented together
with the type of evidence that supports it (e.g. a PMID, a traceable
author statement, or a genetic sequence similarity score). But we don’t
only integrate the data; we also provide more complex algorithmic
associations, such as phenotypic enrichment or statistical modeling,
all with the goal of illuminating new connections within this data.
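For readers who would like to explore these associations programmatically, here is a minimal sketch of what a query against the Monarch API might look like in Python. The endpoint path, identifier, and response fields shown are illustrative rather than guaranteed; the interface evolves, so please consult the current API documentation before relying on them.

```python
import requests

# Minimal sketch: fetch phenotype associations for a disease from the
# Monarch API. The endpoint path, parameters, and response fields below
# are assumptions based on the BioLink-style API; check the live docs at
# https://api.monarchinitiative.org for the current interface.
DISEASE_ID = "MONDO:0019391"  # Fanconi Anemia (illustrative identifier)
URL = f"https://api.monarchinitiative.org/api/bioentity/disease/{DISEASE_ID}/phenotypes"

response = requests.get(URL, params={"rows": 10}, timeout=30)
response.raise_for_status()

# Each association links the disease to a phenotype term with an ID and label.
for assoc in response.json().get("associations", []):
    phenotype = assoc.get("object", {})
    print(phenotype.get("id"), phenotype.get("label"))
```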
Why are we concerned with programmatically and computationally
integrating this data? As an example use case, take a physician whose
patient has been diagnosed with Fanconi Anemia (FA). The physician is
interested in a potential new treatment and searches the internet for
literature related to FA using the patient’s symptoms, “skeletal
anomalies of the hips and spine”, as keywords. However, this same
feature of FA is described as “kinked tail” in mouse models of FA, so
the physician’s keyword search returns fewer results and less knowledge
is transferred. In our ideal world, the physician would instead search
over our integrated data on the web portal and find relevant results
that include information from humans and other model organisms, and
that integrated data could better inform the physician.
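This cross-species retrieval is possible because species-specific phenotype terms are mapped to shared classes in a cross-species ontology, so a search on either term surfaces annotations from both organisms. The sketch below illustrates the idea in Python using deliberately simplified, hypothetical identifiers and a single shared class; it is a toy model of the mapping, not our actual ontology machinery.

```python
# Toy illustration of cross-species phenotype matching. The identifiers and
# the single shared class are hypothetical stand-ins used only to convey the
# idea of mapping species-specific terms to a common ontology class.
CROSS_SPECIES_MAP = {
    "HP:SKELETAL_SPINE": "UPHENO:AXIAL_SKELETON",  # human term (hypothetical ID)
    "MP:KINKED_TAIL": "UPHENO:AXIAL_SKELETON",     # mouse term (hypothetical ID)
}

ANNOTATIONS = [
    {"subject": "human Fanconi Anemia case report", "phenotype": "HP:SKELETAL_SPINE"},
    {"subject": "Fancd2 mouse model", "phenotype": "MP:KINKED_TAIL"},
]

def search(query_term: str) -> list[str]:
    """Return every annotated subject whose phenotype shares a class with the query."""
    target = CROSS_SPECIES_MAP.get(query_term)
    return [
        record["subject"]
        for record in ANNOTATIONS
        if CROSS_SPECIES_MAP.get(record["phenotype"]) == target
    ]

# Searching with the human term also surfaces the mouse model, and vice versa.
print(search("HP:SKELETAL_SPINE"))
```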
However, integrating this data is easier said than done. Each piece of
data can be thought of as a puzzle piece: it has its own colors and
contours, but it is difficult to see the big picture without assembling
all the pieces. The purpose of data integration is insight, not raw
connections. When integrating data, we first aggregate everything, but
aggregation alone is insufficient to solve the puzzle. We cannot simply
dump all the pieces into a pile; we must fit them together correctly,
and that assembly can be very difficult without the right model.