As appearing on eLife Labs

Authors: Lilly Winfree, Julie McMurry, Melissa Haendel

The Monarch Initiative is an international consortium composed of an eclectic mix of biologists, ontologists, programmers, geneticists, and clinicians; we take disparate genetic and phenotypic data and integrate it for disease discovery and diagnosis. Toward this end, we have developed a web portal (https://monarchinitiative.org) and numerous tools and APIs that we want to share with scientists across disciplines.

At Monarch, the data we are unifying are diverse and describe fundamentally different kinds of observations. Our integrated corpus currently contains data from 35+ sources and 100+ genera, spanning data types such as gene expression and variation, disease associations, and many kinds of phenotypes, each represented together with the type of evidence that supports it (e.g., a PMID, a traceable author statement, or a sequence-similarity score). But we don't only integrate the data - we also provide more complex algorithmic associations, such as phenotypic enrichment and statistical modeling, all with the goal of illuminating new connections within these data.

Why are we concerned with programmatically and computationally integrating these data? As an example use case, take a physician whose patient has been diagnosed with Fanconi Anemia (FA). The physician is interested in a potential new treatment and searches the internet for FA literature using the patient's symptoms, "skeletal anomalies of the hips, spine", as keywords. However, this same feature of FA is described as "kinked tail" in mouse models of FA, so the physician's search misses much of the relevant literature, and knowledge transfer suffers.
In our ideal world, the physician would search our integrated data on the Monarch web portal and find relevant results drawing on both human and model-organism data, and that integrated view could better inform the physician.

However, integrating these data is easier said than done. Each piece of data can be thought of as a puzzle piece: it has its own colors and contours, but it is difficult to see the big picture until all the pieces are assembled. The purpose of data integration is insight, not raw connections. When integrating data, we first aggregate everything, but aggregation alone is insufficient to solve the puzzle. We cannot simply dump all the puzzle pieces together; we must fit them together correctly, and that assembly can be very difficult without the right model.
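To make the "kinked tail" example concrete, here is a toy sketch of why mapping species-specific phenotype terms onto shared cross-species concepts recovers matches that keyword search misses. This is not Monarch's actual algorithm (Monarch uses richer semantic-similarity methods over ontologies such as HPO and MP); the mapping table, term labels, and the simple Jaccard score here are all illustrative assumptions.

```python
# Toy cross-species phenotype matching (illustrative only; Monarch's
# real pipeline uses ontology-based semantic similarity, not this).

# Hypothetical mapping of species-specific phenotype terms to shared
# cross-species concepts; in practice such mappings come from
# ontologies, not a hand-written dictionary.
CROSS_SPECIES = {
    "HP:skeletal anomalies of the spine": "shared:axial skeleton anomaly",
    "MP:kinked tail": "shared:axial skeleton anomaly",
    "HP:short stature": "shared:reduced body size",
    "MP:decreased body length": "shared:reduced body size",
}

def normalize(terms):
    """Map species-specific terms onto shared concepts."""
    return {CROSS_SPECIES.get(t, t) for t in terms}

def similarity(a, b):
    """Jaccard similarity between two normalized phenotype profiles."""
    a, b = normalize(a), normalize(b)
    return len(a & b) / len(a | b) if a | b else 0.0

patient = ["HP:skeletal anomalies of the spine", "HP:short stature"]
mouse_model = ["MP:kinked tail", "MP:decreased body length"]

# A plain keyword comparison finds zero overlap between these profiles;
# after normalization to shared concepts they match perfectly.
print(similarity(patient, mouse_model))  # -> 1.0
```

The point of the sketch is the `normalize` step: once human and mouse terms resolve to the same concept, the mouse literature becomes discoverable from the patient's symptoms.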
ABSTRACT

A central tenet in support of research reproducibility is the ability to uniquely identify research resources, i.e., reagents, tools, and materials that are used to perform experiments. However, current reporting practices for research resources are insufficient to identify the exact resources that are reported or to answer basic questions such as "How did other studies use resource X?". To address this issue, the Resource Identification Initiative was launched as a pilot project to improve the reporting standards for research resources in the methods sections of papers and thereby improve identifiability and reproducibility. The pilot engaged over 25 biomedical journal editors from most major publishers, as well as scientists and funding officials. Authors were asked to include Research Resource Identifiers (RRIDs) in their manuscripts prior to publication for three resource types: antibodies, model organisms, and tools (i.e., software and databases). RRIDs are assigned by an authoritative database, for example a model organism database, for each type of resource. To make it easier for authors to obtain RRIDs, resources were aggregated from the appropriate databases and their RRIDs made available in a central web portal (scicrunch.org/resources). RRIDs meet three key criteria: they are machine readable, free to generate and access, and consistent across publishers and journals. The pilot was launched in February of 2014, and over 300 papers have appeared that report RRIDs. The number of participating journals has expanded from the original 25 to more than 40. Here, we present an overview of the pilot project and its outcomes to date. We show that authors are able to identify resources and are supportive of the goals of the project. Identifiability of the resources post-pilot showed a dramatic improvement for all three resource types, suggesting that the project has had a significant impact on reproducibility relating to research resources.
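Since the abstract emphasizes that RRIDs are machine readable, a short sketch can show what that buys in practice: a simple pattern suffices to pull RRIDs for all three pilot resource types out of a methods section. The regex below is a minimal assumption-laden approximation of the RRID syntax (it covers only `AB_`, `SCR_`, and `IMSR_...:` prefixes), and the specific accession numbers in the example are illustrative.

```python
import re

# Minimal RRID extraction from free text (a sketch, not a complete
# grammar of RRID syntax). Covers the pilot's three resource types:
# antibodies (AB_...), tools (SCR_...), and model organisms (IMSR_...).
RRID_PATTERN = re.compile(r"RRID:\s?((?:AB|SCR)_\d+|IMSR_[A-Z]+:\d+)")

def extract_rrids(text):
    """Return all RRID accessions found in a block of text."""
    return RRID_PATTERN.findall(text)

# Illustrative methods-section sentence; accessions are for example only.
methods = (
    "Cells were stained with anti-GFP (RRID:AB_2313773), images were "
    "analyzed in ImageJ (RRID:SCR_003070), and mice were C57BL/6J "
    "(RRID:IMSR_JAX:000664)."
)
print(extract_rrids(methods))
# -> ['AB_2313773', 'SCR_003070', 'IMSR_JAX:000664']
```

Because the identifiers follow a consistent, journal-independent syntax, the same few lines work on any participating publisher's text, which is exactly the aggregation that a portal like scicrunch.org/resources relies on.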