COPO - Bridging the Gap from Data to Publication in Plant Science

Abstract

We present Collaborative Open Plant Omics (COPO), a brokering service between plant scientists and public repositories that enables the aggregation and publication of research outputs and provides easy access to services drawing on disparate sources of information, via web interfaces and Application Programming Interfaces (APIs). Users will be able to deposit their data, view and aggregate open access data from a variety of sources, seamlessly pull these data into suitable analysis environments such as Galaxy (Goecks 2010) or iPlant (Goff 2011), and subsequently track the outputs and their metadata in COPO. COPO streamlines the process of data deposition to public repositories by hiding much of the complexity of metadata capture and data management from the end user. The ISA infrastructure (Rocca-Serra 2010) is leveraged to provide the interoperability between metadata formats required for seamless deposition to public repositories and to facilitate links to data analysis platforms. Aggregated metadata are stored as Research Objects (Bechhofer 2010); logical groupings of Research Objects relating to a single body of work are represented in a common standard and are publicly queryable. COPO therefore generates, and facilitates access to, a large network of ontologically related metadata, which will develop over time to allow intelligent inference over open access Linked Data fragments, providing user-customised suggestions for future avenues of investigation and potential analyses.

Introduction

With the advent of high-throughput sequencing technologies, the cost of DNA sequencing has plummeted; indeed, the trend dwarfs even the long-standing and oft-quoted “Moore’s Law” in its magnitude (Hayden 2014). Increasingly, researchers are realising the benefits of data sharing, and most funding bodies and many journals require all data produced during research to be made publicly available at the time of publication. Beyond the obvious advantages of sharing data (reproducible science, better understanding of results, pooling of data for accuracy or comparative studies, stronger collaborative ties, and so on), it also makes the publication of spurious results less likely, such as the infamous case of Diederik Stapel, who is thought to have fraudulently published over 30 journal articles (Callaway 2011).

This deluge of new data has necessitated the development of services such as the European Nucleotide Archive (ENA, http://www.ebi.ac.uk/ena) and GenBank (http://www.ncbi.nlm.nih.gov/genbank/) for storing such enormous data sets. However, without careful labelling with appropriate metadata, data repositories risk becoming mere silos of unstructured information. Furthermore, even when appropriate metadata are supplied at deposition, search systems within the repositories are often sub-optimal, making it hard for researchers to find datasets for reuse. In a 2011 survey of approximately 1,700 researchers (Science Staff 2011), around 20% reported regularly using data sets larger than 100 gigabytes, and around 7% used data sets larger than 1 terabyte. Half of those polled did not use public repositories, choosing instead to store data privately in their organisation’s infrastructure; the lack of common metadata standards and archives was cited as a major issue, and most respondents had no funding to support archiving. Submission formats for public repositories are heterogeneous, often requiring the manual authoring of complex markup documents, which takes researchers out of their fields of expertise. Finally, when considering analysis of public datasets, modern high-throughput methods such as next-generation sequencing now produce more data than can easily be stored, let alone downloaded, making cloud-based analysis software highly desirable.

Therefore, in the first instance, COPO will attempt to solve these issues by providing responsive and intuitive interfaces for submitting research objects to public repositories and for moving these datasets into suitable analysis platforms. Using ISA-Tab and ISA tools technologies, these interfaces will take the form of both web pages and APIs for programmatic interaction with the system. COPO will also provide sophisticated ontology-powered search capabilities, making use of cutting-edge technologies such as MongoDB (www.mongodb.org), JSON-LD (http://json-ld.org) and Elasticsearch (www.elastic.co).
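To make this last point concrete, the short sketch below shows how a single dataset record might be expressed as a JSON-LD document of the kind that could be stored in MongoDB and indexed for search. This is a minimal illustration only: the schema.org vocabulary, the identifier, and every field value here are hypothetical assumptions, not COPO's actual metadata schema.

```python
import json

# A minimal, hypothetical example of a dataset metadata record as JSON-LD.
# The schema.org vocabulary and all field values are illustrative
# assumptions, not COPO's actual schema.
record = {
    "@context": "http://schema.org/",
    "@type": "Dataset",
    "@id": "http://example.org/copo/datasets/0001",  # hypothetical identifier
    "name": "Wheat root RNA-Seq time course",        # hypothetical study
    "description": "Example sequencing dataset record with linked metadata.",
    "keywords": ["plant", "transcriptomics", "RNA-Seq"],
    "creator": {"@type": "Person", "name": "A. Researcher"},
    "includedInDataCatalog": {
        "@type": "DataCatalog",
        "name": "European Nucleotide Archive",
        "url": "http://www.ebi.ac.uk/ena",
    },
}

# Serialise to JSON; documents in this shape can be stored unchanged in a
# document database such as MongoDB and indexed for full-text search.
print(json.dumps(record, indent=2))
```

Because JSON-LD documents are ordinary JSON, they can be inserted into a document store without transformation, while the @context keeps each field ontologically grounded for later Linked Data queries.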