COPO - Bridging the Gap from Data to Publication in Plant Science

Abstract

We present Collaborative Open Plant Omics (COPO), a brokering service between plant scientists and public repositories that enables the aggregation and publication of research outputs, and provides easy access to disparate sources of information via web interfaces and Application Programming Interfaces (APIs). Users will be able to deposit their data, view and aggregate open-access data from a variety of sources, seamlessly pull these data into suitable analysis environments such as Galaxy (Goecks 2010) or iPlant (Goff 2011), and subsequently track the outputs and their metadata in COPO. COPO streamlines the process of data deposition to public repositories by hiding much of the complexity of metadata capture and data management from the end user. The ISA infrastructure (Rocca-Serra 2010) is leveraged to provide the interoperability between metadata formats required for seamless deposition to public repositories and to facilitate links to data analysis platforms. Aggregated metadata are stored as Research Objects (Bechhofer 2010), with logical groupings of Research Objects relating to a body of work represented in a common standard and publicly queryable. COPO therefore generates, and facilitates access to, a large network of ontologically related metadata, which will grow over time to allow intelligent inference over open-access Linked Data fragments, providing user-customised suggestions for future avenues of investigation and potential analyses.

Introduction

With the advent of high-throughput sequencing technologies, the cost of DNA sequencing has plummeted; indeed, the trend dwarfs even the long-standing and oft-quoted “Moore’s Law” in its magnitude (Hayden 2014). Increasingly, researchers are realising the benefits of data sharing, and most funding bodies and many journals require all data produced during research to be made publicly available at the time of publication. Beyond the obvious advantages of sharing data (enabling reproducible science, enhancing understanding of results, pooling data for accuracy or comparative studies, strengthening collaborative ties, and so on), it also makes publication of spurious results less likely, such as the infamous case of Diederik Stapel, thought to have fraudulently published over 30 journal articles (Callaway 2011).

This deluge of new data has necessitated the development of services such as the European Nucleotide Archive (ENA, http://www.ebi.ac.uk/ena) and GenBank (http://www.ncbi.nlm.nih.gov/genbank/) for storing such enormous datasets. However, without careful labelling with appropriate metadata, data repositories risk becoming mere silos of unstructured information. Furthermore, even when appropriate metadata are supplied at deposition, search systems within the repositories are often sub-optimal, making it hard for researchers to find datasets to reuse. In a 2011 study of approximately 1,700 respondents (Science Staff 2011), around 20% reported regularly using datasets greater than 100 gigabytes, and around 7% used datasets greater than 1 terabyte. Half of those polled did not use public repositories, choosing instead to store data privately within their organisation’s infrastructure. A lack of common metadata and archives was cited as a major issue, and most respondents had no funding to support archiving. Submission formats to public repositories are heterogeneous, often requiring manual authoring of complex markup documents, taking researchers out of their fields of expertise. Finally, when considering analysis of public datasets, modern high-throughput methods such as next-generation sequencing now produce more data than can easily be stored, let alone downloaded, making cloud-based analysis software highly desirable.

Therefore, in the first instance, COPO will attempt to solve these issues by providing responsive and intuitive interfaces for submitting research objects to public repositories and getting these datasets into suitable analysis platforms. Using ISA-Tab and ISA tools technologies, these interfaces will take the form of both web pages and APIs for programmatic interaction with the system. COPO will provide sophisticated ontology-powered search capabilities, making use of cutting-edge technologies such as MongoDB (www.mongodb.org), JSON-LD (http://json-ld.org) and Elasticsearch (www.elastic.co).

Background

The need for interoperable information systems in plant science

Scientists favour effective services and dislike context switching, which wastes time. This is compounded by the sheer volume of research data being produced, and by the many different ways in which those data can be described, if they are described at all. Consistent and intuitive approaches are needed to demonstrate to researchers the benefits of structured data deposition.

Many data services exist to store, retrieve and analyse data, and some work has already been undertaken to connect data repositories to analysis platforms, e.g. the existing ENA to Galaxy/iPlant functionality. Cloud layers that offer Infrastructure-as-a-Service are too technical for typical plant science end users, and getting data into these platforms is unwieldy. It is the services offering Platform- or Software-as-a-Service that we wish to target for integration into a greater whole.

In discussions with plant researchers in academia and industry, it is clear that there is a need for a UK-focused bioinformatics resource that draws on international expertise, such as that of the iPlant Collaborative; helps to bring together species-based projects; reduces duplication of effort and prevents wasteful reinvention of tools and standards; and exploits and builds on existing, widely used resources and standards. Plant research faces bioinformatics challenges on several fronts. Only a few communities, e.g. Arabidopsis and wheat, are sufficiently large to have been funded at levels that support the production of data repositories and analysis tools. These are often bespoke to the community they support and can be metadata-unaware, resulting in operational incompatibilities. Ploidy and cross-based genetic data are not common in the mammalian field and thus represent a unique integrative challenge for the plant sciences. Furthermore, the lack, or absence, of relevant standards can hamper data sharing and reuse. Across the plant sciences, data of different types are generated, analysed and shared on a daily basis using a variety of tools, terminologies and formats. Without a common platform to encourage users to describe and tag their data in an agreed manner using appropriate standards, it is difficult to uncover and reuse sequence data and, in the case of proteomics and metabolomics, comparison between datasets can often be impossible. These and other barriers to data sharing and reuse were recently highlighted in a paper by Leonelli et al., which arose from a GARNet/EGENIS community workshop bringing together plant researchers working in a range of omics areas with publishers and funders to discuss current barriers and bottlenecks to data access, curation, storage and discoverability. COPO will provide a single, user-friendly platform allowing users to store, annotate, analyse and publish data, and will therefore help to overcome some of the issues and concerns raised by the plant science community.

Metadata interoperability

Data integration is hard because datasets are rarely described well enough to link them together. The ISA framework (Rocca-Serra 2010) is what makes the project feasible in terms of metadata interoperability: it provides a general-purpose model (Investigation, Study, Assay) and tooling that make applying standardised, structured descriptions to experimental data straightforward, and it supports conversion between the metadata formats required by different repositories.
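
To illustrate the kind of standardised description ISA enables, the sketch below builds a minimal study record with the open-source isatools Python API and serialises it to ISA-Tab; the identifiers and titles are invented for illustration, and exact API details may vary between isatools versions.

```python
# A minimal sketch, using the open-source isatools Python API, of building
# a standardised study description and serialising it to ISA-Tab. The
# identifiers and titles are invented, and API details may vary between
# isatools versions.
from isatools.model import Investigation, Study, OntologyAnnotation
from isatools import isatab

investigation = Investigation(identifier="copo-example-1")
study = Study(filename="s_wheat_study.txt",
              title="Example wheat sequencing study")

# Ontology annotations are what make the description machine-interpretable,
# allowing records to be linked across repositories.
study.design_descriptors.append(
    OntologyAnnotation(term="transcription profiling design"))
investigation.studies.append(study)

# Serialise to ISA-Tab, the format from which the ISA converters can
# produce repository-specific submission documents.
print(isatab.dumps(investigation))
```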

Brokering as efficient data management

An effectively managed data flow, with metadata tracked at every stage, is needed to lay the foundations for subsequent reproducibility and recomputability. By acting as a broker between researchers, repositories and analysis platforms, COPO can capture this metadata as data move through the system, rather than requiring it to be reconstructed after the fact.

COPO’s development is organised around four major integrative layers (Figure \ref{fig:COPO_PROPOSED}), outlined briefly here and described further in the Building the framework section below:

  1. COPO user interfaces

    • New user interfaces are required to facilitate interaction between the user and the ISA software suite for metadata annotation and raw data preparation. These take the form of web-based tools enabling consolidated access to a range of metadata repositories that are either already supported in the ISA framework or will be added within this project.

  2. COPO data / metadata deposition APIs and services

    • APIs are currently being developed to facilitate deposition of the data and metadata prepared in layer 1

    • APIs and services will be developed to allow programmatic access to deposited data and metadata, akin to the Ensembl APIs and the APIs of other resources

  3. COPO data publication APIs and services

    • APIs will be developed to facilitate submission of data and metadata to publication platforms, initially Scientific Data, GigaScience and F1000

    • APIs and services will be developed to allow programmatic access to published knowledge via accessions and search terms

  4. COPO developer / bioinformatician APIs and services

    • Low level APIs and services will be developed across the COPO framework to allow bioinformaticians to leverage the whole technology platform from their own programs

    • This will improve uptake of the framework, and the consolidated access to data provided by the COPO interoperability technologies (the APIs and services noted above) will be of significant benefit to the plant bioinformatics community

\label{fig:COPO_PROPOSED} The proposed COPO architecture, with the four major integrative layers to be developed numbered.

Building the framework

Rather than reinventing existing functionality, COPO builds on existing services and APIs to allow the aggregation of research objects into a broader body of work (see Figure \ref{fig:COPO_PROPOSED}). These research objects may include source code, sequence data, PDF documents, images, movie files, presentations, and so on. Collections of research objects are bundled together into a designated persistent container, providing permanent links between related but heterogeneous sources of information that would otherwise become disconnected once they leave the researcher’s computer.

Figure \ref{fig:COPO_ARCHITECTURE} shows a high-level schematic view of COPO. Two pathways exist through the system. The first, for anonymous users, allows querying only. Users can supply keywords to be searched in COPO’s metadata catalogue, and the supplied search terms can be augmented with semantically derived results. For instance, utilising the Plant Ontology [REF] and Crop Ontology [REF] would allow COPO to understand what a gene is and its contextual place in terms of related genes, the species in which the gene is observed, observed mutations, and so on. These connections and inferences allow a much richer set of results to be returned to the user than a simple text search would provide. When the user clicks on one of these results, they are taken to the COPO Profile in which the relevant research objects reside. Here they would see, for example, published manuscripts, original and processed sequence data, links to source code repositories, and analysis platform workflows. All of the information and materials are therefore available to the user, and the experiments described become far more reproducible.

The second pathway requires the user to log in. By using technologies such as ORCiD (http://orcid.org/), Twitter (https://twitter.com/) and OAuth (http://oauth.net/), a user will be able to log in to COPO using existing credentials, negating the need to create a new account and offloading some of the burden of security onto trusted, industry-standard methods. Additionally, services like ORCiD will enable COPO to federate a researcher’s existing information into their COPO Profile, such as professional contact details, previous publications and collaborators.

Once logged in, all of the query capability described above is available, as well as the ability to create COPO Profiles. A Profile can be thought of as the digital location of a complete body of research, fully attributed to one or more researchers. Responsive web forms allow the Profile to be properly labelled with metadata relating to creators, contributors, institutions, subject, sample, methodology, and so on. Collections of objects can then be added to a Profile, delimited by file type or function. For instance, one Collection may contain all the sequence data associated with a study; another may contain a number of published manuscripts; yet another might contain references to source code repositories or analysis workflows. When creating these Collections, essential metadata can be attributed which properly describe the objects therein. By taking this metadata and integrating it with existing ontologies, COPO not only indexes the research objects passing through, but semantically enriches them. They are no longer simply collections of unstructured, unrelated data, but entities described in terms of their similarity to other existing objects. It then becomes possible to make inferences about the kinds of things a researcher might be interested in, based on the samples, studies, manuscripts, source code, file types, abstracts, methodologies, references, and institutions residing within their research objects.
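
To make the notion of semantic enrichment concrete, the following sketch shows how a Collection’s user-supplied metadata might be held as a JSON-LD document with a free-text tag mapped to an ontology term; the field names, identifiers and ORCID are illustrative rather than COPO’s actual schema.

```python
# Illustrative only: a Collection's metadata rendered as JSON-LD, with a
# free-text tag mapped to a Plant Ontology term (PO:0025034, "leaf"). The
# field names, IDs and ORCID are hypothetical, not COPO's actual schema.
import json

collection_metadata = {
    "@context": {
        "dc": "http://purl.org/dc/terms/",
        "po": "http://purl.obolibrary.org/obo/",
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    },
    "@id": "https://example.org/copo/profiles/123/collections/1",
    "@type": "Collection",
    "dc:title": "Raw RNA-seq reads, wheat leaf samples",
    "dc:creator": {"@id": "https://orcid.org/0000-0000-0000-0000"},
    # Tagging "leaf" with its ontology term lets queries match related
    # plant structures by inference rather than by exact text.
    "plantStructure": {"@id": "po:PO_0025034", "rdfs:label": "leaf"},
}

print(json.dumps(collection_metadata, indent=2))
```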

Since COPO is a brokering service, the raw data within a Profile are not physically stored on its servers for extended periods, as they would be in an archival service. Rather, once a collection of research objects has been uploaded to and labelled within a COPO Profile, they are seamlessly deposited into the relevant public repositories. These repositories return unique accessions, which are then used to identify the deposited data files. The accessions are stored within the COPO Profile alongside the user-supplied metadata. If any of the data files are needed as part of a user query, they can be quickly downloaded again from the repository to COPO’s infrastructure, ready to be used in any subsequent analysis.

COPO’s efficient and intuitive web interfaces allow for the input of important metadata, and by minting DOIs (Digital Object Identifiers) that identify COPO Profiles, these metadata can be published as persistent, first-class entities on the Internet. A resolution service, such as dx.doi.org, will direct users back to the COPO Profile identified by the DOI. Therefore, whether the DOI appears alongside a dataset, a paper or a code repository, the user can see all related research objects in a single view without having to search repositories and websites individually. Because DOIs are persistent identifiers, the system will alleviate the problem of “link decay”, which occurs when resources referenced by a URL become permanently unavailable. Mapping DOIs to research objects additionally allows usage of these objects to be directly tracked and referenced, enabling researchers to be properly recognised and credited for all the outputs they produce, not just via the typical route of publishing a paper, which is not a comprehensive, truly digitally accessible representation of a whole body of work. Since depositing objects through COPO naturally builds a large interconnected graph of metadata, ontological inferences can be made and used to suggest, for example, further literature searches, experimental procedures or comparable datasets.
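
As an illustration of the minting step described above, the sketch below registers a DOI-to-URL mapping through a registration agency; we use DataCite’s MDS API as an example, with DataCite’s test prefix and placeholder credentials, and the prerequisite metadata-XML registration step is elided.

```python
# Illustrative sketch of DOI minting, using DataCite's MDS API as an
# example registration agency. 10.5072 is DataCite's test prefix; the
# credentials and Profile URL are placeholders, and the prerequisite
# metadata-XML registration step is elided.
import requests

doi = "10.5072/copo.profile.123"
profile_url = "https://example.org/copo/profiles/123"

response = requests.post(
    "https://mds.datacite.org/doi",
    auth=("DATACENTRE.USER", "password"),  # placeholder credentials
    headers={"Content-Type": "text/plain;charset=UTF-8"},
    data=f"doi={doi}\nurl={profile_url}",
)
response.raise_for_status()
# A resolver such as dx.doi.org will now redirect the DOI to the Profile.
```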

This seamless interaction of deposition, metadata labelling, semantically-enabled searching and data attribution represents a novel way of gluing together existing services to greatly enhance what is currently available to plant scientists, thereby enabling them to find more relevant information more quickly. Most importantly, it enables them to deposit their research outputs into the public domain with little effort and get credited for doing so.

COPO Submission APIs

COPO builds on a rich set of APIs to facilitate the submission of research objects (e.g. raw sequencing data) to disparate data stores and repositories. These APIs are managed transparently within COPO to lift the burden of data deposition and transfer away from the user of the system. The issues COPO’s submission APIs attempt to address include, but are not limited to, conversion between different formats (e.g. tab-delimited formats and XML (Rocca-Serra 2010)), “big data” transfer issues (e.g. delay overheads, data integrity and privacy), and support for reproducing a piece of research or performing analysis on deposited data. Figure \ref{fig:COPO_ARCHITECTURE} highlights the interaction of COPO with existing data infrastructure, the interoperability of which is made possible through the use of these APIs. In what follows, we discuss the specific APIs enabled by COPO for submission to the different repositories and data stores shown in that figure.

ENA Submission APIs

The European Nucleotide Archive (ENA, http://www.ebi.ac.uk/ena) is an established repository for nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation. The provision of nucleotide sequence data to ENA is currently regarded as a mandatory step in the dissemination of research outputs to the scientific community. ENA’s data submission workflow requires that data files be uploaded before they can be submitted, and ENA supports several software components to assist users in doing so. Among them is the fasp transfer technology developed by Aspera (http://asperasoft.com), which is especially recommended by ENA for long-distance file transfers. fasp eliminates fundamental shortcomings of conventional TCP-based file transfer technologies such as FTP and HTTP, and has been recorded achieving speeds hundreds of times faster (Marx 2013). COPO provides an API (with a web-based interface), building on the Aspera fasp technology, for uploading files to a user’s dropbox in ENA. Using this functionality, activated with a single click, the user can monitor the progress of the upload, and metadata about the upload process (e.g. time of upload or process initiator) are recorded and made available to other components of the system. The ISA API is also used within this context to convert metadata to the formats (e.g. XML) supported by ENA before a submission is made. Once a submission is completed, an accession is obtained and maintained within COPO.
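
The sequence of operations COPO automates here can be sketched as follows; the ascp invocation and REST drop-box endpoint reflect ENA’s documented programmatic submission route, while the Webin account, password and file names are placeholders.

```python
# A sketch of the two-step ENA submission flow that COPO automates behind
# a single click. The ascp invocation and drop-box endpoint follow ENA's
# documented programmatic submission route; the Webin account, password
# and file names are placeholders.
import os
import subprocess
import requests

# Step 1: fast upload of the raw data into the user's ENA dropbox via
# Aspera's fasp protocol (password passed via ASPERA_SCP_PASS).
subprocess.run(
    ["ascp", "-QT", "-l300M", "reads.fastq.gz",
     "Webin-00000@webin.ebi.ac.uk:."],
    env={**os.environ, "ASPERA_SCP_PASS": "webin-password"},
    check=True,
)

# Step 2: POST the submission XML documents (in COPO, generated from ISA
# metadata by the ISA converters) that reference the uploaded file.
receipt = requests.post(
    "https://www.ebi.ac.uk/ena/submit/drop-box/submit/",
    auth=("Webin-00000", "webin-password"),
    files={
        "SUBMISSION": open("submission.xml", "rb"),
        "EXPERIMENT": open("experiment.xml", "rb"),
        "RUN": open("run.xml", "rb"),
    },
)
print(receipt.text)  # receipt XML containing the assigned accession(s)
```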

Figshare Submission APIs

Apart from the manuscripts produced in a body of research, other important publishable outputs include posters, images, presentations, movie files and figures. Whilst often overlooked, these assets can prove crucial to the proper understanding and dissemination of research. COPO supports Figshare (www.figshare.com), an online repository specifically for storing and sharing such resources. A Collection of these secondary research objects can be created and added to a Profile along with descriptive metadata tags. Upon submission, COPO uploads these files to Figshare via the Figshare API, using OAuth to handle authentication and authorisation. The returned Figshare accession(s) are then stored in the Profile for later retrieval, so these resources can be accessed easily within the Profile alongside the more conventional outputs produced during a project.
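
A minimal sketch of this deposition step follows, using Figshare’s REST API (v2) with a placeholder OAuth token; the multi-step file upload that follows article creation is elided for brevity.

```python
# A minimal sketch of depositing a secondary research object to Figshare
# through its REST API (v2). The OAuth token is a placeholder obtained via
# Figshare's OAuth flow; the multi-step file upload that follows article
# creation is elided for brevity.
import requests

API = "https://api.figshare.com/v2"
HEADERS = {"Authorization": "token MY_OAUTH_TOKEN"}  # placeholder token

# Create an article record carrying the descriptive metadata tags from
# the COPO Collection.
response = requests.post(
    f"{API}/account/articles",
    headers=HEADERS,
    json={"title": "Conference poster: COPO overview",
          "tags": ["plant science", "metadata"]},
)
response.raise_for_status()

# The returned location acts as the accession COPO stores in the Profile.
print("Figshare accession:", response.json()["location"])
```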

iRODS Submission APIs

As Marx (2013) suggests, there is often no reason to move data outside a remote infrastructure (e.g. the cloud); analysis can be done in place. COPO enables such a continuum by providing seamless integration with heterogeneous storage assets that are abstracted away from individual users’ physical storage. In particular, COPO integrates with the Integrated Rule-Oriented Data System, iRODS (https://irods.org/), an open-source data management package, to offload the burden of “big data” management from the system, thus furthering its service-brokering objectives. iRODS enables data virtualisation, allowing access to distributed storage assets under a unifying namespace, and supports data discovery using a metadata catalogue that describes files, directories and storage resources in the data grid (Rajasekar 2010). In the context of COPO, iRODS will give plant scientists the ability to create virtual data archives in which their data are seamlessly preserved and curated with policy-based rules.

Data files uploaded by an end user to COPO are automatically routed to a connected iRODS instance, removing the need to physically store those objects in COPO. Data downloaded from remote repositories (e.g. ENA) using COPO interfaces can also be held in iRODS. This data workflow may subsequently be exploited, for instance, to perform analysis on the target data in platforms such as Galaxy or iPlant. Importantly, this can be achieved without involving the end user’s computing or storage resources, or requiring large datasets to be moved between multiple data analysis environments. A key part of this iRODS integration is achieved through PyRods (http://code.google.com/p/irodspython), an open-source Python client API for accessing an iRODS server, which enables the management of data objects in iRODS, including registration, metadata attribution and retrieval.
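
As a rough illustration, the sketch below routes a file into iRODS and tags it with metadata using PyRods-style calls; the function names follow the PyRods documentation, but signatures may differ between versions, and the zone, paths and metadata values are placeholders.

```python
# A rough sketch of routing an uploaded file into iRODS using PyRods-style
# calls. Function names follow the PyRods documentation, but signatures
# may differ between versions; the zone, paths and metadata values are
# placeholders.
import irods

# Connect using the local iRODS environment (~/.irods configuration).
status, env = irods.getRodsEnv()
conn, err_msg = irods.rcConnect(env.rodsHost, env.rodsPort,
                                env.rodsUserName, env.rodsZone)
irods.clientLogin(conn)

# Write the data object into the grid under a Profile-specific path.
path = "/copoZone/home/copo/profile123/reads.fastq"
obj = irods.irodsOpen(conn, path, "w")
with open("reads.fastq", "rb") as local_file:
    obj.write(local_file.read())
obj.close()

# Attach user metadata so the iRODS catalogue (iCAT) can index the object
# and COPO can later rediscover it by Profile.
obj = irods.irodsOpen(conn, path, "r")
obj.addUserMetadata("copo_profile", "123")
obj.close()

conn.disconnect()
```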

\label{fig:COPO_ARCHITECTURE} The current COPO architecture, showing the interaction between the user and the various software components.

Conclusions and Future Work

Thus far, we have concentrated on prototype APIs for elements 1 and 2 of the COPO overview; elements 3 and 4 will be addressed later in the project. We have shown that COPO can make the process of depositing multi-faceted research data easier, with working submissions to ENA and Figshare, and in the coming months we will work on including more omics data types, such as metabolomics. By using semantically linked metadata from the outset, rather than as an afterthought, we increase the potential power of the framework throughout its development.

Acknowledgments

COPO is funded by a UK Biotechnology and Biological Sciences Research Council (BBSRC) Bioinformatics and Biological Resources Fund (BBR) grant: BB/L021390/1 [BB/L024055/1, BB/L024071/1, BB/L024101/1].

References

  1. Sean Bechhofer, David De Roure, Matthew Gamble, Carole Goble, Iain Buchan. Research Objects: Towards exchange and reuse of digital knowledge. The Future of the Web for Collaborative Science (2010).

  2. Ewen Callaway. Report finds massive fraud at Dutch universities. Nature 479, 15 (2011).

  3. Jeremy Goecks, Anton Nekrutenko, James Taylor, et al. Galaxy: A comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biology 11, R86 (2010).

  4. Stephen A. Goff, Matthew Vaughn, Sheldon McKay, Eric Lyons, Ann E. Stapleton, Damian Gessler, Naim Matasci, Liya Wang, Matthew Hanlon, Andrew Lenards, Andy Muir, Nirav Merchant, Sonya Lowry, Stephen Mock, Matthew Helmke, Adam Kubach, Martha Narro, Nicole Hopkins, David Micklos, Uwe Hilgert, Michael Gonzales, Chris Jordan, Edwin Skidmore, Rion Dooley, John Cazes, Robert McLay, Zhenyuan Lu, Shiran Pasternak, Lars Koesterke, William H. Piel, Ruth Grene, Christos Noutsos, Karla Gendler, Xin Feng, Chunlao Tang, Monica Lent, Seung-Jin Kim, Kristian Kvilekval, B. S. Manjunath, Val Tannen, Alexandros Stamatakis, Michael Sanderson, Stephen M. Welch, Karen A. Cranston, Pamela Soltis, Doug Soltis, Brian O’Meara, Cecile Ane, Tom Brutnell, Daniel J. Kleibenstein, Jeffery W. White, James Leebens-Mack, Michael J. Donoghue, Edgar P. Spalding, Todd J. Vision, Christopher R. Myers, David Lowenthal, Brian J. Enquist, Brad Boyle, Ali Akoglu, Greg Andrews, Sudha Ram, Doreen Ware, Lincoln Stein, Dan Stanzione. The iPlant Collaborative: Cyberinfrastructure for Plant Biology. Frontiers in Plant Science 2, 34 (2011).

  5. Erika Check Hayden. Technology: The $1,000 genome. Nature 507, 294–295 (2014).

  6. Vivien Marx. Biology: The big challenges of big data. Nature 498, 255–260 (2013).

  7. Arcot Rajasekar, Reagan Moore, Chien-yi Hou, Christopher A. Lee, Richard Marciano, Antoine de Torcy, Michael Wan, Wayne Schroeder, Sheau-Yen Chen, Lucas Gilbert, et al. iRODS Primer: Integrated Rule-Oriented Data System. Synthesis Lectures on Information Concepts, Retrieval, and Services 2, 1–143 (2010).

  8. Philippe Rocca-Serra, Marco Brandizi, Eamonn Maguire, Nataliya Sklyar, Chris Taylor, Kimberly Begley, Dawn Field, Stephen Harris, Winston Hide, Oliver Hofmann, Steffen Neumann, Peter Sterk, Weida Tong, Susanna-Assunta Sansone. ISA software suite: Supporting standards-compliant experimental annotation and enabling curation at the community level. Bioinformatics 26, 2354–2356 (2010).

  9. Science Staff. Dealing with data: Challenges and opportunities. Introduction. Science 331, 692–693 (2011).
