COPO - Bridging the Gap from Data to Publication in Plant Science
We present Collaborative Open Plant Omics (COPO), a brokering service between plant scientists and public repositories which enables aggregation and publication of research outputs, as well as providing easy access to services comprising disparate sources of information via web interfaces and Application Programming Interfaces (APIs). Users will be able to deposit their data and view/aggregate open access data from a variety of sources, as well as seamlessly pulling these data into suitable analysis environments such as Galaxy (Goecks 2010) or iPlant (Goff 2011) and subsequently tracking the outputs and their metadata in COPO. COPO streamlines the process of data deposition to public repositories by hiding much of the complexity of metadata capture and data management from the end-user. The ISA infrastructure (Rocca-Serra 2010) is leveraged to provide the interoperability between metadata formats required for seamless deposition to public repositories and to facilitate links to data analysis platforms. Aggregated metadata are stored as Research Objects (Bechhofer 2010), with logical groupings of Research Objects relating to a body of work being represented in a common standard, and are publicly queryable. COPO therefore generates and facilitates access to a large network of ontologically related metadata, which will develop over time to allow for intelligent inference over open access Linked Data fragments, providing user-customised suggestions for future avenues of investigation and potential analyses.
With the advent of high-throughput sequencing technologies, the cost of DNA sequencing has plummeted. Indeed, the trend dwarfs even the long-standing and oft-quoted “Moore’s Law” in its magnitude (Hayden 2014). Increasingly, researchers are realising the benefits of data sharing, and most funding bodies and many journals require any and all data produced during research to be made publicly available at the time of publication. Apart from the obvious advantages of sharing data (reproducible science, enhanced understanding of results, pooling of data for the sake of accuracy or comparative studies, the strengthening of collaborative ties, and so on), it also makes publication of spurious results (such as the infamous case of Diederik Stapel, thought to have fraudulently published over 30 journal articles (Callaway 2011)) less likely. This deluge of new data has necessitated the development of services such as the European Nucleotide Archive (ENA, http://www.ebi.ac.uk/ena) and GenBank (http://www.ncbi.nlm.nih.gov/genbank/) for storing such enormous data sets. However, without careful labelling with appropriate metadata, data repositories risk becoming simply silos of unstructured information. Furthermore, even when appropriate metadata are supplied at deposition, search systems within the repositories are often sub-optimal, making it hard for researchers to find datasets to reuse. In a 2011 survey (Science Staff 2011) of approximately 1700 respondents, around 20% reported that they regularly used data sets greater than 100 gigabytes, and around 7% used data sets greater than 1 terabyte. Half of those polled did not use public repositories, choosing instead to store data privately in their organisation’s infrastructure. Lack of common metadata and archives was cited as a major issue, and most of the respondents had no funding to support archiving.
Submission formats to public repositories are heterogeneous, often requiring manual authoring of complex markup documents, which takes researchers out of their fields of expertise. Finally, when considering analysis of public datasets, modern high-throughput methods such as next generation sequencing are now producing more data than can be easily stored, let alone downloaded, making cloud-based analysis software highly desirable. Therefore, in the first instance, COPO will attempt to solve these issues by providing responsive and intuitive interfaces for submitting research objects to public repositories, and for getting these datasets into suitable analysis platforms. Built on ISATab and ISATools technologies, these interfaces will take the form of both web pages and APIs for programmatic interaction with the system. COPO will provide sophisticated ontology-powered search capabilities, making use of cutting-edge technologies such as MongoDB (www.mongodb.org), JSON-LD (http://json-ld.org) and Elasticsearch (www.elastic.co).
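As a rough illustration of the kind of linked metadata record that a search layer like the one described above could index, the sketch below builds a small JSON-LD document using only the Python standard library. The schema.org vocabulary, the example.org identifier, the field values and both function names are assumptions made for illustration; none of them is taken from COPO itself, and the keyword match is a stand-in for a real search engine.

```python
import json


def make_dataset_record(identifier, name, description, keywords):
    """Return a JSON-LD document describing a dataset with schema.org terms.

    All fields are illustrative; this is not COPO's actual schema.
    """
    return {
        # @vocab makes every plain key below resolve against schema.org.
        "@context": {"@vocab": "http://schema.org/"},
        "@id": identifier,
        "@type": "Dataset",
        "name": name,
        "description": description,
        "keywords": list(keywords),
    }


def matches(record, term):
    """Naive case-insensitive search over name, description and keywords."""
    haystack = [record["name"], record["description"], *record["keywords"]]
    return any(term.lower() in text.lower() for text in haystack)


# Hypothetical record; the identifier and values are placeholders.
record = make_dataset_record(
    "http://example.org/datasets/1",
    "Wheat leaf RNA-seq",
    "Illustrative placeholder record",
    ["wheat", "transcriptomics"],
)
print(json.dumps(record, indent=2))
```

Because the record is plain JSON, it could in principle be stored in a document database and indexed for search without conversion, which is one plausible reason to pair JSON-LD with the technologies named above.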
Scientists value effective services and dislike context switching, which wastes time. This is coupled with large volumes of research data and many ways of describing those data, if they are described at all. Consistent and intuitive approaches are needed to show researchers the benefits of structured data deposition.
Many data services exist to store, retrieve and analyse data. Some work has already been undertaken to connect data repositories to analysis platforms, e.g. ENA to Galaxy/iPlant functionality. Cloud layers that offer Infrastructure-as-a-Service are too technical for typical plant science end users, and getting data into these platforms is unwieldy. Those that offer Platform- or Software-as-a-Service are the ones we wish to target for integration into a greater whole.
Taken from grant - refactor.... “In discussion with plant researchers in academia and industry it is clear that there is a clear need for a UK-focused bioinformatics resource that draws on international expertise such as that of the iPlant Collaborative; helps to bring together species-based projects; reduces duplication of effort and prevents wasteful reinvention of tools and standards; exploits and builds existing and widely used resources and standards. Plant research faces bioinformatics challenges on several fronts. Only a few communities are sufficiently large, e.g. Arabidopsis and wheat, to have been funded at sufficient levels to produce data repositories and analysis tools. These are often bespoke for the community that they support and can be metadata-unaware, resulting in operational incompatibilities. Ploidy and cross-based genetic data are not common in the mammalian field and thus represent a unique integrative challenge for the plant sciences. Furthermore the lack, or absence, of relevant standards can hamper data sharing and reuse. Across the plant sciences data of different types are being generated, analysed and shared on a daily basis using a variety of tools, terminologies and formats. Without a common platform to encourage users to describe and tag their data in an agreed manner and using the appropriate standards, it is difficult to uncover and reuse sequence data and, in the case of proteomics and metabolomics, it can often make comparison between datasets impossible. These and other barriers relating to data sharing and re-use have recently been highlighted in a paper by Leonelli et al., which arose from discussions at a GARNet/EGENIS community workshop that brought together plant researchers working in a range of omics areas with publishers and funders to discuss current barriers and bottlenecks to data access, curation, storage and discoverability.
COPO will provide a single user-friendly platform to allow users to store, annotate, analyse and publish data and will therefore help to overcome some of the current issues and concerns raised by the plant science community.”
What does ISA provide that makes the project feasible in terms of metadata interoperability? Data integration is hard because things aren’t described well enough to link them up. ISA makes applying descriptions easy and standardised.
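To make the point about standardised description concrete, the sketch below models the three nested levels that give ISA its name (Investigation, Study, Assay) as plain Python dataclasses. The field names are deliberately simplified and hypothetical; they are not the ISA-Tab specification or the API of any ISA software, and the example values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    """One measurement applied to study material (simplified, hypothetical fields)."""
    measurement_type: str
    technology_type: str
    data_files: List[str] = field(default_factory=list)


@dataclass
class Study:
    """A unit of research grouping one or more assays."""
    identifier: str
    title: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    """Top-level container for a body of work."""
    identifier: str
    title: str
    studies: List[Study] = field(default_factory=list)

    def all_data_files(self) -> List[str]:
        """Collect every data file across all studies and assays, in order."""
        return [
            path
            for study in self.studies
            for assay in study.assays
            for path in assay.data_files
        ]


# Placeholder example: one investigation, one study, two assays.
investigation = Investigation(
    identifier="INV-1",
    title="Example investigation",
    studies=[
        Study(
            identifier="STU-1",
            title="Example study",
            assays=[
                Assay("transcription profiling", "nucleotide sequencing", ["reads_1.fastq"]),
                Assay("metabolite profiling", "mass spectrometry", ["run_1.mzML"]),
            ],
        )
    ],
)
print(investigation.all_data_files())
```

Because every dataset is described with the same three levels, files from different assay types can be located and linked through one common structure rather than through a separate convention for each data type.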
Effectively managed data flow, with metadata tracking at all stages, is needed to lay the foundations for subsequent reproducibility/recomputability.
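One way such stage-by-stage tracking might look is sketched below: each processing step appends a record holding checksums of its input and output, so that the chain from raw data to final result can be verified later. The stage names, the record fields and the use of SHA-256 are all assumptions chosen for illustration and do not describe COPO's implementation.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, List


def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class StageRecord:
    """Metadata captured for one processing stage (hypothetical fields)."""
    stage: str
    input_sha256: str
    output_sha256: str


def run_stage(name: str, func: Callable[[bytes], bytes],
              data: bytes, log: List[StageRecord]) -> bytes:
    """Apply func to data and append a provenance record to log."""
    result = func(data)
    log.append(StageRecord(name, checksum(data), checksum(result)))
    return result


def chain_is_consistent(log: List[StageRecord]) -> bool:
    """True if every stage consumed exactly what the previous stage produced."""
    return all(prev.output_sha256 == nxt.input_sha256
               for prev, nxt in zip(log, log[1:]))


# Toy two-stage pipeline over placeholder data.
log: List[StageRecord] = []
raw = b"  acgt\nacgt\n"
trimmed = run_stage("trim", bytes.strip, raw, log)
final = run_stage("uppercase", bytes.upper, trimmed, log)
print(chain_is_consistent(log))
```

Storing the log alongside the deposited data would let a later user confirm that the files they hold are the ones each stage actually produced before attempting to recompute a result.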
COPO user interfaces
New user interfaces are required to facilitate interaction between the user and the ISA software suite for metadata annot