\section{Introduction}

With the advent of high-throughput sequencing technologies, the cost of DNA sequencing has plummeted; indeed, the trend dwarfs even the long-standing and oft-quoted ``Moore's Law'' in its magnitude \cite{Hayden2014}. Increasingly, researchers are realising the benefits of data sharing, and most funding bodies and many journals require the data produced during research to be made publicly available at the time of publication. Apart from its obvious advantages (enabling reproducible science, enhancing understanding of results, allowing data to be pooled for greater accuracy or for comparative studies, and strengthening collaborative ties), data sharing also makes the publication of spurious results less likely, as in the infamous case of Diederik Stapel, thought to have fraudulently published over 30 journal articles \cite{Callaway2011}. This deluge of new data has necessitated the development of services such as the European Nucleotide Archive (ENA, http://www.ebi.ac.uk/ena) and GenBank (http://www.ncbi.nlm.nih.gov/genbank/) for storing such enormous data sets.

However, without careful labelling with appropriate metadata, data repositories risk becoming mere silos of unstructured information. Furthermore, even when appropriate metadata are supplied at deposition, the repositories' search systems are often sub-optimal, making it hard for researchers to find datasets for reuse. In a 2011 survey of approximately 1700 respondents \cite{ScienceStaff2011}, 20\% reported regularly using data sets larger than 100 gigabytes, and 7\% used data sets larger than 1 terabyte. Half of those polled did not use public repositories, choosing instead to store data privately within their organisation's infrastructure; the lack of common metadata standards and archives was cited as a major issue, and most respondents had no funding to support archiving. Submission formats for public repositories are heterogeneous, often requiring the manual authoring of complex markup documents and taking researchers out of their fields of expertise. Finally, when it comes to analysing public datasets, modern high-throughput methods such as next-generation sequencing now produce more data than can easily be stored, let alone downloaded, making cloud-based analysis software highly desirable.

Therefore, in the first instance, COPO will attempt to address these issues by providing responsive and intuitive interfaces for submitting research objects to public repositories and for getting these datasets into suitable analysis platforms. Built on ISATab and ISATools technologies, these interfaces will take the form of both web pages and APIs for programmatic interaction with the system. COPO will also provide sophisticated ontology-powered search capabilities, making use of technologies such as MongoDB (www.mongodb.org), JSON-LD (http://json-ld.org) and Elasticsearch (www.elastic.co).
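To illustrate the kind of structured, ontology-linked metadata record that such an approach enables, the sketch below assembles a small JSON-LD-style annotation for a hypothetical sequencing dataset in Python. The field names, ontology terms and identifiers are illustrative assumptions rather than COPO's actual schema; a document of this form could then be stored in a document database such as MongoDB and indexed for search.

\begin{verbatim}
import json

# A minimal, illustrative JSON-LD-style metadata record for a sequencing
# dataset. The vocabulary terms, identifiers and field names below are
# hypothetical examples, not COPO's actual schema.
dataset_metadata = {
    "@context": {
        "schema": "http://schema.org/",
        "obo": "http://purl.obolibrary.org/obo/",
        "name": "schema:name",
        "description": "schema:description",
        "organism": "obo:OBI_0100026",
    },
    "@id": "http://example.org/datasets/example-0001",  # placeholder identifier
    "@type": "schema:Dataset",
    "name": "Example wheat RNA-Seq experiment",
    "description": "Illustrative record only.",
    "organism": "Triticum aestivum",
}

# Serialising the record as JSON makes it directly storable in a
# document database and indexable by a search engine.
print(json.dumps(dataset_metadata, indent=2))
\end{verbatim}

Because the ontology terms are carried in the @context, a search layer can resolve free-text queries against controlled vocabulary identifiers rather than relying on exact string matches in the deposited metadata.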