\section{Building the framework}

Instead of reinventing a number of wheels, COPO builds on existing services and APIs to allow the aggregation of research objects into a broader body of work. These research objects may include source code, sequence data, PDF documents, images, movie files, presentations, and so on. Collections of research objects are bundled together into a COPO Profile, providing permanent links between related but heterogeneous sources of information which would otherwise become disconnected once these objects leave the researcher's computer.

Figure \ref{fig:COPO_ARCHITECTURE} shows a high-level schematic view of COPO. Two pathways exist through the system. The first, for anonymous users, allows for querying only. Users can supply keywords to be searched in COPO's metadata catalogue, and the supplied search terms can be augmented with semantically derived results. For instance, utilising the Plant Ontology [REF] and Crop Ontology [REF] would allow COPO to understand what a gene is and its contextual place in terms of related genes, the species in which the gene is observed, observed mutations, and so on (a sketch of such query expansion is given below). These connections and inferences allow for a much richer set of results to be returned to the user than a simple text search would provide. When the user clicks on one of these results, they are taken to the COPO Profile in which the relevant research objects reside. Here they would see, for example, published manuscripts, original and processed sequence data, links to source code repositories, and analysis platform workflows. All of the information and material would therefore be available to the user, and the experiments described should become far more reproducible.

The second pathway requires the user to log in. By using technologies such as ORCiD \url{http://orcid.org/}, Twitter \url{https://twitter.com/} and OAuth \url{http://oauth.net/}, a user will be able to log in to COPO using existing credentials, negating the need to create a new account and offloading some of the burden of security onto existing and trusted industry-standard methods.
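To make the query expansion step concrete, the following minimal sketch retrieves the labels of matching ontology terms from the EBI Ontology Lookup Service (OLS). The endpoint and response fields follow OLS's public search API, but the choice of ontology and the expansion strategy shown here are illustrative assumptions rather than COPO's actual implementation.

\begin{verbatim}
# Sketch: expand a user's search term with related ontology term
# labels via the EBI Ontology Lookup Service (OLS) search API.
# The expansion strategy is illustrative, not COPO's actual one.
import requests

OLS_SEARCH = "https://www.ebi.ac.uk/ols/api/search"

def expand_term(term, ontology="po"):
    """Return the original term plus labels of matching terms."""
    resp = requests.get(OLS_SEARCH,
                        params={"q": term, "ontology": ontology},
                        timeout=10)
    resp.raise_for_status()
    docs = resp.json().get("response", {}).get("docs", [])
    return {term} | {doc["label"] for doc in docs if "label" in doc}

# Hypothetical usage: expand_term("leaf") might return
# {"leaf", "vascular leaf", "leaf lamina", ...}, giving the
# metadata catalogue a richer set of terms to match against.
\end{verbatim}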
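The login step itself follows the standard OAuth 2.0 authorization-code flow. As an illustration, the final token exchange against ORCiD's public OAuth endpoint might look as follows; the client credentials and redirect URI are placeholders for values issued by ORCiD, and COPO's production flow may differ in scopes and error handling.

\begin{verbatim}
# Sketch: final step of an OAuth 2.0 authorization-code login
# against ORCiD. CLIENT_ID, CLIENT_SECRET and REDIRECT_URI are
# placeholders for credentials registered with ORCiD.
import requests

TOKEN_URL = "https://orcid.org/oauth/token"
CLIENT_ID = "APP-XXXXXXXXXXXXXXXX"                  # placeholder
CLIENT_SECRET = "xxxxxxxx"                          # placeholder
REDIRECT_URI = "https://copo.example.org/callback"  # placeholder

def exchange_code_for_token(auth_code):
    """Swap an authorization code for a token and ORCID iD."""
    resp = requests.post(TOKEN_URL,
                         data={"client_id": CLIENT_ID,
                               "client_secret": CLIENT_SECRET,
                               "grant_type": "authorization_code",
                               "code": auth_code,
                               "redirect_uri": REDIRECT_URI},
                         headers={"Accept": "application/json"},
                         timeout=10)
    resp.raise_for_status()
    token = resp.json()
    return token["access_token"], token["orcid"]
\end{verbatim}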
Additionally, by integrating with services like ORCiD, COPO will be able to federate a researcher's existing information into their COPO Profile, such as professional contact details, previous publications and collaborators.

Once logged in, all of the query capability described above will be available, as well as the ability to create COPO Profiles. A Profile can be thought of as the digital location of a complete body of research, fully attributed to one or more researchers. Responsive web forms allow the Profile to be properly labelled with metadata relating to creators, contributors, institutions, subject, sample, methodology, and so on. Collections of objects can then be added to a Profile. Such Collections are delimited by file type or function: one Collection may contain all the sequence data associated with a study, another may contain a number of published manuscripts, and yet another might contain references to a number of source code repositories or analysis workflows. When creating these Collections, essential metadata can be attached which properly describes the objects therein. By taking this metadata and integrating it with existing ontologies, COPO not only indexes the research objects passing through, but semantically enriches them. They are no longer simply collections of unstructured, unrelated data, but entities described in terms of their similarity to other existing objects. It then becomes possible to make inferences about the kinds of things a researcher might be interested in, based on the samples, studies, manuscripts, source code, file types, abstracts, methodologies, references, and institutions which reside within researchers' research objects.

Since COPO is a brokering service, the raw data within a Profile is not physically stored on its servers for extended periods, as it would be in an archival service. Rather, once a collection of research objects has been uploaded to and labelled within a COPO Profile, the objects are seamlessly deposited into the relevant public repositories. Such repositories return unique accessions which are then used to identify the deposited data files. These accessions are stored within the COPO Profile alongside the user-supplied metadata (a sketch of this brokering flow is given below). If the data files are subsequently needed as part of a user query, they can easily be downloaded again from the repository.

COPO's efficient and intuitive web interfaces thus allow for the input of important metadata, and by minting DOIs (Digital Object Identifiers) that identify COPO Profiles, these metadata can be published as persistent first-class entities on the Internet. A resolution service, such as dx.doi.org, will direct users back to the COPO Profile identified by the DOI. Therefore, whether the DOI appears alongside a dataset, a paper or a code repository, the user can see all related research objects in a single view without having to search repositories and websites individually. Since DOIs are persistent identifiers, the system will alleviate the problem of "link decay", which occurs when resources referenced by a URL become permanently unavailable. The mapping of DOIs to research objects additionally allows the usage of these objects to be directly tracked and referenced, enabling researchers to be properly recognised and credited for all the outputs they produce, not just via the typical route of publishing a paper, which is not a comprehensive, truly digitally accessible representation of a whole body of work.
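To make the resolution step concrete, the following sketch simply follows the redirects issued by the dx.doi.org resolver to recover the landing page for a given DOI; the DOI shown is a placeholder rather than a real COPO Profile identifier.

\begin{verbatim}
# Sketch: resolving a DOI through the dx.doi.org resolution
# service. The example DOI is a placeholder, not a real one.
import requests

def resolve_doi(doi):
    """Follow the resolver's redirects; return the landing URL."""
    resp = requests.get("https://dx.doi.org/" + doi,
                        allow_redirects=True, timeout=10)
    resp.raise_for_status()
    return resp.url  # e.g. the URL of the COPO Profile page

# resolve_doi("10.1234/copo.profile.5678")   # placeholder DOI
\end{verbatim}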
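Returning to the brokering flow referenced above, the following hypothetical sketch shows how a Profile might bundle Collections of research objects and record the accessions returned by repositories upon deposition. All field names, and the \texttt{deposit} callback, are illustrative assumptions rather than COPO's actual data model.

\begin{verbatim}
# Hypothetical sketch of a COPO Profile bundling Collections and
# recording repository accessions after brokered deposition. The
# structure and names are illustrative only.
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class ResearchObject:
    path: str                        # local file or repository URL
    accession: Optional[str] = None  # filled in after deposition

@dataclass
class Collection:
    kind: str                        # e.g. "sequence-data"
    objects: List[ResearchObject] = field(default_factory=list)

@dataclass
class Profile:
    title: str
    creators: List[str]
    collections: List[Collection] = field(default_factory=list)

def broker_deposit(profile: Profile,
                   deposit: Callable[[str, str], str]) -> Profile:
    """Deposit every object and record the returned accessions.

    `deposit` stands in for a repository-specific submission call
    (e.g. to a public sequence archive) returning an accession.
    """
    for collection in profile.collections:
        for obj in collection.objects:
            obj.accession = deposit(collection.kind, obj.path)
    return profile
\end{verbatim}

Under this sketch, each deposited object contributes both an accession and its descriptive metadata to the Profile, which is what allows the metadata graph described next to accumulate.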
Since depositing objects through COPO naturally builds a large interconnected graph of metadata, ontological inferences can be made and used to suggest, for example, further literature searches, experimental procedures or comparable datasets.

This seamless interaction of deposition, metadata labelling, semantically-enabled searching and data attribution represents a novel way of gluing together existing services to greatly enhance what is currently available to plant scientists, thereby enabling them to find more relevant information more quickly. Most importantly, it enables them to deposit their research outputs into the public domain with little effort and to be credited for doing so.

\section{COPO Submission APIs}

COPO builds on a rich set of APIs to facilitate the submission of research objects (e.g. raw sequencing data) to disparate data stores and repositories. These APIs are managed transparently within COPO to lift the burden of data deposition or transfer away from the user of the system. The issues COPO's submission APIs attempt to address include, but are not limited to, conversion between different formats (e.g. tab-delimited formats and XML, \cite{Rocca-Serra2010}), "big-data" transfer issues (e.g. delay overheads, data integrity and privacy), and reproducing a piece of research or performing analysis on deposited data. In Figure \ref{figure1}, we highlighted the interaction of COPO with existing data infrastructure, the interoperability of which is made possible through the use of APIs. In what follows, we provide more specific discussions of the different APIs enabled by COPO for submissions to the different repositories and data stores captured in Figure \ref{figure1}.
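As a first illustration of these issues, consider format conversion. A minimal sketch of mapping tab-delimited sample metadata onto an XML representation is shown below; the element names are generic placeholders rather than any repository's actual submission schema.

\begin{verbatim}
# Sketch: converting tab-delimited sample metadata to XML, one of
# the format conversions a submission API must perform. Element
# names are placeholders, not a real repository schema.
import csv
import xml.etree.ElementTree as ET

def tsv_to_xml(tsv_path, root_tag="SAMPLE_SET", row_tag="SAMPLE"):
    """Map each row of a tab-delimited file onto an XML element."""
    root = ET.Element(root_tag)
    with open(tsv_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            sample = ET.SubElement(root, row_tag)
            for column, value in row.items():
                # Column headers become element names; spaces are
                # not legal in XML names, so replace conservatively.
                tag = column.strip().replace(" ", "_")
                ET.SubElement(sample, tag).text = value
    return ET.tostring(root, encoding="unicode")
\end{verbatim}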
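Data integrity during large transfers can be illustrated in a similar way. The sketch below computes a file digest in fixed-size chunks, so that arbitrarily large files can be verified without exhausting memory; a broker can compare the result against the checksum reported by the receiving repository.

\begin{verbatim}
# Sketch: verifying data integrity after a large file transfer by
# comparing checksums, computed in chunks to keep memory use flat.
import hashlib

def file_checksum(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
\end{verbatim}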