Authorea

Anthony Etuk edited section_COPO_Submission_APIs_COPO__.tex almost 9 years ago

Commit id: 9555d6b50988508b6285af6ea4baca28587012cf


\section{COPO Submission APIs}
COPO builds on a rich set of APIs to facilitate the submission of research objects (e.g. raw sequencing data) to disparate data stores and repositories. These APIs are managed transparently within COPO to lift the burden of data deposition or transfer from the user. The issues that COPO’s submission APIs attempt to address include conversions between different formats (e.g. tab-delimited formats and XML \cite{Rocca-Serra2010}), “big data” transfer issues (e.g. delay overheads, data integrity and privacy), and support for reproducing a piece of research or performing analysis on deposited data. Figure \ref{figure1} highlights the interaction of COPO with existing data infrastructure, whose interoperability is made possible through the use of APIs. In what follows, we discuss in more detail the APIs that COPO uses for submissions to the repositories and data stores shown in Figure \ref{figure1}.

\subsection{ENA Submission APIs}
The European Nucleotide Archive (ENA) is an established repository for nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation (http://www.ebi.ac.uk/ena). The deposition of nucleotide sequence data in ENA is currently regarded as a mandatory step in the dissemination of research outputs to the scientific community \cite{?}. In the ENA data submission workflow, data files must be uploaded before they can be submitted. ENA supports several software components that assist users with this upload. Among them is the fasp transfer technology, which ENA especially recommends for long-distance file transfers.
The fasp transfer technology, developed by Aspera (http://asperasoft.com), is designed to overcome the fundamental shortcomings of conventional TCP-based file transfer protocols such as FTP and HTTP. Data transfer with fasp is reported to achieve speeds hundreds of times faster than FTP/HTTP \cite{Marx_2013}. COPO provides an API (with a web-based interface), built on the Aspera fasp transfer technology, for uploading files to a user’s dropbox in ENA. With this functionality, which is activated with a single click, the user can monitor the progress of an upload. Metadata about the upload process (e.g. time of upload or process initiator) can also be recorded and made available to other components of the system. The ISA API is used in this context to convert data to formats (e.g. XML) supported by ENA before a submission is made. Once a submission enabled by this array of APIs is completed, an accession is obtained and maintained within COPO.
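
To make the format-conversion step concrete, the following is a minimal sketch of turning tab-delimited metadata into XML using only the Python standard library. It is not the ISA API and does not follow the ENA schema: the function name \texttt{tsv\_to\_xml}, the tag names and the example columns are all assumptions chosen for illustration, and column names are assumed to be valid XML element names.

```python
import csv
import io
import xml.etree.ElementTree as ET


def tsv_to_xml(tsv_text, root_tag="SAMPLE_SET", row_tag="SAMPLE"):
    """Convert tab-delimited text (first row = column names) to an XML string.

    The tag names are illustrative placeholders, not the ENA schema.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    root = ET.Element(root_tag)
    for row in reader:
        record = ET.SubElement(root, row_tag)
        for column, value in row.items():
            # Each column becomes a child element holding the cell value.
            field = ET.SubElement(record, column)
            field.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical two-sample table used only to exercise the function.
example = "alias\ttaxon_id\nsample_1\t3702\nsample_2\t4577\n"
xml_text = tsv_to_xml(example)
print(xml_text)
```

A production conversion would additionally validate the result against the target schema before submission; that step is omitted here.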

...
\subsection{iRODS Submission APIs}
As \cite{Marx_2013} suggests, there is no reason to move data out of a remote infrastructure (e.g. the cloud); analysis can be done right there. COPO enables such a continuum by integrating seamlessly with heterogeneous storage assets that are abstracted away from the physical storage of individual users or organisations. In particular, COPO integrates with iRODS, the Integrated Rule-Oriented Data System (https://irods.org/), an open source data management package, to offload the burden of “big data” management from the system, thus enhancing its service-brokering objectives. iRODS enables data virtualisation and allows access to distributed storage assets under a unifying namespace. It also enables data discovery through a metadata catalog that describes files, directories, and storage resources in the data grid \cite{rajasekar2010irods}. In the context of COPO, iRODS will give plant scientists the ability to create virtual data archives, where their data are seamlessly preserved and curated with policy-based rules. Data files uploaded to COPO are automatically routed to a connected iRODS instance, which removes the need to physically store those objects in COPO. A key part of this integration is achieved through PyRods (http://code.google.com/p/irodspython), an open source Python client API for accessing an iRODS server. The PyRods micro-service enables the management of data objects in iRODS, including registering data objects, attaching metadata to them, and retrieving them.
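
The three operations named above (registering, metadata attribution and retrieval) can be illustrated with a small in-memory model. This is a toy sketch, not the iRODS or PyRods API: the class \texttt{ToyCatalog}, its methods, the zone name and the storage locations are all hypothetical. It assumes only that metadata is attached to a logical path as attribute-value-unit triples, and that a logical path resolves to a physical location held elsewhere.

```python
from collections import defaultdict


class ToyCatalog:
    """In-memory stand-in for a metadata catalog; not the iRODS or PyRods API."""

    def __init__(self):
        # Logical path -> physical location (the unifying namespace).
        self._objects = {}
        # Logical path -> set of (attribute, value, unit) triples.
        self._metadata = defaultdict(set)

    def register(self, logical_path, physical_location):
        """Record where the object behind a logical path is stored."""
        self._objects[logical_path] = physical_location

    def add_metadata(self, logical_path, attribute, value, unit=""):
        """Attach one metadata triple to an already registered object."""
        if logical_path not in self._objects:
            raise KeyError(f"unregistered object: {logical_path}")
        self._metadata[logical_path].add((attribute, value, unit))

    def find(self, attribute, value):
        """Return the sorted logical paths carrying the given attribute/value."""
        return sorted(
            path
            for path, triples in self._metadata.items()
            if any(a == attribute and v == value for a, v, _ in triples)
        )

    def locate(self, logical_path):
        """Resolve a logical path to its physical location."""
        return self._objects[logical_path]


# Hypothetical paths and storage locations, for illustration only.
catalog = ToyCatalog()
reads_1 = "/copoZone/home/alice/reads_1.fastq.gz"
reads_2 = "/copoZone/home/alice/reads_2.fastq.gz"
catalog.register(reads_1, "storage-a:/vault/0001")
catalog.register(reads_2, "storage-b:/vault/0042")
catalog.add_metadata(reads_1, "organism", "Arabidopsis thaliana")
catalog.add_metadata(reads_2, "organism", "Zea mays")

matches = catalog.find("organism", "Arabidopsis thaliana")
print(matches, catalog.locate(matches[0]))
```

The point of the sketch is the separation of concerns: discovery works on metadata alone, and the physical location is consulted only when an object is actually retrieved.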