University of Manchester – School of Computer Science

Reproducible data management with Maven

Stian Soiland-Reyes

June 04, 2015

Use case: pharmacological reference data. Open PHACTS combines pharmacological data in a single Linked Data cache and provides RESTful APIs to query and examine the data through a uniform interface. Data is loaded from several data providers such as EBI and UniProt; some of it is loaded directly as RDF, while other datasets are translated to RDF from more specific data formats. A series of VoID linksets is also loaded to provide identity equivalence across the different RDF graphs. These linksets are derived from existing references in the previously loaded data sources, but can also be found computationally (e.g. by comparing chemical structures or protein sequences). Building Open PHACTS exposed the challenge of keeping all these data sources up to date, particularly when we changed our architecture to allow a Docker-based installation of the Open PHACTS platform and data at customer sites. The Open PHACTS platform is a service-oriented architecture with components such as a MySQL database, a Virtuoso RDF store, memcache, PHP, Tomcat and ElasticSearch. We found Docker to be a great tool for managing and deploying these components in isolation; combined with Docker Compose, it provides an easy way to link them together into a uniform and installable platform. The Dockerfile mechanism provides a reproducible way to build Docker images, which can be created automatically by the Docker Hub, typically based on a GitHub repository. Changes pushed to the GitHub repository cause a new Docker image to be built. Instructions in the Dockerfile consist of commands like ADD and RUN. The challenge in setting up the Open PHACTS platform is the data loading. Docker images are stored as a series of differential file system layers, which can be pushed to and pulled from registries like the Docker Hub and third-party installations of the Docker Registry.
This mechanism works well for typical Docker applications, where each layer is on the order of 100 MB and the full Docker image on the order of 1 GB. We found that it breaks down when used with the Open PHACTS data, which can be on the order of 100 GB as uncompressed RDF.

Apache Taverna Language: Semantic and flexible workflow definitions

Stian Soiland-Reyes

and 4 more

September 22, 2014

Authors: Stian Soiland-Reyes (1,2), David Withers, Alan R Williams (1,2), Donal Fellows (1,2), Matthew Gamble (2), Carole Goble (2). Affiliations: (1) Apache Software Foundation; (2) University of Manchester. This article describes the workflow language of Apache Taverna, focusing in particular on _SCUFL2_ and the abstract semantic workflow model _wfdesc_. The SCUFL2 API allows construction and inspection of Taverna 3 workflows from independent applications, and also enables translation to and from different abstract and concrete third-party workflow formats (SHIWA IWIR, MG-RAST AWE) and the Common Workflow Language. This includes the general semantic model wfdesc, which we created within the Wf4Ever project for the purpose of workflow preservation and annotation. wfdesc is easily combined with W3C PROV-based workflow provenance, and is also used by the digital preservation project SCAPE to find and compose semantically described Workflow Components from the myExperiment workflow repository.

Bundling linked research materials

Stian Soiland-Reyes

April 29, 2013

Stian Soiland-Reyes, Kevin Page?, Khalid Belhajjame, Jun Zhao?, David De Roure?, Carole Goble. Describing http://purl.org/wf4ever/ro-bundle and https://github.com/wf4ever/robundle. Abstract: Research Objects (RO) have been suggested as a mechanism to preserve digital research materials and their relationships and annotations. This is a key factor in improving the reproducibility of modern science, which depends significantly on computational analysis and processing [ref?]. The support for the W3C Research Object for Scholarly Communication Community Group highlights the need for rolling out the concept of research objects across scientific publications. While the existing Research Object model allows the creation of research objects using Linked Data and RESTful web services, a considerable amount of scientific software does not deal directly with the web. Issues such as minting URIs or publishing Linked Data therefore become troublesome [ref?], as this kind of software typically stores methods and results as files on a local or distributed file system. Although the Linked Data approach to building a Research Object uses URI references to relate the aggregated resources, scientists still mentally distinguish between “local” or “composed” resources and “external” or “referenced” resources. While support for managing composed collections has been proposed in the Linked Data Platform, there are still issues relating to the distribution and cloning of such collections. In this paper we present the Research Object Bundle, a ZIP-based media format that formalizes how to create a single file that bundles both the RO descriptions and annotations and the files that the scientists want to distribute embedded with the research object. The RO bundle format forms the basis for specifying application-specific bundles, and we explore how the scientific workflow system has implemented bundles for distributing complex data values and complete workflow run provenance.
We then examine how a different workflow system, GridSpace, can use RO bundles to distribute snapshots of workflow runs between installations. To improve uptake, the RO bundle format uses a JSON-LD context to describe the manifest. This means that we do not require the developer to understand Linked Data concepts or how to mint URIs: knowledge of JSON and a basic understanding of the Research Object model (such as aggregations, annotations and provenance), together with knowing how to create a ZIP file, is sufficient to create an RO bundle. RO bundles, as files, are easily distributed, for instance as email attachments, on institutional file servers or published on the web. This raises a challenge with respect to the identity of the research object and its evolution: if two people publish the same RO bundle at two different locations, do those then represent the same RO? What if one of the ZIP files is updated with an additional resource? We resolve this issue by simply declaring any RO bundle to be an independent RO snapshot, which is itself unidentified (beyond how it is accessed). Within the RO bundle we relate resources using relative URI references, but can also optionally include an RO evolution trace in which the Live RO that the bundle was created from can be identified. Integrating RO bundles into the existing Research Object Linked Data cloud can be achieved simply by unzipping the bundle and processing the manifest using the linked JSON-LD context. We demonstrate how RO bundles from the examined applications have been integrated into the RO frameworks used by the myExperiment site.
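
The claim that JSON and ZIP skills are enough to produce a bundle can be illustrated with a minimal sketch in Python. It is not the authors' implementation. The media type string, the manifest location `.ro/manifest.json`, the context URL and the `id`/`aggregates`/`uri` keys are assumptions based on the published RO bundle specification and should be checked against it. The function names `write_bundle` and `read_manifest` are hypothetical.

```python
import json
import zipfile

# Assumed values; verify against the RO bundle specification before relying on them.
MIMETYPE = "application/vnd.wf4ever.robundle+zip"
MANIFEST_PATH = ".ro/manifest.json"
CONTEXT = "https://w3id.org/bundle/context"


def write_bundle(path, files):
    """Write a ZIP file holding a mimetype entry, the given files and a JSON manifest.

    `files` maps relative paths inside the bundle to their content as bytes.
    """
    manifest = {
        "@context": [CONTEXT],
        # The bundle root; resources are referenced relative to it.
        "id": "/",
        "aggregates": [{"uri": "/" + name} for name in sorted(files)],
    }
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as bundle:
        # Written first and uncompressed so the media type can be sniffed.
        bundle.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        for name in sorted(files):
            bundle.writestr(name, files[name])
        bundle.writestr(MANIFEST_PATH, json.dumps(manifest, indent=2))


def read_manifest(path):
    """Return the manifest of a bundle as a plain Python dictionary."""
    with zipfile.ZipFile(path) as bundle:
        return json.loads(bundle.read(MANIFEST_PATH).decode("utf-8"))
```

Calling `write_bundle("example.bundle.zip", {"results/output.txt": b"42\n"})` and then `read_manifest("example.bundle.zip")` returns the manifest as ordinary JSON data, with no Linked Data tooling involved. A consumer that does want Linked Data could pass the same manifest to a JSON-LD processor.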