Reproducible data management with Maven

A picture says less than 100 words

Use case: pharmacological reference data

Open PHACTS combines pharmacological data in a single Linked Data cache, and provides RESTful APIs to query and examine the data through a uniform interface. Data is loaded from several data providers, such as EBI and UniProt; some of it is loaded directly as RDF, while other datasets are translated to RDF from more specialized data formats. A series of VoID linksets are also loaded to provide identity equivalence across the different RDF graphs. These linksets are derived from existing cross-references in the previously loaded data sources, but can also be computed (e.g. by comparing chemical structures or protein sequences).

Building Open PHACTS posed a challenge: how to keep all these data sources up to date, particularly when we changed our architecture to allow a Docker-based installation of the Open PHACTS platform and its data at customer sites.

The Open PHACTS platform is a service-oriented architecture with components such as a MySQL database, a Virtuoso RDF store, memcached, PHP, Tomcat and ElasticSearch. We found Docker to be a great tool for managing and deploying these components in isolation, and, combined with Docker Compose, it provides an easy way to link them together into a uniform, installable platform.
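As a sketch of how Docker Compose links such components, a hypothetical docker-compose.yml might look as follows; the service names and image tags are illustrative, not the actual Open PHACTS configuration:

```yaml
# Hypothetical sketch: service names and images are illustrative only.
version: "2"
services:
  mysql:
    image: mysql:5.7
    environment:
      MYSQL_ROOT_PASSWORD: example
  virtuoso:
    image: openlink/virtuoso-opensource-7
  memcached:
    image: memcached:alpine
  api:
    image: tomcat:8
    # Compose starts the linked services first and puts them
    # on a shared network, reachable by service name.
    depends_on:
      - mysql
      - virtuoso
      - memcached
```

Each service runs in its own container, so components can be upgraded or replaced in isolation while the Compose file documents the platform as a whole.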

The Dockerfile mechanism provides a reproducible way to build Docker images, which can be built automatically by the Docker Hub, typically from a GitHub repository: changes pushed to the repository trigger a new image build. A Dockerfile consists of instructions like ADD and RUN.
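A minimal Dockerfile sketch illustrating these instructions; the base image and file names are invented for the example, not taken from the Open PHACTS build:

```dockerfile
# Hypothetical example: base image and paths are illustrative only.
FROM tomcat:8-jre8
# ADD copies a local file (or fetches a URL) into the image
ADD webapp.war /usr/local/tomcat/webapps/
# RUN executes a command; its result is recorded as a new image layer
RUN apt-get update && apt-get install -y curl \
 && rm -rf /var/lib/apt/lists/*
```

Because each instruction produces a file system layer, rebuilding from the same Dockerfile and context yields a reproducible image.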

The challenge in setting up the Open PHACTS platform is the data loading. Docker images are stored as a series of differential file system layers, which can be pushed and pulled from registries like the Docker Hub and third-party installations of the Docker Registry. This mechanism works well for typical Docker applications, where each layer is on the order of 100 MB and the full Docker image on the order of 1 GB. We found that this mechanism breaks down when used with the Open PHACTS data, which can be on the order of 100 GB as uncompressed RDF.

Use case: Galaxy and NCBI BLAST data

My open GigaScience review and re-review of Peter Cock's article "NCBI BLAST+ integrated into Galaxy" (Cock 2015):

Open science, with immediate response and impact:


  • Docker made it easy to install Galaxy and the BLAST plugin - easy to review the paper
  • But not the NCBI BLAST reference data (10-100 GB)
  • Could not verify the workflows - not reproducible science
  • Proposed: a BLAST Data Manager plugin for Galaxy
  • NCBI BLAST data changes nightly - which version did you use? No metadata, no self-provenance. How to keep it up to date?
  • Suggestion:
    1. A data Docker image to be used as an add-on with the existing Galaxy plugin.
    2. An automated job for making new, VERSIONED NCBI artifacts.
    3. Inject provenance.
    4. Self-archival and identification of the reference data that was actually used, not just "Come try my Galaxy server, I promise I didn't update anything!"
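Steps 1-2 above could be sketched as a nightly shell job that publishes a dated BLAST database as a versioned Maven artifact using the standard deploy:deploy-file goal. The coordinates, file name and repository URL are invented for illustration, and the command is only printed here rather than executed:

```shell
# Hypothetical sketch: coordinates, file name and repository URL are
# made up; a real job would run the command instead of echoing it.
version="2016-02-18"   # date-based version for the nightly data release
cmd="mvn deploy:deploy-file \
  -DgroupId=gov.nih.nlm.ncbi.blast \
  -DartifactId=nt \
  -Dversion=$version \
  -Dpackaging=tar.gz \
  -Dfile=nt-$version.tar.gz \
  -Durl=https://repo.example.org/data-releases"
echo "$cmd"
```

With a scheme like this, a workflow can record the exact coordinates (group, artifact, version) of the reference data it used, giving the self-provenance that the nightly NCBI downloads lack.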

Reproducibility in software development

Apache Maven

Apache Maven is a build and dependency management system that is widely popular on the Java platform. While newer tools like Apache Ivy and Gradle have evolved to counter Maven's perceived verbosity and lack of flexibility, they still derive their dependency resolution and metadata model from Maven.

A software project that is built using Maven defines its compile settings and metadata in a pom.xml file. Maven has an extensive list of official and third-party plugins to cater for customized build requirements, but also strongly encourages a fixed directory structure (e.g. main source code in src/main/java and test resources in src/test/resources) and convention over configuration (Lazar 2009). The metadata can contain information about authors, contact details and source code repositories, which is bundled with the produced binaries for provenance purposes, and can also be used by the Maven Site plugin to generate a website for the project.
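A minimal pom.xml illustrating such metadata; the coordinates match the examples below, while the author and repository details are invented for the example:

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example.project1</groupId>
  <artifactId>project1-api</artifactId>
  <version>1.4.0-SNAPSHOT</version>
  <packaging>jar</packaging>
  <!-- Metadata below is bundled with the binaries for provenance -->
  <developers>
    <developer>
      <name>Alice Example</name>
      <email>alice@example.com</email>
    </developer>
  </developers>
  <scm>
    <url>https://github.com/example/project1</url>
  </scm>
</project>
```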

Building a Maven project produces one or more Maven artifacts (e.g. a JAR file), identified with a group identifier (e.g. com.example.project1), an artifact identifier (project1-api), a version (1.4.0-SNAPSHOT) and a type (jar). Dependencies are specified using the same attributes, and the corresponding precompiled binaries are automatically retrieved from Maven repositories, including transitive dependencies found in the dependency's deployed pom file and any plugins required for building.
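A dependency declaration uses the same coordinates; here a hypothetical dependency on the project1-api artifact from the example above:

```xml
<dependencies>
  <dependency>
    <groupId>com.example.project1</groupId>
    <artifactId>project1-api</artifactId>
    <version>1.4.0-SNAPSHOT</version>
    <!-- <type> defaults to jar and can be omitted -->
  </dependency>
</dependencies>
```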

Maven repositories are HTTP-based directory structures. They include Maven Central and Bintray JCenter, which host the majority of open source JVM libraries, but a repository can be served by any file-based HTTP server like Apache HTTPd or nginx. Organizations relying on Maven may host in-house artifacts in private repository installations of JFrog Artifactory or Sonatype Nexus, which add enterprise features such as rich access control, release mechanisms and transparent mirroring of artifacts from external Maven repositories.
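The directory layout follows directly from the artifact coordinates: dots in the group identifier become directories, followed by the artifact identifier, the version, and the file name artifactId-version.type. A small sketch, reusing the illustrative coordinates from above with a release version:

```shell
# Derive the standard Maven repository path from artifact coordinates.
# Coordinates are the illustrative ones used earlier in the text.
group="com.example.project1"
artifact="project1-api"
version="1.4.0"
path="$(echo "$group" | tr '.' '/')/$artifact/$version/$artifact-$version.jar"
echo "$path"
# → com/example/project1/project1-api/1.4.0/project1-api-1.4.0.jar
```

Because the layout is just files and directories, mirroring or self-hosting a repository needs nothing more than a plain web server.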

Build systems like Maven lower the barrier of entry for developing and deploying Java-based open source software (citation needed): its conventions provide a well-known structure for the source code and its build routine, and its automatic management of dependencies and plugins ideally means that the only requirements for reproducibly building any Maven-based product are Java and Maven itself.