ROUGH DRAFT authorea.com/100564
    Early experience with open data from CERN's Large Hadron Collider

    Abstract

    This chapter presents perspectives from the partners who have worked on releasing large volumes of open research data from the Large Hadron Collider via the CERN Open Data Portal. The "early experience" of the title refers to the launch of the Portal in November 2014, with the release of the first batch of high-level research data collected in 2010 by the CMS Collaboration. The chapter covers the motivation for releasing particle-physics data openly, the challenges faced in doing so, and the solutions developed to overcome them. The authors also touch upon the use cases of the open datasets and the impact the first release has had.

    Introduction

    Located on the outskirts of Geneva on both sides of the Swiss-French border, CERN, the European Laboratory for Particle Physics, is the world's premier research facility for accelerator-based high-energy physics. The laboratory's flagship accelerator is the Large Hadron Collider (LHC), which, at 27 kilometres in circumference, is the biggest and most powerful particle accelerator ever built. It accelerates protons in clockwise and anti-clockwise directions to nearly the speed of light before colliding them at four points on its ring. Gigantic particle detectors, operated by international collaborations of scientists and engineers, are located at each of these collision points. They record information from these collisions, generating an enormous volume of data for analysis: each collision event produces data of the order of megabytes (MB), and, with the LHC delivering collision events 40 million times each second, the CERN Data Centre has collected tens of petabytes since the accelerator began operations in 2010.

    The collisions produced by the LHC allow physicists to study the fundamental particles and forces of nature, and to test the various theories and models that have been proposed to explain their behaviour. Members of the LHC collaborations perform this research by analysing the collision data, and they share the results in open-access publications. The datasets themselves are unique. Beyond the particle-physics community, they are of interest to students and data scientists, and they form an important part of the scientific legacy of the LHC.

    A large portion of these data are now being made publicly accessible, without restriction, to anyone with an internet connection. In 2014, the CMS Collaboration released 27 terabytes (TB) of data that were recorded in the second half of 2010, representing half of that year's data harvest.

    Following this successful release, and equipped with the lessons learnt from the experience, CMS released a second batch of data in April 2016\footnote{\url{http://cms.web.cern.ch/news/cms-releases-new-batch-research-data-lhc}} (300 TB in total, including 100 TB of collision data recorded in 2011). However, this chapter focuses on the first release from November 2014.

    Motivation

    The release of open data from the LHC is motivated, in part, by the desire to ensure that these data remain available to researchers in the future. The four LHC collaborations have therefore adopted policies for data preservation and access\footnote{\url{http://opendata.cern.ch/collection/Data-Policies}}. These policies also cover related matters such as embargo periods, licensing and reuse. The data-preservation policies distinguish four data levels, which help define the resulting recommendations\footnote{see \url{https://arxiv.org/abs/0912.0255}}:

    1. Level 1 data comprise data that are directly related to publications and that provide documentation for the published results.
    2. Level 2 data include simplified data formats for analysis in outreach and training exercises.
    3. Level 3 data comprise reconstructed data and simulations, as well as the analysis-level software needed to allow a full scientific analysis.
    4. Level 4 data cover basic raw-level data (if not yet covered as Level 3 data) and their associated software, and allow access to the full potential of the experimental data.

    The open data discussed in this chapter belong to Levels 2 and 3.

    CERN is a strong proponent of openness in research, having worked with publishers and other laboratories to ensure that all particle-physics results from the LHC collaborations are published as open-access papers\footnote{see \url{https://scoap3.org/} and \url{https://cds.cern.ch/record/1955574}}. More recently, this paradigm has been expanded to cover Open Science more comprehensively\footnote{\url{http://home.cern/cern-people/opinion/2014/11/road-open-science}}. CERN has helped build tools and digital libraries that foster Open Science practices beyond the particle-physics community, such as the widely used Invenio digital library software. Invenio forms the basis for projects such as ZENODO, which has helped thousands of scientists archive their work on CERN's servers with persistent identifiers. To facilitate storage of and access to open data from the LHC collaborations, the laboratory launched the CERN Open Data Portal, built using Invenio, in November 2014.

    All four LHC experiments provide open data in a format suitable for a classroom environment, typically used in the Physics Masterclasses\footnote{\url{http://physicsmasterclasses.org/}} for high-school students around the globe (see also Education under Section 5). In addition to these "educational" datasets, the CMS Collaboration, in accordance with its data-preservation and -access policies, has also released large volumes of high-quality data for research purposes, along with the tools necessary for accessing and analysing them. This chapter describes the motivation behind the launch of the CERN Open Data Portal as well as the background to and experience with the first release of open data from the LHC.