The Linked Paleo Data framework: a common tongue for paleoclimatology
Paleoclimatology is a highly collaborative scientific endeavor, increasingly reliant on online databases for data sharing. Yet there is currently no universal way to describe, store and share paleoclimate data: in other words, no standard. Scientists often regard data standards as mere technicalities, yet such standards underlie much scientific and technological innovation and facilitate collaboration between research groups. In this article, we propose a preliminary data standard for paleoclimate data, general enough to accommodate all the proxy and measurement types encountered in a large international collaboration (PAGES2K). We also introduce a vehicle for such structured data (Linked Paleo Data, or LiPD), leveraging recent advances in knowledge representation (Linked Open Data).
The LiPD framework enables quick querying and extraction, and we expect that it will facilitate the writing of open-source, community codes to access, analyze, model and visualize paleoclimate observations. We welcome community feedback on this standard, and encourage paleoclimatologists to experiment with the format for their own purposes.
Science is entering a data-intensive era, where insight is increasingly gained by extracting information from large volumes of data (Hey 2012). This is particularly critical in paleoclimatology, as understanding past changes in the climate system requires observations across large spatial and temporal scales. Paleoclimatic observations are typically limited to small geographic domains, so investigating large scales requires integrating many disparate studies and datasets. Observational work in paleoclimatology exemplifies the “long-tail” approach to data collection (Heidorn 2008): the majority of observations are gathered by independent scientists who have no formal, standardized language for describing their data and metadata to each other, or to machines. The result is a “Digital Tower of Babel” that makes the curation, access, re-use and valorization of paleoclimate data far more difficult than it should be, and so hinders scientific progress.
Recognizing the need for data sharing, paleoclimate investigators have made a major effort over the past decade to make their data available to the broader community, largely through online archiving systems like the World Data Center for Paleoclimatology and Pangaea. Nonetheless, the lack of consistent formatting and metadata standards (i.e. a common tongue) has made the re-use of such data needlessly labor-intensive by preventing computers from participating in the task of making connections across datasets. As the number of records in these archives has grown, making connections manually has become more and more challenging, hampering integrative efforts at the very time they should be flourishing. Paleoclimatologists thus need a common tongue to describe their datasets to each other and to machines. Achieving this goal requires addressing two major hurdles: (1) the lack of an accepted data container for paleoclimate data; (2) the lack of a community standard for such data.
These two issues are clearly related, but somewhat distinct in practice. The data container must be universally readable, a condition satisfied by, for instance, netCDF files, which have been used for paleoclimate syntheses (Wahl 2010). However, such files only allow for fixed schemas and require identical fields for all proxies. In reality, each proxy dataset may have a unique set of data and metadata properties. For broader applicability, we thus require a more flexible format. Further, to enhance the relevance of paleoclimate data to other fields, one would like this data container to be compatible with the Linked Data paradigm (Bizer 2009), which allows data-driven discovery across datasets, revealing connections that would otherwise be unlikely or impossible to find.
In this technical note, we present LiPD (Linked Paleo Data), a new, flexible linked-data container designed for paleoclimate data. Such a data container is a necessary first step towards a “semantic web of paleoclimatology” (Emile-Geay 2013), and provides a straightforward framework in which communities and researchers can explicitly describe their data and metadata in common terms that both the community and computers can understand. In the process, we introduce a preliminary data standard for paleoclimatology. Such a standard is essential to structuring the metadata, though the container is flexible enough to accommodate many revisions and updates. Ideally, such a standard would emerge from a community-wide discussion and the establishment of a consensus, neither of which has yet taken place in our field. One goal of the present work is to spark such a discussion by giving the worldwide paleoclimate community a strawman to improve upon.
This article is structured as follows. In section 2 we describe the new container, LiPD. In section 3 we describe the proposed metadata standard. We close with a discussion.
- Base metadata about the dataset, e.g.:
  - Identifiers (dataset name, version number, dataset DOI, investigators)
  - Geographic metadata, e.g.:
    - latitude, longitude, elevation above or depth below sea level
  - Publication metadata, e.g.:
    - DOI (which resolves the following information)
    - authors, title, journal, publication date
- Proxy data and metadata, including:
  - One or more tables of measurements, and their metadata
  - Parameter names, units, standards, and interpretations (including forward models)
- Geochronological data and metadata, which can include:
  - Table(s) of radiometric dating measurements and associated metadata
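As an illustration only, the components listed above could be arranged as a single nested record along the following lines. This is a minimal Python sketch under stated assumptions: every key name (`dataSetName`, `geo`, `pub`, `proxyData`, `chronData`, and so on) and every value is a hypothetical placeholder chosen for this example, not the LiPD vocabulary, and plain JSON is used simply as a convenient, widely readable serialization.

```python
import json

# All key names and values below are hypothetical placeholders for
# illustration; they are not the LiPD vocabulary and not real data.
record = {
    # Base metadata: identifiers
    "dataSetName": "example-dataset",
    "version": "0.1",
    "dataSetDOI": None,  # left unset in this sketch
    "investigators": ["Investigator A", "Investigator B"],
    # Base metadata: geographic
    "geo": {"latitude": 0.0, "longitude": 0.0, "elevation": -1000.0},
    # Base metadata: publication
    "pub": {
        "DOI": None,
        "authors": ["Investigator A"],
        "title": "Placeholder title",
        "journal": "Placeholder journal",
        "pubDate": None,
    },
    # Proxy data: one or more measurement tables, each column carrying
    # its own metadata (name, units, standard, interpretation)
    "proxyData": [
        {
            "tableName": "measurements-1",
            "columns": [
                {"parameter": "depth", "units": "cm", "values": [0.0, 1.0]},
                {
                    "parameter": "d18O",
                    "units": "permil",
                    "standard": "VPDB",
                    "interpretation": "placeholder",
                    "values": [0.0, 0.0],
                },
            ],
        }
    ],
    # Geochronological data: table(s) of dating measurements
    "chronData": [
        {
            "tableName": "dating-1",
            "columns": [
                {"parameter": "depth", "units": "cm", "values": [0.0]},
                {"parameter": "age14C", "units": "yr BP", "values": [0.0]},
            ],
        }
    ],
}

SECTIONS = ("geo", "pub", "proxyData", "chronData")


def missing_sections(rec):
    """Return the top-level sections (in SECTIONS order) absent from rec."""
    return [name for name in SECTIONS if name not in rec]


def column_units(rec, section):
    """Map each parameter name in a section's tables to its units."""
    return {
        col["parameter"]: col["units"]
        for table in rec.get(section, [])
        for col in table["columns"]
    }


# Round-trip through JSON text to show the record is machine-readable.
serialized = json.dumps(record, indent=2)
restored = json.loads(serialized)
```

In this sketch each column carries its own metadata, so two proxies need not share the same fields, and a helper such as `column_units(record, "proxyData")` can query units without any knowledge of the proxy type.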