Julien Emile-Geay edited Introduction.tex about 9 years ago

Commit id: 67ef4bda1fc6f83ce6aef47850db66b36917b0e0

\section{Introduction} Science is entering a data-intensive era, where insight is increasingly gained by extracting information from large volumes of data \cite{Hey_2012}. This is particularly critical in paleoclimatology, as understanding past changes in the climate system requires observations across large spatial and temporal scales. Paleoclimatic observations are typically limited to small geographic domains, so investigating large scales requires integrating many disparate studies and datasets. Observational work in paleoclimatology exemplifies the ``long-tail'' approach to data collection \cite{P_Bryan_Heidorn_2008}: the majority of observations are gathered by independent scientists with no formal language for describing their data and metadata to each other -- or to machines -- in a standardized fashion. This results in a ``Digital Tower of Babel'', which makes the curation, access, re-use and valorization of paleoclimate data far more difficult than it should be and hinders scientific progress. Recognizing the need for data sharing, paleoclimate investigators have made a major effort over the past decade to make their data available to the broader community, largely through online archiving systems like the \href{http://www.ncdc.noaa.gov/paleo/wdc-paleo.html}{World Data Center for Paleoclimatology} and \href{http://www.pangaea.de/}{Pangaea}. Nonetheless, the lack of consistent formatting and metadata standards has made the re-use of such data needlessly labor-intensive by preventing computers from participating in the task of making connections across datasets. As the number of records in these archives has grown, making connections manually has become increasingly challenging, hampering integrative efforts at the very time they should be flourishing.
Enabling such integration requires addressing two major hurdles: (1) the lack of an accepted data container for paleoclimate data; (2) the lack of a community standard for such data. These two issues are clearly related, but somewhat distinct in practice. The data container must be universally readable, a condition satisfied by, for instance, netCDF files, which have been used for paleoclimate syntheses \cite{Wahl_2010}. However, such files only allow for fixed schemas and require identical fields for all proxies. In reality, each proxy dataset may have a unique set of data and metadata properties. For broader applicability, we thus require a more flexible format. Further, to enhance the relevance of paleoclimate data to other fields, one would like this data container to be compatible with the Linked Data paradigm \cite{Bizer_2009}, which allows for data-driven discovery of connections between datasets that would otherwise be unlikely or impossible. In this technical note, we present LiPD (Linked Paleo Data), a new, flexible linked-data container designed for paleoclimate data. Such a data container is a necessary first step towards a ``semantic web of paleoclimatology'' \cite{Emile_Geay_2013}, and provides a straightforward framework in which communities and researchers can explicitly describe their data and metadata in common terms that both the community and computers can understand. In the process, we introduce a preliminary data standard for paleoclimatology. Such a standard is essential to structuring the metadata, though the container is flexible enough to accommodate many revisions and updates. Ideally, such a standard would emerge from a community-wide discussion and the establishment of a consensus, neither of which has yet taken place in our field. One goal of the present work is to spark such a discussion by giving the worldwide paleoclimate community a strawman to improve upon.