Julien Emile-Geay edited Introduction.tex  over 9 years ago

Commit id: c377ed24bb0bfdaecc3895ef059eb6303421a9d9


\section{Introduction}

Science is entering a data-intensive era, where insight is increasingly gained by extracting information from large volumes of data \cite{Hey_2012}. This is particularly critical in paleoclimatology, as understanding past changes in the climate system requires observations across large spatial and temporal scales. Paleoclimatic observations are typically limited to small geographic domains, so investigating large scales requires integrating many disparate studies and datasets. Observational work in paleoclimatology exemplifies the ``long-tail'' approach to data collection \cite{P_Bryan_Heidorn_2008}: the majority of observations are gathered by independent scientists with no formal language for describing their data and metadata to each other -- or to machines -- in a standardized fashion. This results in a ``Digital Tower of Babel'', which makes the curation, access, re-use and valorization of paleoclimate data far more difficult than it should be and hinders scientific progress.

Recognizing the need for data sharing, paleoclimate investigators have made a major effort over the past decade to make their data available to the broader community, largely through online archiving systems like the \href{http://www.ncdc.noaa.gov/paleo/wdc-paleo.html}{World Data Center for Paleoclimatology} and \href{http://www.pangaea.de/}{Pangaea}. Nonetheless, the lack of consistent formatting and metadata standards has made the re-use of such data needlessly labor-intensive by preventing computers from participating in the task of making connections across datasets. As the number of records in these archives has grown, making connections manually has become more and more challenging, hampering integrative efforts at the very time they should be flourishing. Enabling such integrative work requires overcoming two major hurdles: (1) the lack of an accepted data container for paleoclimate data; (2) the lack of a community standard for such data.
These two issues are clearly related, but somewhat distinct in practice. The data container must be universally readable, a condition satisfied by, for instance, netCDF files, which have been used for paleoclimate syntheses \cite{Wahl_2010}. However, such files only allow for fixed schemas and require identical fields for all proxies. In reality, each proxy dataset may have a unique set of data and metadata properties. For broader applicability, we thus require a more flexible format. Further, to enhance the relevance of paleoclimate data to other fields, one would like this data container to be compatible with the Linked Data paradigm \cite{Bizer_2009}, which allows for data-driven discovery between datasets that would otherwise be unlikely or impossible.

In this technical note, we present LiPD (Linked Paleo Data), a new, flexible linked-data container designed for paleoclimate data. Such a data container is a necessary first step towards a ``semantic web of paleoclimatology'' \cite{Emile_Geay_2013}, and provides a straightforward framework in which communities and researchers can explicitly describe their data and metadata in common terms that the community, and computers, can understand. In the process, we introduce a preliminary data standard for paleoclimatology. Indeed, such a standard is essential to structuring the metadata, though the container is flexible enough to accommodate many revisions and updates. Ideally, such a standard would proceed from a community-wide discussion and the establishment of a consensus, which has yet to take place in our field. One goal of the present work is to spark such a discussion by giving the worldwide paleoclimate community a strawman to improve upon.

This article is structured as follows. In Section 2 we describe the new container, LiPD. In Section 3 we describe the proposed metadata standard. In Section 4 we demonstrate the utility of this framework for analyzing a large multiproxy dataset \cite{Ahmed_2013}.
We close with a discussion section.

A clear solution to this problem is to \textbf{establish data and metadata standards for paleoclimatology}. Standardization would pave the way for many radical improvements in paleoclimatology, as it has in other fields of science and industry. First, it would permit crowd-sourced data curation, which would relieve data curators of a significant burden and bring more dark data to light. Second, it would enable universal, open-source software libraries to be built, ensuring that the whole community has access to sound, state-of-the-art tools to process, analyze, compare and model their data. Third, it would allow semantic technologies to enter the realm of paleoclimatology, thus enabling the tremendous apparatus of the machine-learning and artificial intelligence communities to discover new patterns in the data. It would also uncover relationships with other Linked Open Data \citep{BHBL09}, both within and outside the geosciences.

Achieving these lofty goals, however, requires unprecedented levels of community consultation, cooperation and consensus. Fortunately, techn

The Linked Data paradigm \cite{Bizer_2009} was designed to address this type of problem, and to allow for data-driven discovery between datasets that would be unlikely or impossible otherwise. In this technical note, we present a new, flexible linked-data container designed for paleoclimate data. Such a data container is a necessary first step towards a ``semantic web of paleoclimatology'' \cite{Emile_Geay_2013}, and provides a straightforward framework in which communities and researchers can explicitly describe their data and metadata in common terms that the community, and computers, can understand.