A flexible container for paleoclimate data

Paleoclimate observations come in many varieties; standardizing the data and providing a framework that encodes the meaning of parameters and metadata requires a flexible and extensible format. The linked-data variety of JavaScript Object Notation (JSON-LD) provides a lightweight, human-readable solution to this problem. Because JSON-LD is almost infinitely customizable, we present here conventions for Linked Paleo Data (LiPD), which uses JSON-LD and provides a structure common to the overwhelming majority (if not all) of paleoclimate observational datasets. Despite their variety, all paleoclimate datasets, which commonly comprise both direct observations and inferred variability, share the same major features:

  1. Base metadata about the dataset, e.g.:

    1. Identifiers (dataset name, version number, dataset DOI, investigators)

    2. Archive type (see note 1)

  2. Geographic metadata, e.g.:

    1. latitude, longitude, elevation above or depth below sea level

    2. site name

  3. Publication metadata, e.g.:

    1. DOI (which resolves the following information)

    2. authors, title, journal, publication date

  4. Proxy data and metadata, including:

    1. One or more tables of measurements, and their metadata

    2. Parameter names, units, standards, and interpretations (including forward models)

  5. Geochronological data and metadata, which can include:

    1. Table(s) of radiometric dating measurements and associated metadata

    2. Age model ensembles

    3. Author interpretation and methodological choices

LiPD encodes these data and metadata into a structured hierarchy that allows explicit description of any aspect of the dataset at any level of the data (Figure 1). LiPD serializes this hierarchy in JSON-LD as nested lists and key-value pairs. To describe the geospatial metadata of a given site, LiPD adopts the GeoJSON standard, for example:

"geo": { "type": "Feature", "geometry": { "type": "Point", "coordinates": [-17.82, 62.08, -1938] }, "properties": { "siteName": "RAPiD-12-1K, South Iceland Rise, northeast North Atlantic" } },

Note that the GeoJSON standard defaults to the WGS84 ellipsoid, with latitude and longitude in decimal degrees and elevation in meters above sea level. Coordinates are listed in the order longitude, latitude, elevation. This standard readily accommodates polygonal geographic features and additional location metadata.
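As an illustration of how such a block might be consumed, the following minimal Python sketch parses the "geo" example above and pulls out the site name and coordinates. The helper name `read_site` is hypothetical and not part of any LiPD library; the fragment is wrapped in braces only so that it forms a complete JSON document.

```python
import json

# The "geo" block from the example above, wrapped in braces to form valid JSON.
lipd_fragment = """
{
  "geo": {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-17.82, 62.08, -1938]},
    "properties": {"siteName": "RAPiD-12-1K, South Iceland Rise, northeast North Atlantic"}
  }
}
"""


def read_site(geo):
    """Hypothetical helper: return (site name, lon, lat, elevation) for a Point feature."""
    if geo["geometry"]["type"] != "Point":
        raise ValueError("this sketch only handles Point geometries")
    coords = geo["geometry"]["coordinates"]
    # GeoJSON lists longitude first, then latitude, then an optional elevation.
    lon, lat = coords[0], coords[1]
    elev = coords[2] if len(coords) > 2 else None
    return geo["properties"].get("siteName"), lon, lat, elev


site = read_site(json.loads(lipd_fragment)["geo"])
print(site)
```

Returning a plain tuple keeps the sketch short; a real reader would probably also need to handle polygon geometries, which this example deliberately rejects.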

LiPD adopts the Linked Data extension of BibJSON \cite{johnson2013indexing} to describe publication metadata, for example:

"pub": { "author": [ {"name" : "Thornalley, D.J.R"}, {"name" : "Elderfield, H.}, {"name" : "McCave, N"} ] "type" : "article" "identifier" : [ {"type": "doi", "id": "10.1038/nature07717", "url": "http://dx.doi.org/10.1038/nature07717"} ], "pubYear": 2009 },

For the paleoData and chronData components, which include tabular data, LiPD does not store the tables themselves in JSON-LD, because that representation becomes increasingly verbose and inefficient as tables grow. Instead, the tabular data are stored in headerless comma-separated values (CSV) files that are referenced and described by the JSON-LD file, following the recommendations of the W3C’s CSV on the Web Working Group, like this:

"paleoData": [{ "paleoDataTableName": "data", "filename": "Atlantic0220Thornalley2009.csv", "columns": [{ "number": 1, "parameter": "depth", "parameterType": "measured", "description": "depth below ocean floor", "units": "cm", "datatype": "csvw:NumericFormat", "notes": "depth refers to top of sample" }, { "number": 2, "parameter": "year", "parameterType": "inferred", "description": "calendar year AD", "units": " AD", "datatype": "csvw:NumericFormat", "method": "linear interpolation" }, { "number": 3, "parameter": "temperature", "parameterType": "inferred", "description": "sea-surface temperature inferred from Mg/Ca ratios", "datatype": "csvw:NumericFormat", "material": "foramifera carbonate", "calibration": { "equation": "BAR2005: Mg/Ca=0.794*exp(0.10*SST)", "reference": "Barker et al., (2005), Thornalley et al., (2009)", "uncertainty": 1.3 }, "units": "deg C", "proxy": "Mg/Ca", "climateInterpretation": { "parameter": "T", "parameterDetail": "seaSurface", "seasonality": "MJJ", "interpDirection": "positive", "basis": "Mg/Ca calibration to SST" } }] }]

Describing the columns of the data table in the LiPD framework allows explicit encoding of key metadata that are commonly lost or misunderstood in current data structures. For example, the “climateInterpretation” section above allows the scientist to describe explicitly how the parameter “senses” climate. When encoded as above and explicitly defined and linked, the knowledge that this record is interpreted to record May through July sea surface temperature, and that those temperature estimates were derived from the Mg/Ca calibration equations of \citet{Barker_2005} and \citet{Thornalley_2009}, becomes built into the dataset and readable by both people and computers. The interpretation is queryable and linked to other datasets, and it remains transparent when datasets are used in ways that fall outside the published interpretations.
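To make the two ideas above concrete, the following Python sketch uses the column metadata to label a headerless CSV table and then queries the "climateInterpretation" entries. It is only a sketch under stated assumptions: the function names `read_table` and `columns_sensing` are hypothetical, the metadata is an abbreviated copy of the example, and the CSV rows are placeholder values rather than measurements from the actual dataset.

```python
import csv
import io

# Abbreviated copy of the "columns" metadata from the example above.
columns = [
    {"number": 1, "parameter": "depth", "units": "cm"},
    {"number": 2, "parameter": "year", "units": "AD"},
    {
        "number": 3,
        "parameter": "temperature",
        "units": "deg C",
        "climateInterpretation": {
            "parameter": "T",
            "parameterDetail": "seaSurface",
            "seasonality": "MJJ",
        },
    },
]

# Hypothetical placeholder rows standing in for the headerless CSV file;
# these are NOT measurements from the actual dataset.
csv_text = "0.5,1990,9.1\n1.5,1975,8.7\n"


def read_table(handle, columns):
    """Label headerless CSV columns using the 1-based "number" key of the metadata."""
    names = {col["number"]: col["parameter"] for col in columns}
    table = {name: [] for name in names.values()}
    for row in csv.reader(handle):
        for number, name in names.items():
            table[name].append(float(row[number - 1]))
    return table


def columns_sensing(columns, parameter, seasonality=None):
    """Return the names of columns whose climateInterpretation matches the query."""
    matches = []
    for col in columns:
        interp = col.get("climateInterpretation")
        if interp is None or interp.get("parameter") != parameter:
            continue
        if seasonality is not None and interp.get("seasonality") != seasonality:
            continue
        matches.append(col["parameter"])
    return matches


table = read_table(io.StringIO(csv_text), columns)
mjj_temperature = columns_sensing(columns, "T", seasonality="MJJ")
print(mjj_temperature, table[mjj_temperature[0]])
```

Keeping the interpretation next to the column description means a query such as "all columns interpreted as May through July temperature" needs no knowledge of the original publication, only of the metadata keys.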

An advantage of using JSON as the default container for this information is that it is an extremely common vehicle for all manner of data and can be parsed by nearly all modern programming languages. Because each LiPD dataset comprises a JSON-LD file and one or more CSV files, each dataset is packaged using BagIt (see note 2), which provides a simple format for collecting and validating files for distribution and can be readily serialized into a compressed file for exchange between users. To facilitate input and output of LiPD datasets, we are developing code to easily import LiPD datasets into, and export them from, Matlab (as structured arrays), R (as lists), and Python (as dictionaries).
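The packaging step can be sketched with the Python standard library alone. The example below writes a minimal BagIt-style layout (a `bagit.txt` declaration, a `data/` payload directory, and a checksum manifest) and then verifies it. This is an illustration of the BagIt idea, not the LiPD tooling itself: the helper names, the JSON-LD file name, the choice of SHA-256, and the file contents are all assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir, payload):
    """Hypothetical helper: write payload files (name -> bytes) in a minimal BagIt layout."""
    bag_dir = Path(bag_dir)
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    # Bag declaration and checksum manifest (SHA-256 is an assumption of this sketch).
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag_dir


def validate_bag(bag_dir):
    """Return True if every payload file still matches its manifest checksum."""
    bag_dir = Path(bag_dir)
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True


# Placeholder contents; the JSON-LD file name is an assumption for illustration.
payload = {
    "Atlantic0220Thornalley2009.jsonld": b'{"paleoData": []}',
    "Atlantic0220Thornalley2009.csv": b"0.5,1990,9.1\n",
}

with tempfile.TemporaryDirectory() as tmp:
    bag = make_bag(Path(tmp) / "lipd_bag", payload)
    is_valid = validate_bag(bag)
    # Corrupt one payload file to show that validation detects the change.
    (bag / "data" / "Atlantic0220Thornalley2009.csv").write_bytes(b"tampered")
    is_valid_after_tamper = validate_bag(bag)

print(is_valid, is_valid_after_tamper)
```

The resulting directory could then be compressed for exchange, for instance with `shutil.make_archive`.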


  1. The archive is the medium in which the paleoclimatic signal is imprinted, e.g., coral aragonite, ice core, or foraminiferal calcite in a sediment core.

  2. https://en.wikipedia.org/wiki/BagIt