Patrick BROCKMANN
Software engineer at LSCE (Climate and Environment Sciences Laboratory)
Date: 13 November 2015
Release: 0.6
The Linked Paleo Data (LiPD) container based on the Linked Data JSON (JSON-LD) format is a practical solution to the problem of organizing and storing hierarchical paleoclimate data in a generalizable schema. This is an important step forward towards standardizing the representation and linkage of diverse paleoclimate datasets.
In this IPython notebook, I present experimental converters for interacting with the LiPD container through ordinary spreadsheets. The motivation is that the paleoclimate community mainly uses spreadsheets, not JSON-based formats, to edit and store the data and metadata of its measurements. What is missing is a way to convert such spreadsheet-based data to the LiPD format and vice versa.
Working directly with LiPD has two other disadvantages:
Therefore, I propose to stick with the use of spreadsheets but standardize them into a structured spreadsheet where the data and the metadata are stored in two separate worksheets of the same spreadsheet document. The dot notation is used to represent the hierarchical nature of the metadata attributes. Following the nomenclature of LiPD, I call this structured spreadsheet PDS for Paleo Data Spreadsheet.
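To illustrate the dot notation, here is a minimal sketch of how hierarchical metadata could be flattened into dotted attribute names and rebuilt again. The function names `flatten` and `unflatten` and the attribute names in the sample are hypothetical and chosen for illustration only; they are not taken from PDSlib or from the LiPD specification.

```python
def flatten(nested, prefix=""):
    """Turn a nested dict into {"a.b.c": value} pairs using dot notation."""
    flat = {}
    for key, value in nested.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend one level and extend the dotted path.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def unflatten(flat):
    """Rebuild the nested dict from dot-notation keys."""
    nested = {}
    for path, value in flat.items():
        *parents, leaf = path.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


# Hypothetical metadata; the attribute names are illustrative only.
metadata = {
    "dataSetName": "example-core",
    "geo": {"latitude": 45.0, "longitude": -10.5},
    "pub": {"author": "N1", "year": 2015},
}

rows = flatten(metadata)      # one (attribute, value) pair per spreadsheet row
restored = unflatten(rows)    # back to the hierarchical form
```

Each key of `rows` would occupy one row of the Metadata worksheet, and `unflatten` recovers the hierarchy needed for a JSON-based container.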
With a PDS, users can edit their data directly in an ordinary spreadsheet program such as Excel or OpenOffice and later convert it to LiPD, which, being JSON-based, is a good container for storing data in a document database like MongoDB.
In addition, I have implemented converters to transform a PDS into pandas dataframes, which are convenient for subsequent data analysis, for example in an IPython notebook.
The Data worksheet:
The Metadata worksheet:
Notes:
pandas does not yet support reading or writing ODS (OpenDocument Spreadsheet) files, but this feature has been requested many times and should be feasible soon (pandas/issue 2311).
The LiPD container refers to a headerless CSV file where the data are stored. Each column is therefore identified only by its column number and is poorly documented, which could lead to confusion. I think it would be safer and clearer to use the first row as a header that names the columns after the parameters.
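The difference between the two CSV layouts can be sketched with pandas as follows. The column names `depth` and `temperature` and the values are made up for the example.

```python
import io

import pandas as pd

# Hypothetical measurements; the names and values are illustrative only.
csv_without_header = "0.5,12.1\n1.0,11.8\n1.5,11.2\n"
csv_with_header = "depth,temperature\n" + csv_without_header

# Headerless file: columns can only be addressed by position (0, 1, ...).
anonymous = pd.read_csv(io.StringIO(csv_without_header), header=None)

# First row used as header: columns are addressed by parameter name.
named = pd.read_csv(io.StringIO(csv_with_header))

first_temperature_by_position = anonymous[1].iloc[0]
first_temperature_by_name = named["temperature"].iloc[0]
```

Both dataframes hold the same numbers, but only `named` documents which column is which.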
The LiPD container can contain a list object for values, e.g. [{ "author": [{"name" : "N1"}, {"name" : "N2"}, {"name" : "N3"}] }]. It would be simpler to disallow this and accept only a single value per attribute.
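To show why lists complicate a dot-notation spreadsheet, here is a sketch of one possible convention if lists were kept: the position in the list becomes part of the dotted path. This indexing scheme and the function name `flatten_with_lists` are assumptions made for illustration; they are not part of PDS or LiPD.

```python
def flatten_with_lists(value, prefix=""):
    """Flatten dicts and lists; list positions become path components."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)
    else:
        # Leaf value: the accumulated path is the attribute name.
        return {prefix: value}
    flat = {}
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten_with_lists(child, path))
    return flat


record = {"author": [{"name": "N1"}, {"name": "N2"}, {"name": "N3"}]}
flat = flatten_with_lists(record)
```

Under this convention, every list element adds its own row to the Metadata worksheet (`author.0.name`, `author.1.name`, ...), and a converter has to decide whether a numeric path component denotes a list index or an ordinary key. Single values avoid that ambiguity.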
The spreadsheet cells must be formatted correctly, i.e. cells that hold numbers must be typed as numbers and not as text. The same applies to dates.
A compliance checker needs to be built to check that input files conform to the PDS structure, something like the netCDF CF-checker.
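As a starting point, such a checker might look like the sketch below. It works on worksheets already loaded into plain Python lists, so it does not depend on any spreadsheet library. The rules it enforces are assumptions of mine rather than an agreed PDS specification: both worksheets must be present, the first row of the Data worksheet is a header, every other Data cell is a number, and the first column of the Metadata worksheet holds dotted attribute names.

```python
import re

REQUIRED_SHEETS = ("Data", "Metadata")
# Assumed shape of an attribute name: identifiers separated by single dots.
DOTTED_KEY = re.compile(r"^[A-Za-z_]\w*(\.[A-Za-z_]\w*)*$")


def is_number(cell):
    """True for int/float cells; booleans and text are rejected."""
    return isinstance(cell, (int, float)) and not isinstance(cell, bool)


def check_pds(sheets):
    """Return a list of problems; an empty list means none were found.

    `sheets` maps a worksheet name to its rows (each row a list of cells).
    """
    problems = [
        f"missing worksheet: {name}"
        for name in REQUIRED_SHEETS
        if name not in sheets
    ]
    if problems:
        return problems

    data_rows = sheets["Data"]
    if not data_rows:
        problems.append("Data worksheet is empty")
    else:
        header, *rows = data_rows
        # Spreadsheet rows are 1-based and row 1 is the header.
        for row_number, row in enumerate(rows, start=2):
            for column, cell in zip(header, row):
                if not is_number(cell):
                    problems.append(
                        f"Data row {row_number}, column {column!r}: "
                        f"{cell!r} is not a number"
                    )

    for row_number, row in enumerate(sheets["Metadata"], start=1):
        key = str(row[0]) if row else ""
        if not DOTTED_KEY.match(key):
            problems.append(
                f"Metadata row {row_number}: {key!r} is not a valid "
                "dotted attribute name"
            )
    return problems


# Hypothetical documents; names and values are illustrative only.
valid = {
    "Data": [["depth", "temperature"], [0.5, 12.1], [1.0, 11.8]],
    "Metadata": [["dataSetName", "example-core"], ["geo.latitude", 45.0]],
}
invalid = {
    "Data": [["depth", "temperature"], [0.5, "12.1"]],  # number stored as text
    "Metadata": [["geo..latitude", 45.0]],              # malformed attribute
}

valid_report = check_pds(valid)
invalid_report = check_pds(invalid)
```

Returning a list of messages instead of raising on the first error lets a user fix all the problems of a spreadsheet in one pass, which is also how the CF-checker reports its findings.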
The working group of PAGES (Past Global Changes) called 2K Network proposes to collect data using a PAGES2k/NOAA metadata template.
This template is very detailed but not easily convertible to other structures such as pandas dataframes or the LiPD container.
A RESTful web service could be implemented to visualize a PDS file as an HTML page with interactive plots. Bokeh, a python interactive visualization library, would be useful for this purpose.
PDSlib is a Python module that offers two-way converters between different structures: PDS, pandas dataframes (df), and the LiPD container.