ROUGH DRAFT authorea.com/90671
Main Data History
Export
Show Index Toggle 0 comments
  •  Quick Edit
  • Proposal for a structured spreadsheet for Paleo Data to interface with the Linked Paleo Data container

    Patrick BROCKMANN
    Software engineer at LSCE (Climate and Environment Sciences Laboratory)



    Date: 13 November 2015
    Release: 0.6

    Proposal for a structured spreadsheet for Paleo Data to interface with the Linked Paleo Data container

    The Linked Paleo Data (LiPD) container based on the Linked Data JSON (JSON-LD) format is a practical solution to the problem of organizing and storing hierarchical paleoclimate data in a generalizable schema. This is an important step forward towards standardizing the representation and linkage of diverse paleoclimate datasets.

    In this IPython notebook, I have experimental converters to interact with the LiPD container using ordinary spreadsheets. The motivation to create this method is guided by the fact that the paleoclimate community uses mainly spreadsheets to edit and store the data and the metadata of their measurements, and not JSON-based formats. What is missing is a way to convert such spreadsheet-based data to LiPD format and vice versa.

    Working directly with LiPD has two other disadvantages:

    1. JSON notation will never be as easy to edit and modify than a spreadsheet document;
    2. The LiPD container refers to a headerless CSV data file which requires the user to continually navigate the nested attributes of the LiPD file in order to figure out which parameters correspond to which columns in the CSV file.

    Therefore, I propose to stick with the use of spreadsheets but standardize them into a structured spreadsheet where the data and the metadata are stored in two separate worksheets of the same spreadsheet document. The dot notation is used to represent the hierarchical nature of the metadata attributes. Following the nomenclature of LiPD, I call this structured spreadsheet PDS for Paleo Data Spreadsheet.

    With a PDS, users can directly edit their data in an ordinary spreadsheet program like Excel or OpenOffice and later convert them to LiPD, which is a good container for storing data in a document database like mongoDB (since it uses JSON). In addition, I have implemented converters to transform PDS to python pandas dataframes, which are convenient for subsequent data analysis in e.g. an IPython notebook.

    The Paleo Data Spreadsheet (PDS) structure

    • 2 worksheets: Data and Metadata.
    • The Data worksheet presents the data in a matrix with named columns for each parameter and rows for each sample, as in a CSV file.
    • The Metadata worksheet has 2 columns corresponding to the Attribute and Value of all the parameters described in the Data worksheet. There are no headers. Hierarchical attributes are denoted by the dot notation (e.g. parameter.attribute) in a Attribute column and their values are contained in a Value column. If there is no corresponding parameter in the Data worksheet, then it is assumed that the attribute is global (e.g. filename).
    • Missing data are left as empty cells.
    • No comments are allowed in the Data and Metadata worksheets but other worksheets can be added for this purpose.

    The Data worksheet:


    The Metadata worksheet:


    Notes:

    1. Currently, pandas does not yet support the reading and the writing of ODS (Open Document Spreadsheets) but there are many requests for this feature and it should be feasible soon (pandas/issue 2311).

    2. The LiPD container refers to a headerless CSV file where the data are stored. Each column is therefore referenced only by a column number and is very poorly documented, and this could lead to confusion. I think it would be safer and clearer to use the first row as a header to name the columns by the parameter names.

    3. The LiPD container can contain a list object for values, e.g. [{ "author": [{"name" : "N1"}, {"name" : "N2"}, {"name" : "N3"}] }]. It would be simpler to disable this possibility and have only unique values.

    4. The spreadsheet cells must be formatted correctly, i.e. numbers cells are specified as number and not text. Same for dates.

    5. A compliance checker needs to be built to check that input files conform to the PDS structure, something like the netCDF CF-checker.

    6. The working group of PAGES (Past Global Changes) called 2K Network proposes to collect data using a PAGES2k/NOAA metadata template. This template is very detailled but not easilly convertible to other structures such a pandas dataframes or LiPD container.

    7. A RESTful web service could be implemented to visualize a PDS file as an HTML page with interactive plots. Bokeh, a python interactive visualization library, would be useful for this purpose.

    Converting from one structure to another with PDSlib

    PDSlib is a python module that offers two-way converters for different structures: PDS, pandas dataframes (df), and LiPD container.

    • aLiPD = PDStoLiPD(file, verbose=True)
      • Convert a PDS to a LiPD container
      • If verbose is True, warning messages will be printed
    • LiPD_to_PDS(aLiPD, file)
      • Convert a LiPD container to a PDS
    • dfD, dfM = PDS_to_df(file)
      • Convert a PDS to 2 pandas dataframes for Data and Metadata, respectively
    • df_to_PDS(file, dfD, dfM)
      • Convert 2 pandas dataframes for Data and Metadata to a PDS
    • aLiPD = df_to_LiPD(dfD, dfM, verbose=True)
      • Convert 2 pandas dataframes for Data and Metadata to a LiPD container
      • If verbose is True, warning messages will be printed
    • dfD, dfM = LiPD_to_df(aLiPD)
      • Convert a LiPD container to 2 pandas dataframes for Data and Metadata, respectively
    • df_plot(dfD, xCol, [yCols])
      • Convert a data pandas dataframe to an interactive plot using xCol and yCols, respectively as xaxis and yaxis
      • The function relies on bokeh module

    The full notebook can be seen from http://nbviewer.ipython.org/github/PBrockmann/Paleo_Data_Spreadsheet/blob/master/Paleo_Data_Spreadsheet.ipynb