IEDA EarthChem: Supporting the sample-based geochemistry community with data resources to accelerate scientific discovery
Kerstin A. Lehnert(1), Leslie Hsu(*1), Tiffany A. Rivera(2), J. Douglas Walker(3)
* Corresponding author: firstname.lastname@example.org (1) Lamont-Doherty Earth Observatory, Columbia University, Palisades, NY 10964, USA (2) Westminster College, Salt Lake City, UT 84105 (3) University of Kansas, Lawrence, KS 66045, USA
Three main points
1. Integrated geochemistry databases enable new science findings in diverse topics
2. EarthChem maintains a data file repository, synthesis databases, and a data portal
3. Disciplinary focus and user community input improve usability
Integrated sample-based geochemical measurements enable new scientific discoveries in the Earth sciences. However, integration of geochemical data is difficult because of the variety of sample types and measured properties, idiosyncratic analytical procedures, and the time commitment required for adequate documentation. To support geochemists in integrating and reusing geochemical data, EarthChem, part of IEDA (Integrated Earth Data Applications), develops and maintains a suite of data systems to serve the scientific community. The EarthChem Library focuses on dataset publication, accessibility, and linking with other sources. Topical synthesis databases (e.g., PetDB, SedDB, Geochron) integrate data from several sources and preserve metadata associated with analyzed samples. The EarthChem Portal optimizes data discovery and provides analysis tools. Contributing authors obtain citable DOIs, usage reports for their data, and increased discoverability. The community benefits from open access to data, which accelerates scientific discovery. Growing citations of EarthChem systems demonstrate their success.
Geochemical compilations containing large numbers of dense, statistically significant measurements have driven global-scale scientific discoveries. Examples include studies on diversity in MORB composition (e.g., Gale et al., 2013), global distributions of elements in the Earth’s derma layers (Rauch, 2011), and global patterns of intraplate volcanism (Conrad et al., 2011). New analytical methodologies allow data to be collected at increasing rates, which should translate into more ground-breaking scientific discoveries. With this anticipated increase, it is not feasible for individual scientists to compile “all available global data” from the existing literature. This highlights the need for data systems that support data discovery, access, and analysis for investigators, who are otherwise left with a disorganized heap of unusable data.
The IEDA (Integrated Earth Data Applications) EarthChem data facility (http://www.earthchem.org) develops and operates digital data collections focused on the geochemistry of rocks and sediments from a wide range of global geographic settings. EarthChem citations show that its use is extending far beyond its rock and sediment geochemistry origins (http://www.earthchem.org/citations). For example, EarthChem has been cited in studies as diverse as the prediction of natural base-flow stream water chemistry (Olson et al., 2012), a prototype of a web-based relational database for archaeological ceramics (Hein et al., 2011), and strontium and oxygen isotope fingerprinting of green coffee beans and its potential to prove the authenticity of coffee (Rodrigues et al., 2010).
The citations, both within geochemistry and petrology and in innovative uses beyond them, demonstrate the utility of the databases to the scientific community. However, that utility comes only after much work to address several challenges: the complexity of standardizing data and information, the scarcity of investigator contributions due to lack of time or willingness, the time needed to organize data extracted from the literature, and the development and maintenance of systems that are useful to and used by the community. Geochemistry is an example of a discipline in the "long tail" of data (Heidorn, 2008), where individual investigators and labs hold troves of data collected with one-of-a-kind, newly developed techniques. This type of data raises its own issues in data system development. Disciplinary expertise is extremely helpful for properly documenting data and associated metadata for reuse. A recurring theme is how to balance quality control with the amount of documentation provided, while giving proper credit to the investigators who originally obtained the data.
In this contribution, we describe the origin and current capabilities of IEDA EarthChem resources for sample-based geochemical data, list the benefits of those resources for scientists, and highlight some of the derived scientific results. We describe the options available to investigators for submitting their data to the system and opportunities for scientific attribution. We show how EarthChem has addressed the challenges related to long-tail scientific data management and contributed to scientific output.
A sample-based data system stores observations that come from discrete samples, such as rocks, sediment, fluid, or other materials. Analytical measurements of the samples, descriptions of sampling location and techniques, analytical procedures of data collection, and pre-analysis sample preparation are stored in an integrated manner. Here, integration means alignment and standardization of vocabularies, sample names, and output. These multi-layered and interrelated pieces of information create additional challenges when compared to grid-based sensor data (e.g., satellite, seismic, elevation), which may have better standardization and data formats. Sample-based databases often grow from the interests and efforts of a single investigator, slowly gaining traction, data, and users until they evolve into an online, accessible system.
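The kind of record described above can be illustrated with a minimal Python sketch. It is not the data model of PetDB or any other EarthChem system; the class names, field names, vocabulary entries, and values are all hypothetical and chosen only to show how measurements, sampling descriptions, analytical procedures, and sample preparation might be kept together, with free-text terms aligned to a controlled vocabulary on entry.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical controlled vocabulary: free-text variants map to one standard term.
MATERIAL_VOCAB = {
    "basalt": "basalt",
    "basaltic rock": "basalt",
    "sed": "sediment",
    "sediment": "sediment",
}


def standardize_material(raw: str) -> str:
    """Return the controlled-vocabulary term for a free-text material name."""
    key = raw.strip().lower()
    if key not in MATERIAL_VOCAB:
        raise ValueError(f"unknown material term: {raw!r}")
    return MATERIAL_VOCAB[key]


@dataclass
class Analysis:
    variable: str          # measured property, e.g. "SiO2"
    value: float
    unit: str              # e.g. "wt%"
    method: str            # analytical procedure, e.g. "XRF"
    preparation: str = ""  # pre-analysis sample preparation


@dataclass
class Sample:
    name: str
    material: str
    latitude: float
    longitude: float
    sampling_technique: str
    analyses: List[Analysis] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Align the material name with the vocabulary when the record is created.
        self.material = standardize_material(self.material)


# Illustrative record with invented values.
sample = Sample("EX-001", "Basaltic rock", -12.5, 45.0, "dredge")
sample.analyses.append(Analysis("SiO2", 50.1, "wt%", "XRF", "powdered"))
```

Rejecting unknown terms, rather than storing them as given, is one possible design choice; a production system might instead queue such terms for curator review.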
One of the first online sample-based geochemical databases was PetDB, the Petrological Database, formerly the Petrological Database of the Ocean Floor. The database was built on a sample-based data model (Lehnert et al., 2000), which served as a foundational structure for several disciplinary databases developed in the following decade, including SedDB (Lehnert et al., 2005), GEOROC (Sarbas et al., 2009), NAVDAT (Walker et al., 2006), and VentDB (Mottl, 2012). These databases combine data from numerous sources into a single relational synthesis database, allowing the rapid production of integrated datasets and significantly reducing the time previously needed to compile the same data manually from the original sources.
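To make the idea of a relational synthesis database concrete, the following is a small sketch using Python's built-in sqlite3 module. The three-table schema, the table and column names, and the inserted rows are assumptions made for illustration; they do not reproduce the schema of PetDB or any of the databases named above. The sketch only shows how measurements from different source publications can be stored once and then retrieved as a single integrated dataset with one query.

```python
import sqlite3

# In-memory database with a hypothetical, heavily simplified schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source (id INTEGER PRIMARY KEY, citation TEXT NOT NULL);
CREATE TABLE sample (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE analysis (
    id INTEGER PRIMARY KEY,
    sample_id INTEGER NOT NULL REFERENCES sample(id),
    source_id INTEGER NOT NULL REFERENCES source(id),
    variable TEXT NOT NULL,
    value REAL NOT NULL,
    unit TEXT NOT NULL
);
""")

# Invented example rows: two publications reporting data on two samples.
conn.executemany(
    "INSERT INTO source VALUES (?, ?)",
    [(1, "Hypothetical Author A (2001)"), (2, "Hypothetical Author B (2009)")],
)
conn.executemany(
    "INSERT INTO sample VALUES (?, ?)",
    [(1, "EX-001"), (2, "EX-002")],
)
conn.executemany(
    "INSERT INTO analysis (sample_id, source_id, variable, value, unit) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "SiO2", 50.1, "wt%"),
        (1, 2, "MgO", 7.8, "wt%"),
        (2, 2, "SiO2", 48.9, "wt%"),
    ],
)


def integrated_dataset(connection, variable):
    """Return (sample, value, unit, citation) rows for one variable across all sources."""
    query = """
        SELECT s.name, a.value, a.unit, r.citation
        FROM analysis AS a
        JOIN sample AS s ON s.id = a.sample_id
        JOIN source AS r ON r.id = a.source_id
        WHERE a.variable = ?
        ORDER BY s.name
    """
    return connection.execute(query, (variable,)).fetchall()


sio2 = integrated_dataset(conn, "SiO2")
```

Keeping the citation attached to every returned row is one way to preserve attribution to the original investigators when data from several publications are combined.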
The state of the art of geochemical data publication was laid out a decade ago by Staudigel et al. (2003) with the goal of initiating discussion of data formats and metadata in geochemistry at the “earliest stages of [geochemistry’s] exploitation of Information Technology”. Staudigel et al. (2003) highlighted complexities within the organizational structure relating to standardization, conventions, lack of tabular data, and incomplete metadata. These issues have not disappeared, but their management and mitigation have improved significantly. In the last decade, improvements such as governmental data policy statements (e.g., U.S. Office of Management and Budget memorandum M-13-13, “Open Data Policy: Managing Information as an Asset” [http://www.whitehouse.gov/sites/default/files/omb/memoranda/2013/m-13-13.pdf]), endorsement of best practices, and stricter rules regarding data reporting were implemented by editors, reviewers, professional societies, and funding agencies (e.g., CODATA Scientific Data Policy Statements). Editors from several peer-reviewed journals that publish manuscripts containing geochemical data agreed on minimum standards for documenting data quality, sample information, and data format and accessibility; these standards were published as the Editors Roundtable document “Requirements for the Publication of Geochemical Data” (Goldstein et al., 2014). The recommendations have been implemented by some journals, but strict enforcement is not yet common.
Data management software that works directly with laboratory equipment is one of the most efficient ways to overcome the hurdle of initiating data management. Complementing the suggested reporting norms for geochemical data, the Geochron (www.geochron.org) software works directly with mass spectrometers and data reduction programs to retain the essential sample metadata (Bowring et al., 2011; Walker et al., 2011). The automated software improves the workflow and streamlines metadata preservation by bringing data directly from the instrument to data management and visualization software on the computer. Software of this type has greatly increased the ability of scientists to collect, manage, and publish data that can be easily contributed to sample-based databases.
While these software programs offer ease of use and accessibility to instrument users, the hosting database is commonly maintained by a different entity. Because the hosting database connects data from various input sources, maintaining integrated synthesis databases is an arduous task that involves sustaining controlled vocabularies, obtaining data from authors, and tracking data in a way that captures the complex metadata relationships. Increasingly, investigators are seeking rapid publication of their data, along with the ability to search multiple disciplinary databases at once. To address these needs and provide useful search and discovery tools, EarthChem has built several complementary systems to support its user community.