EcoInfo_Manuscript

Jessica Couture, Rachael E. Blake, Matthew B. Jones, Gavin McDonald, Colette Ward

* NOTE that authors after the first author are currently in alphabetical order. This can be changed following further discussion

WORK IN PROGRESS. DO NOT CITE.

Abstract


This is where the abstract goes.

Introduction

To enhance discovery, efficiency, and transparency in their work, many scientists have promoted an approach known broadly as open science. Within open science there is a strong focus on data sharing, on the premise that publicly available data can have much higher impact than data limited to the creator's own analyses (Mascarelli 2009, Gewin 2013, Obama 2013, Hampton et al. 2015). Although this movement has been gaining steam and a number of tools have been designed to facilitate openness at each step of the scientific process, those deeply involved in the development of open science acknowledge that the movement is still in its adolescence (Bjork 2004, Ram 2013, Hampton et al. 2015). Many researchers encounter both technical and cultural difficulties in applying these practices (Costello et al. 2013, Van Noorden 2013, Hampton et al. 2015).
 
In contrast to this recent trend, many funders and publishers have long required their awardees and authors to share or publish data. Despite these established policies, weak enforcement has led to consistently low compliance (Wolins 1962, Wicherts et al. 2006, Savage & Vickers 2009, Alsheikh-Ali et al. 2011, Vines et al. 2014). The technical hurdles to sharing data have historically been many, and while digitization should help overcome these obstacles, many funders and publishers still fail to provide infrastructure and technical support for required data publication. In addition, a historic culture of perceived ownership of one's data, combined with competition for funding and for publication space, impedes the adoption of open data sharing practices (Sieber 1989, Hampton et al. 2015). In an effort to defend intellectual novelty and secure publication opportunities, scientists often withhold data from the larger scientific community.
 
Although data requirements have been treated with similar leniency by both funders and publishers, the nature of these preconditions varies: a funder shares ownership of the collected data, whereas a publisher is merely a platform, so any ownership dispute concerns the publication rather than the data themselves. Researchers might therefore feel more compelled to comply with a funder's requirements; alternatively, because publications are the currency of science, the converse might occur. A number of studies have evaluated rates of journal-specific data reporting, but none have focused on funder success, and the effectiveness of data sharing policies may differ between these two kinds of institutions. To test whether funding requirements result in different rates of data sharing than journal requirements, we determined whether the recovery rate for a specific funding agency differed from the results of journal-specific data salvage efforts.
 
In addition to differences based on who establishes the requirements, data sharing may differ by other characteristics, such as the age of the data (number of years since the data were produced), the research field, or the agency sector of the data collector. Growing support for the open science movement suggests an increasing willingness to share data and underscores the hypothesis that more data should be available from recent years than from earlier ones. Michener et al. hypothesized a temporal degradation in knowledge of one's own data: the older the data, the less information exists about the data and associated metadata, both in the collector's memory and in the physical form of the data (Michener 1997). Recent studies have supported this hypothesis (Vines et al. 2014, Baker 2017), and the effect is compounded by the increased availability of data documentation and sharing tools. Larger and faster servers and data sharing tools such as GitHub, cloud services, and free online repositories, all popularized in the past decade, should lead to more data sharing overall than in previous decades (Reichman et al. 2011, Ram 2013, Hampton et al. 2015). Similarly, differences between research fields in data collection protocols, instrumentation, or confidentiality of information may lead to better data preservation in some disciplines. For example, data collected automatically or using well-established protocols may be more easily shared than data requiring more complex processing, or than confidential data such as that involving human subjects; fields with more of these latter data types may experience hurdles that inhibit data sharing and preservation. Furthermore, a scientist's agency affiliation may influence willingness or ability to share data. A private consulting agency may prefer to keep data private to protect client confidentiality or increase profitability. In contrast, if a scientist collects data under a public agency, their department may compel or even require data publication (Obama 2013); many public government agencies have both external and internal data sharing policies and are more likely to provide their employees with established protocols and systems for data sharing.
 
Here we assess the ability to retroactively archive ecological and environmental data and evaluate patterns in data recovery for a single funding body. To test these trends, we focused our study on the data-collection efforts of the Exxon Valdez Oil Spill Trustee Council (EVOSTC). The EVOSTC, instituted in 1989 to manage monetary damages paid by Exxon following the Exxon Valdez oil spill in the Gulf of Alaska that year, has funded hundreds of projects since its inception. The EVOSTC requires all recipients of its grants to publish data within one year of collection, but it neither specifies archiving methods nor provides a publication platform. The EVOSTC did make an effort to collect these data in the mid-1990s, but the success of that effort is unknown because the contents of the collection have since been lost. EVOSTC grants have funded an array of government entities, private consulting firms, and non-governmental organizations, as well as a few Alaska Native groups, and the diversity of the grantees was compounded by the variety of scientific disciplines in which they operated. We asked: 1) For how many of these projects could we recover data? 2) Were there trends in data reporting based on data or grantee characteristics? 3) When data were not procured, why were we unsuccessful?



Methods

Data recovery and archiving
From 2012 to 2014, a team of one full-time and three part-time staff members was assigned to collect and archive data funded by the EVOSTC, specifically targeting projects funded between 1989 and 2010. Project information was obtained from the projects page of the EVOSTC website, which includes varying levels of detail for each project, ranging from the project title alone to full bibliographic information and attached reports. Throughout our extensive data recovery effort, we took careful notes on our outreach efforts, communications, and progress in publicly archiving acquired data. We tracked the progress of the data request and acquisition process for each project using five status labels: "outreach attempted", "contact established", "data received", "published", and "unrecoverable".
 
Grantee contact information was obtained through agency websites and Google searches based on the information gathered from the EVOSTC website. When we found contact information for a listed grantee, we made an initial outreach email or phone call, and efforts and communications were tracked in an internal ticketing system. The data support team conducted outreach and provided support in the form of data formatting and metadata creation in order to minimize barriers to data sharing. Recovered data were published to the Gulf of Alaska Historic Data Portal. At the close of the data recovery effort (fall 2014), we quantified the number of projects that fell under each of the five status labels to assess the final status of the effort.
 
Trends in data recovery
Projects were further characterized by three descriptors: the research field of the project (field), the agency sector of the grantee's home institution (sector), and the age of the data when recovery efforts were initiated (age). For analysis, each project was assigned a binary response (recovered or not recovered) by grouping the status labels as follows: recovered = "published" or "data received"; not recovered = "unrecoverable", "outreach attempted", or "contact established". Because many projects spanned multiple years, the age of the data was calculated as the number of years since the last year a project received EVOSTC funding. The effort was initiated in 2012, so the most recent data pursued were two years old.
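As a concrete illustration of this recoding, the following minimal R sketch assumes a project table with hypothetical columns status and last_funded_year; these names are illustrative, not those of our tracking system.

    # Minimal sketch of the binary recoding and age calculation;
    # the `projects` data frame and its columns are hypothetical.
    projects$recovered <- factor(
      ifelse(projects$status %in% c("published", "data received"),
             "recovered", "not recovered"))

    # Age = years between a project's last EVOSTC-funded year and
    # the start of the recovery effort in 2012.
    projects$age <- 2012 - projects$last_funded_year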
 
The effects of these characteristics on the likelihood of data recovery were assessed using two statistical approaches: chi-squared tests and a random forest analysis. Chi-squared tests were run on each characteristic individually to test for differences in recovery success across its classes. We then asked which characteristics were most important in determining whether the data from a project were successfully recovered. To assess the relative importance of the three characteristics, we conducted a random forest analysis in R version 3.3.3 using the "party" package and produced a classification tree to visualize how these characteristics affect the outcome (Hothorn et al. 2006a, Hothorn et al. 2006b, Strobl et al. 2007, Strobl et al. 2008).
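A condensed sketch of these analyses follows, continuing from the hypothetical projects data frame above; the model settings (ntree, mtry) and the binning of age for the contingency table are illustrative assumptions, not necessarily the values we used.

    library(party)

    # Chi-squared test of recovery success against each characteristic;
    # the continuous age variable is binned for the contingency table
    # (the number of bins is illustrative).
    chisq.test(table(projects$field,  projects$recovered))
    chisq.test(table(projects$sector, projects$recovered))
    chisq.test(table(cut(projects$age, breaks = 4), projects$recovered))

    # Conditional inference forest ("party" package) to rank the
    # relative importance of the three characteristics.
    set.seed(1)
    cf <- cforest(recovered ~ field + sector + age, data = projects,
                  controls = cforest_unbiased(ntree = 500, mtry = 2))
    varimp(cf)  # permutation-based variable importance

    # Single conditional inference tree to visualize the splits.
    plot(ctree(recovered ~ field + sector + age, data = projects))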
 
Hurdles to recovery
Also of interest were the reasons data were not recovered. To quantify these, we categorized the projects for which we were unable to gather any data based on notes from our correspondence. The projects included in this analysis were those in the not recovered grouping. Reasons were recorded only when we had direct confirmation through communications; otherwise, unrecovered projects were characterized by their status (e.g., projects with the "outreach attempted" status were added to the "no contact information" group, since we could never confirm that our outreach actually reached the target recipient). Non-digital data were deemed "unrecoverable" here because our project lacked the resources to convert or store such data; where possible, we have since digitized these data and published them in the archive.
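A brief sketch of this tally, again in R; the reason column (filled from our correspondence notes) and its labels are hypothetical.

    # Tally hurdles among unrecovered projects; never-confirmed
    # outreach is folded into "no contact information".
    unrecovered <- subset(projects, recovered == "not recovered")
    unrecovered$reason[unrecovered$status == "outreach attempted"] <-
      "no contact information"
    table(unrecovered$reason)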