

Jessica Couture, Rachael E. Blake, Matthew B. Jones, Gavin McDonald, Colette Ward

* NOTE that authors after the first author are currently in alphabetical order. This can be changed following further discussion



This is where the abstract goes.


To enhance discovery, efficiency and transparency in their work, scientists have begun developing an approach to their disciplines known broadly as open science. This fledgling movement prioritizes the comprehensive, public sharing of scientific work, from initial brainstorming through final publication. As open science gains popularity within scientific institutions around the globe, many researchers are encountering difficulties in putting it into practice (citation). Although scientists have developed a number of tools to facilitate openness at each step of the scientific process, those deeply involved in this development acknowledge that the movement is still in its adolescence (citation).

Access to data, a foundation of scientific inquiry, is an essential element of a more transparent science, yet it faces both conceptual and practical obstacles. A historic culture of perceived ownership over one’s data, combined with competition for funding and publication space, serves as a cultural impediment to the adoption of data sharing practices (citation). Scientists' efforts to defend their ideas from theft and to secure publication opportunities often involve withholding data from the larger scientific community. Additionally, many commonly used data tools do not prepare data for publication or sharing, so this extra step requires additional training, effort and resources. While tools have been developed to help scientists archive their data so that they remain accessible and preserved through time, these tools and an understanding of their appropriate use are still not widely available to scientists. William Michener hypothesizes a temporal degradation in the knowledge of one’s own data, and concludes that the older data are, the less information exists about both the data themselves and the associated metadata (Michener 1997). Given this conclusion, it is realistic to assume that older data will inherently be less accessible than newer data, even to those most intimately associated with them. Furthermore, accelerating changes in technology should raise these barriers further over time, as archiving and file formats become obsolete.

A number of studies have tested the availability of data under open publication requirements, some dating back to the 1960s. These studies have focused on relatively small sample sizes and data acquisition efforts and have consistently recovered a mere 20-30% of the data requested (Wolins 1962, Wicherts et al. 2006, Savage & Vickers 2009, Alsheikh-Ali et al. 2011). We wanted to test whether a larger and more contemporary (make this more specific) effort would return a higher percentage of data than previous studies, and also provide documentation reflecting why data were not acquired.     

To accomplish this analysis, we focused our study on the data-collection efforts of the Exxon Valdez Oil Spill Trustee Council (EVOSTC). In 2010 the EVOSTC funded the Gulf Watch Alaska group to create an open archive of the data collected under their grants. The EVOSTC was formed following the Exxon Valdez oil spill in the Gulf of Alaska in 1989, and has funded hundreds of projects since its inception. As a result, the data we were seeking would span more than two decades. The EVOSTC required all grant recipients to publish their data within one year of data collection, but did not specify or provide a publication platform. Grants provided by the group funded government agencies, private consulting firms and non-governmental organizations, as well as Alaskan native groups. The diversity of the grantees was compounded by the variety of scientific disciplines in which they operated. Within such a broad field of data collection, we wanted to know for how many of these projects we could acquire data, whether there were trends in data reporting based on data or grantee characteristics and, where data were not procured, why we were unsuccessful. The EVOSTC did make an effort to collect these data in the mid-1990s, but the success of this effort is unknown because the content of this collection has since been lost.


To assess our success, we asked: for how many EVOSTC-funded projects can we recover data? For these projects we were interested in trends in reporting, in particular which grantees were more likely to provide data and whether there was a temporal trend as predicted by Michener (1997). To test this, we asked whether data reporting differs based on any of three project characteristics: 1) data field, 2) grantee's agency sector, and 3) age of data. Finally, for the data we were unable to recover, we wanted to know why the data were not shared. Since each funded project had an unknown number of “datasets”, we base success on the publication of at least one dataset, regardless of size or complexity. Throughout our extensive data recovery effort, we took careful notes on our communications and on our efforts to publicly archive acquired data, for use in statistical analyses.

Data recovery and archiving:

From 2012 to 2014, a team of one full-time staff member and three part-time student interns collected and archived data funded by the Exxon Valdez Oil Spill Trustee Council (EVOSTC), specifically targeting projects funded between 1989 and 2010. Project information was obtained from the projects page on the EVOSTC website, which includes varying levels of detail for each project, ranging from project title only to full bibliographic information and attached reports. We tracked the progress of the data request and acquisition process for each project using six status labels: “emailed”, “replied”, “sent data”, “data revised”, “published” and “unrecoverable”. Contact information was obtained through agency sites and Google searches based on the information we were able to gather from the EVOSTC site. If we were able to find contact information for the listed principal investigator (either an email address or a phone number), we made an outreach email or phone call explaining the data recovery project and requesting data for the project in question. Projects for which outreach could be made were labeled “emailed”. If we then received a reply to the outreach, and could therefore confirm that the contact information was correct, the project label was promoted to “replied”, regardless of the level of cooperation expressed in the response. Once we received data for a project, the label was changed to “sent data”. Since it was often difficult to determine how much data had been created under each project, a project was considered successful if any data were received. If the data were clean and sufficiently well documented and/or the contact was responsive enough to guide us through any necessary data edits and metadata creation, the project was labeled “data revised”. Once the contact approved our edits and compiled metadata, the products were published to the Gulf of Alaska Historic Data Portal and the project was labeled “published”.
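
The status tracking described above can be sketched as a small state model. This Python sketch is illustrative only and is not the tracking tool used in the project: the names `PROGRESSION`, `promote` and `is_success` are hypothetical, the labels follow the six-stage list, and the rule that a label only ever moves forward is an assumption.

```python
# Ordered labels for a request that keeps progressing (from the six-stage list).
PROGRESSION = ["emailed", "replied", "sent data", "data revised", "published"]
UNRECOVERABLE = "unrecoverable"

# Assumption: any label from "sent data" onward means data were received.
DATA_RECEIVED = set(PROGRESSION[2:])


def is_success(label):
    """A project counts as successful once any data have been received."""
    return label in DATA_RECEIVED


def promote(current, new):
    """Return the label a project should carry after a new event.

    Assumptions (not stated in the text): labels never move backwards, and a
    project that has already supplied data is never marked unrecoverable.
    """
    if new == UNRECOVERABLE:
        return current if is_success(current) else UNRECOVERABLE
    if current == UNRECOVERABLE:
        return new  # e.g. a contact resurfaces after the project was written off
    if PROGRESSION.index(new) > PROGRESSION.index(current):
        return new
    return current


label = "emailed"
for event in ["replied", "sent data", "replied"]:
    label = promote(label, event)
print(label, is_success(label))
```

Keeping the ordering in a single list means the success criterion (any data received) is defined in one place and applied consistently to every project.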

Statistical Analysis:

At the close of the data collation effort, we quantified the number of projects that fell into each of the above status labels and plotted them in a simple bar chart in order to assess the final status of our archiving undertaking.
Results of the outreach efforts were further characterized based on three project descriptors: research field of the project (field), agency sector of the home institution of the principal investigator (sector), and age of the data when recovery efforts were initiated (age). Because some projects spanned multiple years, age of data was calculated as the number of years between the last year a project received EVOSTC funding and the start of recovery efforts.
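
As a minimal sketch of the age calculation, assuming funding years are available as a list of integers per project: the function name `data_age` is hypothetical, and the use of 2012 (the first year of the recovery effort) as the reference year is an assumption.

```python
def data_age(funding_years, recovery_start_year=2012):
    """Years between a project's last funded year and the start of recovery.

    funding_years: every year in which the project received funding.
    recovery_start_year: assumed reference year for the recovery effort.
    """
    if not funding_years:
        raise ValueError("project has no recorded funding years")
    return recovery_start_year - max(funding_years)


# A hypothetical project funded from 1994 through 1996.
print(data_age([1994, 1995, 1996]))  # prints 16
```

Using the last funded year rather than the first means a long-running project is treated as being as recent as its final year of funding.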
The impacts of these characteristics on the likelihood of providing data were assessed using statistical tests suited to discrete data: Chi-squared and random forest analyses. Chi-squared tests were run on each characteristic individually to test for differences in success among its groups. We then asked which of these characteristics was the most important in determining whether the data from a project would be successfully recovered. To assess the relative importance of the three characteristics, we ran a random forest analysis using the “party” package in the R programming language. We also produced a classification tree to visualize how these characteristics affect our outcome.
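
To make the Chi-squared step concrete, the following Python sketch computes the Pearson statistic and degrees of freedom for a contingency table of groups (rows) by outcome (columns). It is an illustration, not the analysis code, which was run in R; the function name `chi_squared_independence` is hypothetical, the counts are invented purely to exercise the function, and the random forest step is not reproduced here.

```python
def chi_squared_independence(table):
    """Pearson chi-squared statistic and degrees of freedom.

    table: list of rows of observed counts, e.g. one row per sector and one
    column per outcome (data recovered, data not recovered).
    Assumes every row and column total is greater than zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if outcome were independent of group.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Invented counts for two groups; these are NOT results from this study.
toy_counts = [[10, 20],
              [20, 10]]
stat, dof = chi_squared_independence(toy_counts)
print(round(stat, 2), dof)  # prints 6.67 1
```

The statistic is then compared against a chi-squared distribution with the returned degrees of freedom to obtain a p-value, a step that statistical packages perform directly.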

Hurdles to recovery:

Also of interest were the reasons data were not recovered. To quantify these reasons, we categorized the projects for which we were unable to gather any data, based on notes from our correspondence. The projects in this analysis are those labeled “emailed”, “replied” and “unrecoverable” in the status categorization. Projects in the category “replied” were instances in which we received at least an initial reply confirming that we had accurate contact information but lost communication before the data were sent. These projects have been re-categorized here as communication lost. For the “unrecoverable” projects, notes were used to divide them into: no contact information, data lost, non-digital data, unwilling to share, and requested additional funding. The projects with the “emailed” status have been added to the no contact information group, since we were never able to confirm that the outreach efforts actually reached the intended recipient. Data are labeled data lost only if our contact confirmed the loss. Similarly, the non-digital data, unwilling to share and requested additional funding cases are labeled as such only for projects where data requests were rejected for these confirmed reasons. Non-digital data are deemed “unrecoverable” here since our project lacked the resources to convert or store such data. We have since converted these data, when possible, and refer to them in the discussion.
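
The re-categorization rules above can be summarized in a short Python sketch. It is illustrative only: the names `hurdle_category` and `CONFIRMED_REASONS` are hypothetical, and the example projects are invented to show the mapping, not drawn from our records.

```python
from collections import Counter

# Reasons that can be assigned to "unrecoverable" projects from correspondence notes.
CONFIRMED_REASONS = {
    "no contact information",
    "data lost",
    "non-digital data",
    "unwilling to share",
    "requested additional funding",
}


def hurdle_category(status, noted_reason=None):
    """Map a project's final status (and noted reason) to a hurdle category."""
    if status == "emailed":
        # Outreach was never confirmed to have reached the recipient.
        return "no contact information"
    if status == "replied":
        # Contact was confirmed, but communication ended before data were sent.
        return "communication lost"
    if status == "unrecoverable":
        if noted_reason not in CONFIRMED_REASONS:
            raise ValueError("unrecoverable projects need a reason from the notes")
        return noted_reason
    raise ValueError("projects that supplied data are not part of this analysis")


# Invented examples, one (status, noted reason) pair per project.
projects = [
    ("emailed", None),
    ("replied", None),
    ("unrecoverable", "data lost"),
    ("unrecoverable", "no contact information"),
]
tally = Counter(hurdle_category(status, reason) for status, reason in projects)
print(tally["no contact information"])  # prints 2
```

Raising an error for an unrecoverable project without a noted reason mirrors the rule that a category is assigned only when the reason was confirmed.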