10 Simple Rules for the Care and Feeding of Scientific Data

Alyssa Goodman, Alberto Pepe, Alexander W. Blocker,
Christine L. Borgman, Kyle Cranmer, Mercè Crosas,
Rosanne Di Stefano, Yolanda Gil, Paul Groth,
Margaret Hedstrom, David W. Hogg, Vinay Kashyap,
Ashish Mahabal, Aneta Siemiginowska, Aleksandra Slavkovic

Introduction

In the early 1600s, Galileo Galilei turned a telescope toward Jupiter.
In his log book each night, he drew to-scale schematic diagrams of
Jupiter and some oddly-moving points of light near it. Galileo labeled
each drawing with the date. Eventually he used his observations to
conclude that the Earth orbits the Sun, just as the four Galilean moons orbit
Jupiter. History shows Galileo to be much more than an astronomical hero,
though. His clear and careful record keeping and publication style not
only let Galileo understand the Solar System, they continue to let anyone
understand how Galileo did it. Galileo’s notes directly integrated his data (drawings of Jupiter and its
moons), key metadata (timing of each observation, weather, telescope
properties), and text (descriptions of methods, analysis, and conclusions). Critically, when Galileo included the information from those notes in Sidereus Nuncius (Galilei 1610), this integration of text, data, and metadata was preserved, as shown in Figure 1. Galileo's work advanced the "Scientific Revolution," and his approach to observation and analysis contributed significantly to the shaping of the modern "Scientific Method" (Galilei 1618, Drake 1957).

Today most research projects are considered complete when a journal article based on the analysis has been written and published. The trouble is that, unlike Galileo's report in Sidereus Nuncius, the amount of real data and data description in modern publications is almost never sufficient to repeat or even statistically verify the study being presented. Worse, researchers wishing to build upon and extend work presented in the literature often have trouble recovering data associated with an article after it has been published. More often than scientists would like to admit, they cannot even recover the data associated with their own published works.

Complicating the modern situation, the words "data" and "analysis" have a wider variety of definitions today than at the time of Galileo. Theoretical investigations can create large "data" sets through simulations (e.g. the Millennium Simulation Project). Large-scale data collection often takes place as a community-wide effort (e.g. the Human Genome Project), which leads to gigantic online "databases" (organized collections of data). Computers ar