Alyssa Goodman edited Introduction.md  over 10 years ago


Today most research projects are considered complete when a journal article based on the analysis has been written and published. The trouble is that, unlike Galileo's report in _Sidereus Nuncius_, the amount of real data and data description in modern publications is almost never sufficient to repeat, or even statistically verify, the study being presented. Worse, researchers wishing to build upon and extend work presented in the literature often have trouble recovering the data associated with an article after it has been published. More often than scientists would like to admit, they cannot even recover the data associated with their own published works.

Complicating the modern situation, the words "data" and "analysis" have a wider variety of definitions today than they did at the time of Galileo. Theoretical investigations can create large "data" sets through simulations (e.g. [The Millennium Simulation Project](http://www.mpa-garching.mpg.de/galform/virgo/millennium/)). Large-scale data collection often takes place as a community-wide effort (e.g. [The Human Genome Project](http://www.genome.gov/10001772)), which leads to gigantic online "databases" (organized collections of data). Computers are now so essential to simulations and to the processing of experimental and observational data that it is often hard to draw a dividing line between "data" and "analysis" (or "code") when discussing the "care and feeding" of data. Sometimes, a copy of the code used to create or process data is so essential to using the data later that it should almost be thought of as part of the "metadata" description of the data set. Other times, the code used in a scientific study is more separable from the data, but even in those cases many of the "care and feeding" principles discussed here apply to code as well as they do to data. So how do we go about caring for and feeding data?
Extra work, no doubt, is associated with nurturing your data, but care up front will save time and increase insight later. Even though modern researchers, especially those in large collaborations, know that conducting research with sharing and reuse in mind is essential, doing so still requires a paradigm shift. Most people are still motivated by piling up publications and by getting to the next one as soon as possible. But the more we scientists find ourselves wishing we had access to extant but now unfindable data \cite{holdren}, the more we will realize why bad data management is bad for science. How can we improve?