Authorea

Alyssa Goodman edited Introduction.md over 10 years ago

Commit id: 5c648e43fbb1f5d296f4b2fe6094025cb6ff05bb


Today, most research projects are considered complete when a journal article based on the analysis has been written and published. The trouble is that, unlike Galileo's report in _Sidereus Nuncius_, the amount of real data and data description in modern publications is almost never sufficient to repeat or even statistically verify the study being presented. Worse, researchers wishing to build upon and extend work presented in the literature often have trouble recovering the data associated with an article after it has been published. More often than scientists would like to admit, they cannot even recover the data associated with their own published works.

Complicating the modern situation, the words "data" and "analysis" have a wider variety of definitions today than at the time of Galileo. Theoretical investigations can create large "data" sets through simulations (e.g. [The Millennium Simulation Project](http://www.mpa-garching.mpg.de/galform/virgo/millennium/)). Large-scale data collection often takes place as a community-wide effort (e.g. [The Human Genome Project](http://www.genome.gov/10001772)), which leads to gigantic online "databases" (organized collections of data). Computers are so essential in simulations, and in the processing of experimental and observational data, that it is also often hard to draw a dividing line between "data" and "analysis" (or "code") when discussing the care and feeding of "data." Sometimes, a copy of the code used to create or process data is so essential to the use of those data later that the code should almost be thought of as part of the "metadata" description of the data. Other times, the code used in a scientific study is more separable from the data, but even in those cases many of the "care and feeding" principles discussed here apply to code as well as they do to data. So how do we go about caring for and feeding data?
Extra work, no doubt, is associated with nurturing your data, but care up front will save time and increase insight later. Even though modern researchers, especially in large collaborations, know that conducting research with sharing and reuse in mind is essential, doing so still requires a paradigm shift. Most people are still motivated by piling up publications and by getting to the next one as soon as possible. But the more we scientists find ourselves wishing we had access to extant but now unfindable data \cite{holdren}, the more we will realize why bad data management is bad for science. How can we improve?