10 Simple Rules for the Care and Feeding of Scientific Data
Alyssa Goodman, Alberto Pepe, Alexander W. Blocker,
Christine L. Borgman, Kyle Cranmer, Mercè Crosas,
Rosanne Di Stefano, Yolanda Gil, Paul Groth,
Margaret Hedstrom, David W. Hogg, Vinay Kashyap,
Ashish Mahabal, Aneta Siemiginowska, Aleksandra Slavkovic
In the early 1600s, Galileo Galilei turned a telescope toward Jupiter. In his log book each night, he drew to-scale schematic diagrams of Jupiter and some oddly-moving points of light near it. Galileo labeled each drawing with the date. Eventually he used his observations to conclude that the Earth orbits the Sun, just as the four Galilean moons orbit Jupiter. History shows Galileo to be much more than an astronomical hero, though. His clear and careful record keeping and publication style not only let Galileo understand the Solar System, they continue to let anyone understand how Galileo did it. Galileo’s notes directly integrated his data (drawings of Jupiter and its moons), key metadata (timing of each observation, weather, telescope properties), and text (descriptions of methods, analysis, and conclusions). Critically, when Galileo included the information from those notes in Sidereus Nuncius (Galilei 1610), this integration of text, data and metadata was preserved, as shown in Figure 1. Galileo's work advanced the "Scientific Revolution," and his approach to observation and analysis contributed significantly to the shaping of today's "Scientific Method" (Galilei 1618, Drake 1957).
Today most research projects are considered complete when a journal article based on the analysis has been written and published. The trouble is that, unlike Galileo's report in Sidereus Nuncius, the amount of real data and data description in modern publications is almost never sufficient to repeat or even statistically verify the study being presented. Worse, researchers wishing to build upon and extend work presented in the literature often have trouble recovering data associated with an article after it has been published. More often than scientists would like to admit, they cannot even recover the data associated with their own published works.
Complicating the modern situation, the words "data" and "analysis" have a wider variety of definitions today than at the time of Galileo. Theoretical investigations can create large "data" sets through simulations (e.g. The Millennium Simulation Project). Large scale data collection often takes place as a community-wide effort (e.g. The Human Genome project), which leads to gigantic online "databases" (organized collections of data). Computers are so essential in simulations, and in the processing of experimental and observational data, that it is also often hard to draw a dividing line between "data" and "analysis" (or "code") when discussing the care and feeding of "data." Sometimes, a copy of the code used to create or process data is so essential to the use of those data that the code should almost be thought of as part of the "metadata" description of the data. Other times, the code used in a scientific study is more separable from the data, but even then, many preservation and sharing principles apply to code just as well as they do to data.
So how do we go about caring for and feeding data? Extra work, no doubt, is associated with nurturing your data, but care up front will save time and increase insight later. Even though a growing number of researchers, especially in large collaborations, know that conducting research with sharing and reuse in mind is essential, it still requires a paradigm shift. Most people are still motivated by piling up publications and by getting to the next one as soon as possible. But, the more we scientists find ourselves wishing we had access to extant but now unfindable data (Holdren 2013), the more we will realize why bad data management is bad for science. How can we improve?
This article offers a short guide to the steps scientists can take to ensure that their data and associated analyses continue to be of value and to be recognized. In just the past few years, hundreds of scholarly papers and reports have been written on questions of data sharing, data provenance, research reproducibility, licensing, attribution, privacy, and more, but our goal here is not to review that literature. Instead, we present a short guide intended for researchers who want to know why it is important to "care for and feed" data, with some practical advice on how to do that. The set of Appendices at the close of this work offers links to the types of services referred to throughout the text. Boldface lettering below highlights actions one can take to follow the suggested rules.
Data management is a repeat-play game. If you take care to make your data easily available to others, others are more likely to do the same, eventually. While we wait for this new sharing equilibrium to be reached, you can take two important actions. First, cherish, document, and publish your data, preferably using the robust methods described in Rule 2. Get started now: better tools and resources for data management are becoming more numerous; universities and research communities are moving toward bigger investments in data repositories (Rule 8); and more librarians and scientists are learning data management skills (Rule 10). At the very least, loving your own data and keeping them available will serve you: you'll be able to find and reuse your own data if you treat them well. Second, enable and encourage others to cherish, document, and publish their data. If you are a research scientist, chances are that you are not only an author but also a reviewer for a specialized journal or conference venue. As a reviewer, request that the authors of papers you review provide documentation of, and access to, their data according to the rules set out in the remainder of this article. While institutional approaches are clearly essential (Rules 8 and 10), changing minds one scientist at a time is effective as well.
Data from others are hard to use without context describing what the data are and how they were obtained. The W3C Provenance Group defines information provenance as the sum of all of the processes, people (institutions or agents), and documents (data included!) that were involved in generating or otherwise influencing or delivering a piece of information. Perfect documentation of provenance is rarely, if ever, attained in scientific work today. The higher the quality of provenance information, the higher the chance of enabling data reuse. In general, data reuse is most possible when: 1) data; 2) metadata (information describing the data); and 3) information about the process of generating those data, such as code, are all provided. In trying to follow the Rules listed in this article, you will do best if you plan in advance for ways to provide all three kinds of information. In carrying out your work, consider what level of reuse you realistically expect and plan accordingly. Do you want your work to be fully reproducible? If so, then provenance information is a must (e.g., working pipeline analysis code, a platform to run it on, and verifiable versions of the data). Or do you just want your work to be inspectable? If so, then intermediate data products and pseudo-code may be sufficient. Or maybe your goal is for your data to be usable in a wide range of applications? If so, consider adopting standard formats and metadata standards early on. At the very least, keep careful track of versions of data and code, with associated dates. Taking these steps as you plan and carry out projects will earn you the thanks of researchers, including you, looking back from the future. (Consult Appendix E for a list of tools to package all your research materials with reuse in mind.)
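As one illustration of the minimum suggested above (keeping track of versions of data and code, with associated dates), the following is a rough sketch, not a recommended tool. It walks a project directory and records a checksum, size, and modification date for each file, so that a later reader can check whether the files in hand are the versions that were used. The function names and the manifest layout are assumptions made for this example only.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Describe every file under `root` (hypothetical layout, for illustration)."""
    files = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        stat = path.stat()
        modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
        files.append({
            "path": path.relative_to(root).as_posix(),
            "bytes": stat.st_size,
            "sha256": sha256_of(path),
            "modified_utc": modified.isoformat(),
        })
    return {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "files": files,
    }


def manifest_json(root: Path) -> str:
    """Serialize the manifest so it can be stored next to the data deposit."""
    return json.dumps(build_manifest(root), indent=2)


# Example (the directory name is hypothetical):
# print(manifest_json(Path("my_project")))
```

A manifest like this does not replace version control or a repository deposit; it simply makes "which version of which file, and when" answerable later.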
Publishing a description of your processing steps offers essential context for interpreting and re-using data. As such, scientists typically include a "methods" and/or "analysis" section in a scholarly article, used to describe data collection, manipulation, and analysis processes. Computer and information scientists call the combination of the collection methods and analysis processes for a project its "workflow," and they consider the information used and captured in the workflow to be part of the "provenance" of the data. In some cases (mostly in genomics), scientists can use existing workflow software in running experiments and in recording what was done in those experiments, e.g. GenePattern. In that best-case scenario, the workflow software, its version, and the settings used can be published alongside data using the other rules laid out here. But it is rare outside of genomics to see the end-to-end process described in a research paper run, orchestrated, and/or recorded by a single software package. In a plausible utopian future, automated workflow documentation could extend to all fields, so that an electronic provenance record could link together all the pieces that led to a result: the data citation (Rule 2), the pointer to the code (Rule 6), the workflow (this Rule), and a scholarly paper (Rule 5). But what can you do now? At a minimum, provide, alongside any deposit of data, a simple sketch of data flow across software, indicating how intermediate and final data products and results are generated. If it’s feasible and you are willing to deal with a higher level of complexity, also consider using an online service to encapsulate your workflow (see Appendix C for a list of services). Keep in mind that even if the data used are not "new," in that they come from a well-documented archive, it is still important to document the archive query that produced the data you used, along with all the operations you performed on the data after they were retrieved.
Keeping better track of workflow, as context, will likely benefit you and your collaborators enough to justify the loftier, more altruistic, goals espoused here.
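To make the "simple sketch of data flow" concrete, here is a minimal illustration of how one might record each processing step (including the archive query) and print a plain-text summary to deposit with the data. This is a sketch under stated assumptions, not a substitute for real workflow software: the `Step` and `Workflow` classes, the field names, and every software name, version, and query in the example are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Step:
    """One processing step: what ran, on which inputs, producing which outputs."""
    name: str
    software: str
    version: str
    inputs: List[str]
    outputs: List[str]
    parameters: Dict[str, str] = field(default_factory=dict)


@dataclass
class Workflow:
    """An ordered list of steps, in the order they were carried out."""
    title: str
    steps: List[Step] = field(default_factory=list)

    def add(self, step: Step) -> "Workflow":
        self.steps.append(step)
        return self

    def sketch(self) -> str:
        """Render a plain-text data-flow sketch, one line per step."""
        lines = [self.title]
        for number, step in enumerate(self.steps, start=1):
            lines.append(
                f"{number}. {', '.join(step.inputs)} -> "
                f"[{step.name}: {step.software} {step.version}] -> "
                f"{', '.join(step.outputs)}"
            )
        return "\n".join(lines)

    def to_json(self) -> str:
        """Serialize the full record, parameters included."""
        return json.dumps(asdict(self), indent=2)


# Every name, version, and query below is made up for illustration.
workflow = Workflow("Example analysis")
workflow.add(Step(
    name="retrieve",
    software="archive-client",
    version="1.0",
    inputs=["archive query: target='example-object'"],
    outputs=["raw_images/"],
    parameters={"query_date": "2013-05-09"},
))
workflow.add(Step("calibrate", "calibrate.py", "0.3",
                  ["raw_images/"], ["calibrated_images/"]))
print(workflow.sketch())
```

The plain-text sketch is meant for human readers of the deposit; the JSON form keeps the parameters and versions that the one-line summary leaves out.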
Whether your "data" include tables, spreadsheets, images, graphs, databases and/or code, you should make as much of it as possible available with any paper that presents it. If it’s practical and helpful, share your data as early as possible in your research workflow: as soon as you are done with the analysis, even before you write any article(s) about it. Your data can even be cited before (or without) its inclusion in a paper (see Rule 7). Many journals now offer standard ways to contribute data to their archives and link it to your paper, often with a persistent identifier. Whenever possible, embed citations (links) to your data and code, each with its own persistent identifier, right into the text of your paper, just as you would reference other literature. If a journal hosting your paper doesn't offer a place for your data, and/or an identifier for it, use a repository (Rule 8) and get your own identifier (Rule 2). At a minimum, you can post, and refer to, a package of files (data, codes, documentation on parameters, metadata, license information, and/or lists of links to such) with a persistent online identifier (Rule 2). And, if your domain’s journals’ policies do not allow for good data-literature interlinking, try to effect change (see Rules 1 and 10).
Did you write any code to run your analysis? No matter how buggy and insignificant you may find it, publish it. Many easy-to-use source code repositories exist, which not only host software but also facilitate collaboration and version tracking (see Appendix D). Your code, even the shortest script (whether or not you are proud of its quality), can be an important component for understanding your data and how you got your results (Barnes 2010). Software plays several roles in relation to data and scientific research, and norms around its publication are still evolving and differ across disciplines (Shamir 2013). In some cases, software is the primary research product (e.g., new algorithms). In other cases, data are the primary research products, yet the best way to document their provenance is to publish the software that was used to generate them as "metadata." In both cases, publishing the source code and its version history is crucial to enhance transparency and reproducibility. The use of open source software when possible reduces barriers for subsequent users of your software-related data products (Prlić 2012). The same best practices discussed above in relation to data and workflow also apply to software materials: cite the software that you use and provide unique, persistent identifiers (Rule 2) to the code you share.
Chances are that you want to get credit for what you share. The attribution system used for scholarly articles, accomplished via citations, often breaks in the case of data and software. When other authors reuse or cite your data or code, you may get an acknowledgement or an incoming link. If you and your colleagues have gone to the trouble to write a "data paper," whose main purpose is to describe your data and/or code, you may also get a citation (Chavan 2011). But "data paper" writing is not always desirable, or relevant. So, how do you go about getting the full credit you deserve for your data and code? The best way is to simply describe your expectations on how you would like to be acknowledged. If you want, you can also release your data under a license and indicate explicitly in the paper or in the metadata how you want others to give you credit. But, while legal mechanisms have advantages, they can also inadvertently lead to limitations on the reuse of the data you are sharing. In any case, make information about you (e.g. name, institution), about the data and/or code (e.g. origin, version, associated files and metadata), and about exactly how you would like to get credit, as clear as possible. Easy-to-implement licenses, many of which offer the advantage of being machine-readable, are offered by the Creative Commons organization, as are other similar options, such as those offered by Open Data Commons. Appendix G provides more information.
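One way to make credit expectations "as clear as possible" is to ship a small machine-readable record with the data. The sketch below is only an illustration: the `DataCredit` class, its field names, and the citation layout are assumptions for this example, not a community standard, and the creators, title, and identifier are placeholders.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class DataCredit:
    """Who made the data, which version this is, and how to give credit."""
    creators: List[str]
    title: str
    version: str
    year: int
    identifier: str          # persistent identifier (placeholder in the example)
    license: str             # e.g. a Creative Commons license name or code
    preferred_citation: str = ""

    def citation(self) -> str:
        """Return the preferred citation, or build one from the other fields."""
        if self.preferred_citation:
            return self.preferred_citation
        names = "; ".join(self.creators)
        return (f"{names} ({self.year}). {self.title}, "
                f"version {self.version}. {self.identifier}")

    def to_json(self) -> str:
        """Serialize the record for inclusion in the deposited package."""
        record = asdict(self)
        record["citation"] = self.citation()
        return json.dumps(record, indent=2)


# All values below are placeholders, not a real dataset.
credit = DataCredit(
    creators=["Researcher, A.", "Colleague, B."],
    title="Example Survey Catalog",
    version="1.2",
    year=2013,
    identifier="doi:10.0000/example",
    license="CC-BY-4.0",
)
print(credit.citation())
```

Stating the preferred citation explicitly, rather than leaving others to assemble one, is the point of the exercise; the exact file format matters much less.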
Sometimes the hardest and most time-consuming step of sharing data and code is finding and deciding where to put them. Data-sharing practices vary widely across disciplines: in some fields data sharing and reuse are essential and commonplace, while in others data sharing is a "gift exchange" culture (Wallis 2013). If your community already has a standard repository, use it. If you don't know where to start looking, or you need help choosing amongst relevant repositories, ask an information specialist, such as a data scientist or a librarian working in your field (and consult the directories of data repositories listed in Appendix B). When choosing amongst repositories, try to find the one offering the best combination of ease-of-deposit, community uptake, accessibility, discoverability, value-added curation, preservation infrastructure, organizational persistence, and support for the data formats and standards you use. Remember that even if your field has no domain-based repository, your institution may have one, and your local librarian or archivist can instruct you on how to use that local resource. If neither your community nor your institution has a relevant repository, try a generic repository or consider setting up your own (see Rule 2, and Appendix F).
Whether you do it in person at scientific meetings and conferences or by written communication when reviewing papers and grants, reward your colleagues who share data and code. Rally your colleagues and engage your community by providing feedback on the quality of the data assets in your field. Praise those following the best practices. The more accessible your colleagues' data are, as an organized collection of some sort, the better your community’s research capacity. The more data get shared, used, and cited, the more they improve. Besides personal involvement and encouragement, the best way to reward data sharing is by attribution: always cite the sources of data that you use. Follow good scientific practice and give credit to those whose data you use, following their preferred reference format and according to current best practices. Standards and practices for citing and attributing data sources are actively being developed through international partnerships (Uhlir 2012, FORCE11 2013).
As Rule 1 says, it is important not just that you love your own data, but that others love data too. An attitude that data and code are "2nd class objects," behind traditional scholarly publications, is still prevalent. But every day, as scientists try to use the frustrating but tantalizing hodgepodge of research data available via the present ad-hoc network of online systems, the value of organizing an open network of re-usable data and code is becoming more and more clear, to more and more people. You, as a scientist, need to help organize your discipline and your institution to move more quickly to a world of open, discoverable, reproducible data and research. One important step is to advocate for hiring data specialists and for the overall support of institutional programs that improve data sharing. Make sure that not only advanced researchers (e.g., postdocs) experience the pleasures of doing research with freely available data and tools: explain and show the value of well-loved data to graduate and undergraduate researchers. Teach whole courses, or mini-courses, related to caring for data and software, or incorporate the ideas into existing courses. Form groups specific to your discipline to foster data and code sharing. Hold birds-of-a-feather or special sessions during large meetings demonstrating examples where good sharing practices have led to better results and collaborations. Lead by practicing what you preach.
This article was written collaboratively, online, in the open, using Authorea. Every iteration of the writing procedure is logged and available in the online version of this article at authorea.com/3410. This article is an outcome of an Exploratory Seminar called What to Keep and How to Analyze It: Data Curation and Data Analysis with Multiple Phases, organized by Xiao-Li Meng and Alyssa Goodman, held on May 9-10, 2013 at the Radcliffe Institute for Advanced Study, Harvard University, Cambridge, Mass.
All the authors participated in discussions at the exploratory seminar which led to the preparation of this article. A. Goodman and A. Pepe wrote the bulk of the article. C. Borgman, M. Crosas, K. Cranmer, R. Di Stefano, P. Groth, Y. Gil, M. Hedstrom, A. Mahabal, and A. Slavkovic contributed to the article with substantial edits. All the authors provided comments on the various stages of the article.
G. Galilei. Sidereus nuncius. (1610).
G. Galilei. The Assayer, as Translated by Stillman Drake (1957). (1618).
S. Drake. Discoveries and Opinions of Galileo: Including The Starry Messenger (1610), Letter to the Grand Duchess Christina (1615), and Excerpts from Letters on Sunspots (1613), The Assayer (1623). (1957).
J. Holdren. Increasing Public Access to the Results of Scientific Research. Memorandum of the US Office of Science and Technology, 22 February 2013 (2013).
J. D. Wren. URL decay in MEDLINE–a 4-year follow-up study. Bioinformatics 24, 1381-1385 (2008).
Nick Barnes. Publish your computer code: it is good enough. Nature 467, 753-753 (2010).
Lior Shamir, John F. Wallin, Alice Allen, Bruce Berriman, Peter Teuben, Robert J. Nemiroff, Jessica Mink, Robert J. Hanisch, Kimberly DuPrie. Practices in source code sharing in astrophysics. Astronomy and Computing 1, 54-58 (2013).
Andreas Prlić, James B. Procter. Ten Simple Rules for the Open Development of Scientific Software. PLoS Computational Biology 8, e1002802 (2012).
Vishwas Chavan, Lyubomir Penev. The data paper: a mechanism to incentivize data publishing in biodiversity science. BMC Bioinformatics 12, S2 (2011).
Jillian C. Wallis, Elizabeth Rolando, Christine L. Borgman. If We Share Data, Will Anyone Use Them? Data Sharing and Reuse in the Long Tail of Science and Technology. PLoS ONE 8, e67332 (2013).
Paul E. Uhlir. For Attribution – Developing Data Attribution and Citation Practices and Standards: Summary of an International Workshop. (2012).
FORCE11. Amsterdam Manifesto on Data Citation Principles. (2013).