Authorea

Alyssa Goodman edited Rule 3. Data reuse in mind.md over 10 years ago

Commit id: 01c436be39aa9ab5bf16859ffb433603a4935bb3

# Rule 3. Conduct science with a particular level of reuse in mind.

Data from others are hard to use without context describing what the data are and how they were obtained. Information _provenance_ refers to the sum of all of the processes, people (institutions or agents), and documents (data included!) that were involved in generating or otherwise influencing or delivering a piece of information ([W3C Provenance Group](http://www.w3.org/TR/2013/REC-prov-dm-20130430/#dfn-provenance)). Perfect documentation of provenance is rarely, if ever, attained in scientific work today. The higher the quality of provenance information, the higher the chance of enabling data reuse.

In general, data reuse is most feasible when three things are all provided: 1) data; 2) metadata (information describing the data); and 3) information about the process of generating those data, such as code. In trying to follow the Rules listed in this article, you will do best if you plan in advance for ways to provide all three kinds of information.

**In carrying out your work, consider what level of reuse you realistically expect and plan accordingly.** Do you want your work to be fully _reproducible_? If so, then provenance information is a must (e.g., working pipeline analysis code, a platform to run it on, and verifiable versions of the data). Or do you just want your work to be _inspectable_? If so, then intermediate data products and pseudo-code may be sufficient. Or maybe your goal is that your data are _usable_ in a wide range of applications? If so, **consider adopting standard formats and metadata standards early on**. At the very least, **keep careful track of versions of data and code**, with associated dates. Taking these steps now, as you start and carry out current projects, will earn you the thanks of researchers, including you, looking back from the future.
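As one possible illustration of keeping track of versions of data and code with associated dates, the sketch below builds a small provenance record in Python. It is a minimal example under stated assumptions, not a prescribed method: the function names `provenance_record` and `matches`, the record fields, and the sample data are all hypothetical, and a SHA-256 checksum is used here simply as one common way to make a data version verifiable.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(data: bytes, code_version: str, description: str) -> dict:
    """Build a small provenance record for one version of a dataset.

    All field names are illustrative assumptions, not a metadata standard.
    """
    return {
        # Metadata: a human-readable note on what the data are.
        "description": description,
        # Checksum that identifies this exact version of the data.
        "sha256": hashlib.sha256(data).hexdigest(),
        # Version of the code that produced or processed the data,
        # for example a release tag or a commit id.
        "code_version": code_version,
        # Date and time (UTC) at which the record was made.
        "recorded_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }


def matches(data: bytes, record: dict) -> bool:
    """Return True if data are the exact version the record describes."""
    return hashlib.sha256(data).hexdigest() == record["sha256"]


if __name__ == "__main__":
    # Made-up example data; in practice these bytes would be read from a file.
    raw = b"time,flux\n0.0,1.02\n0.5,0.98\n"
    record = provenance_record(raw, "v0.1.0", "Hypothetical light curve")
    print(json.dumps(record, indent=2))
    print(matches(raw, record))                  # unchanged data
    print(matches(raw + b"1.0,1.01\n", record))  # data with an added row
```

Storing a record like this next to each data file gives a later reader, your future self included, a way to confirm that the data in hand are the version the analysis actually used.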