Authorea

Alberto Pepe slightly changed rule title as requested by reviewer 2 over 10 years ago

Commit id: 2136651f05b1b3db138af2e252642db5ba643e45

# Rule 3. Conduct science with a particular level of reuse in mind.

Data from others are hard to use without context describing what the data are and how they were obtained. Information **provenance** refers to the sum of all of the processes, people (institutions or agents), and documents (data included!) that were involved in generating or otherwise influencing or delivering a piece of information ([W3C Provenance Group](http://www.w3.org/TR/2013/REC-prov-dm-20130430/#dfn-provenance)). Perfect documentation of provenance is rarely, if ever, attained in scientific work today. The higher the quality of provenance information, the higher the chance of enabling data reuse. In general, data reuse is most feasible when three things are all provided: 1) data; 2) metadata (information describing the data); and 3) information about the process of generating those data, such as code. In trying to follow the Rules listed in this article, you will do best if you plan in advance for ways to provide all three kinds of information.

In carrying out your work, consider what level of reuse you realistically expect and plan accordingly. Do you want your work to be fully **reproducible**? If so, then provenance information is a must (e.g., working pipeline analysis code, a platform to run it on, and verifiable versions of the data). Or do you just want your work to be **inspectable**? If so, then intermediate data products and pseudo-code may be sufficient. Or maybe your goal is that your data are **usable** in a wide range of applications? If so, consider adopting standard formats and metadata standards early on. At the very least, keep careful track of versions of data and code, with associated dates; those looking back from the future will appreciate it.
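As one illustration of keeping track of data and code versions with associated dates, the following is a minimal sketch that writes a small provenance record next to a data file. It is not a method prescribed by this article or by the W3C provenance model: the function names (`sha256_of`, `provenance_record`, `write_sidecar`), the record fields, and the `.prov.json` suffix are all hypothetical choices made for this example, and the code version is assumed to be a string you supply yourself (for instance a release tag or commit identifier).

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(data_path, code_version, description):
    """Build a small provenance record for one data file.

    The field names are illustrative assumptions, not a metadata standard.
    """
    path = Path(data_path)
    return {
        "file": path.name,
        "sha256": sha256_of(path),          # verifiable version of the data
        "size_bytes": path.stat().st_size,
        "code_version": code_version,       # e.g. a tag or commit id you supply
        "description": description,         # free-text metadata
        "recorded_utc": datetime.now(timezone.utc).isoformat(),
    }


def write_sidecar(data_path, record):
    """Write the record next to the data file as <name>.prov.json."""
    sidecar = Path(str(data_path) + ".prov.json")
    sidecar.write_text(json.dumps(record, indent=2, sort_keys=True))
    return sidecar
```

Calling `write_sidecar(path, provenance_record(path, "v1.0", "raw survey table"))` would leave a dated, checksummed note beside the file. A checksum plus a code version is only a small part of provenance, but it lets a later reader confirm that the data in hand match the data that were described.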