Authorea

Alberto Pepe edited Rule 3. Data reuse in mind.md almost 11 years ago

Commit id: e33a9736274060c463eba00300d2936b4b8bb9bf


# Rule 3. Conduct science with data reuse in mind.

Information "provenance" is the sum of all of the processes, people (or institutions or other agents), and documents (data included!) that were involved in generating or otherwise influencing or delivering a piece of information ([W3C Provenance Group](http://www.w3.org/2005/Incubator/prov/wiki/What\_Is\_Provenance\#A\_Working\_Definition\_of\_Provenance)). Perfect documentation of provenance is rarely, if ever, attained in scientific work today. The higher the quality of provenance information, the higher the chance of enabling data re-use. In general, data re-use is most possible when: 1) data; 2) metadata (information describing the data); and 3) information about the process of generating those data, such as code, are all provided. In trying to follow the Rules listed in this article, you will do best if you plan in advance for ways to provide all three kinds of information.

In carrying out your work, however, you can consider what level of re-use you realistically expect, and plan accordingly. Do you want your work to be fully **reproducible**? If so, then full, formalized provenance information is a must (e.g. working pipeline analysis code, a machine to run it on, and raw data). Do you just want your work to be **inspectable**? If so, then intermediate data products and/or schematic code and pseudo-code may be sufficient. Do you want your data to be **usable** in a wide range of applications? If so, consider adopting standard formats and metadata standards early on.
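
As a rough illustration of planning for all three kinds of information, the sketch below writes a small provenance "sidecar" file next to a data file. The field names, the `provenance_record` helper, the file names, and the example version string are hypothetical choices made for this sketch; they do not follow any particular metadata standard.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(data_path: Path, description: str,
                      code_version: str, steps: List[str]) -> dict:
    """Bundle data, metadata, and process information in one record.

    The layout is a hypothetical example, not a standard schema.
    """
    return {
        # 1) the data: which file, and a checksum identifying its exact version
        "data": {"file": data_path.name, "sha256": sha256_of(data_path)},
        # 2) metadata: information describing the data
        "metadata": {"description": description},
        # 3) process: how the data were generated
        "process": {"code_version": code_version, "steps": steps},
        # date of the record, in UTC
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # A made-up data file, used only to demonstrate the helper.
        data_file = Path(tmp) / "measurements.csv"
        data_file.write_text("id,value\n1,3.2\n2,4.1\n")

        record = provenance_record(
            data_file,
            description="Example measurements (illustrative only)",
            code_version="analysis-script v0.1 (hypothetical)",
            steps=["read raw instrument output", "drop incomplete rows"],
        )
        sidecar = data_file.parent / (data_file.name + ".provenance.json")
        sidecar.write_text(json.dumps(record, indent=2))
        print(sidecar.read_text())
```

A record like this is far from perfect provenance, but the checksum and timestamp make it possible to tell later which version of the data a given result came from.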
At the very least, keeping careful track of versions of data and code, with associated dates, will be appreciated by those looking back from the future. Future use of data is always affected by the level of detail available about the assumptions that were made in collecting and processing it. In planning exactly which data, and at what level of reduction, to share, consider cost, privacy, statistical efficiency, and simplicity. Keep in mind that data reduction and summary statistics always limit the scope of future analysis. For example, the mean and standard deviation are sufficient statistics for a normal distribution, but not for a more general statistical model. When encryption and statistical methods are applied to reduce the disclosure risk of sensitive information, use of the redacted data can even lead to false inference (for example: \cite{lexander_Davern_Stevenson_2010}). These tradeoffs are unavoidable, so think about keeping and providing data products at multiple stages in the processing spectrum if possible.
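
The point about summary statistics can be made concrete with a small sketch. The two samples below are made up for illustration and are not taken from any study; `summarize` is a hypothetical helper that reduces a sample to its mean and population standard deviation.

```python
import statistics


def summarize(sample):
    """Reduce a sample to its mean and population standard deviation."""
    return statistics.mean(sample), statistics.pstdev(sample)


# Two illustrative samples (invented for this sketch).
spread_out = [2, 4, 4, 4, 5, 5, 7, 9]
two_clusters = [3, 3, 3, 3, 7, 7, 7, 7]

for name, sample in [("spread_out", spread_out), ("two_clusters", two_clusters)]:
    mean, std = summarize(sample)
    # The median and maximum are only recoverable from the full sample.
    print(f"{name}: mean={mean}, std={std}, "
          f"median={statistics.median(sample)}, max={max(sample)}")
```

Both samples reduce to a mean of 5 and a standard deviation of 2, yet their medians and maxima differ. Someone who received only the two summary numbers could not tell the samples apart, let alone fit a model that depends on their shape.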