# Rule 4. Publish workflow as context.

Traditionally, what computer and information scientists call "workflow" has been captured in the "methods" and/or "analysis" section(s) of a scholarly article, where data collection, manipulation, and analysis processes are described. Today, nearly every study uses computer software to carry out the bulk of its workflow, but rarely is the end-to-end process captured in just one software package. Thus, while directly publishing code is critical (see Rule 6), publishing a description of your processing steps offers essential context for interpreting and re-using data.

In the future, the most useful workflow documentation will be part of a provenance record that links together all the pieces that led to a result: the data citation (Rule 2), the pointer to the code source (Rule 6), the workflow (this Rule), and the intermediate data that someone else would need to see to understand what you did. Systems that document workflow in a way that can plug into provenance visions like this one are best, so keep an eye out for such systems in your field.

Web services that encapsulate workflow are a good way to reduce the burden of software overhead and dependencies. In the life sciences, systems like Taverna and Kepler are good examples (xxrefsxx). Other standardized workflow documentation systems are offered by "notebooks" within some software packages, such as the Mathematica and IPython notebooks (xxrefsxx). Systems that offer hdl and doi identifiers for data (see Rule 2) can, and do, offer those identifiers for workflow files as well.

At a minimum, offer a simple sketch of the dataflow across the software, indicating how intermediate data and final results are generated, together with the parameter values used in the analysis. Keep in mind that even if the data used are not "new," in that they come from a well-documented archive, it is still important to document the archive query that produced the data you used, along with all the operations you performed on the data after they were retrieved.
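As one lightweight way to capture this minimum, a few lines of Python, written as they might appear in an IPython notebook cell, can serialize the query parameters and subsequent processing steps alongside the data products. The archive URL, parameter names, and step names below are hypothetical placeholders for illustration, not a prescribed format:

```python
import datetime
import json

# Record the archive query that produced the data (hypothetical service
# and parameters), not just the files it returned.
query = {
    "service": "https://archive.example.org/search",  # hypothetical archive
    "target": "M31",
    "radius_deg": 0.5,
    "retrieved_utc": datetime.datetime.now(datetime.timezone.utc).isoformat(),
}

# Record every operation applied after retrieval, in order, with the
# parameter values actually used (illustrative step names).
steps = [
    {"step": "sigma_clip", "sigma": 3.0},
    {"step": "rebin", "bin_width": 0.1},
    {"step": "fit_model", "model": "powerlaw", "initial_index": -2.0},
]

# Publish this sketch of the dataflow alongside the data themselves.
with open("workflow_context.json", "w") as handle:
    json.dump({"query": query, "processing_steps": steps}, handle, indent=2)
```

Even a plain-text record like this one gives a re-user the query and the parameter values needed to retrace the analysis.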
Just as in Rules 1 through 3, keeping better track of the workflow, as context, will likely benefit you and your collaborators enough to justify the loftier, more altruistic goals espoused here.
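For readers who want a more formal starting point for the linked provenance record described above, here is a minimal sketch using the W3C PROV data model via the third-party Python `prov` package (installable with `pip install prov`). All identifiers are hypothetical placeholders; the point is only to show one document tying data, code, and result together:

```python
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "https://example.org/provenance/")

# The pieces that led to a result: data (Rule 2), code (Rule 6),
# and the analysis run that connects them (this Rule).
doc.entity("ex:archival-dataset")   # the cited data product
doc.entity("ex:analysis-code")      # the published code
doc.entity("ex:final-result")       # e.g., a figure or table
doc.activity("ex:analysis-run")     # the workflow execution

# Link them: the run used the data and the code, and generated the result.
doc.used("ex:analysis-run", "ex:archival-dataset")
doc.used("ex:analysis-run", "ex:analysis-code")
doc.wasGeneratedBy("ex:final-result", "ex:analysis-run")

print(doc.serialize(indent=2))
```

Because the record is machine-readable, identifier systems (Rule 2) can point at a provenance file like this as readily as at the data it describes.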