When using script integration instead of shell commands, Snakemake automatically inserts an object giving access to all properties of the job (e.g. snakemake.output[0], see Fig. \ref{924627}c). This way, the boilerplate code for parsing command line arguments can be avoided.
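As a minimal sketch of such a script (the helper count_lines and its line-counting task are hypothetical and chosen only for illustration), the following Python file reads its paths from the injected snakemake object rather than from parsed command line arguments. The guard at the bottom is an assumption made here so that the file can also be imported outside of Snakemake.

```python
def count_lines(in_path, out_path):
    """Count the lines of in_path and write the number to out_path."""
    with open(in_path) as infile:
        n = sum(1 for _ in infile)
    with open(out_path, "w") as outfile:
        outfile.write(f"{n}\n")
    return n


# When executed via script integration, a global object named `snakemake`
# is available. Outside of Snakemake this branch is simply skipped.
if "snakemake" in globals():
    # No argument parsing needed: paths come directly from the job.
    count_lines(snakemake.input[0], snakemake.output[0])
```

The actual work is kept in an ordinary function, so that it can be reused or tested independently of the workflow.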
By replacing wildcards with concrete values, Snakemake turns any rule into a job which will be executed in order to generate the defined output files. Dependencies between jobs are implicit, and inferred automatically in the following way. For each input file of a job, Snakemake determines a rule that can generate it, for example by replacing wildcards again (ambiguity can be resolved by prioritization or constraining wildcards) yielding another job. Then Snakemake goes on recursively for the latter, until all input files of all jobs are either generated by another job or already present in the used storage (e.g. on disk).
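The recursive inference described above can be illustrated with a strongly simplified sketch. It is not Snakemake's actual implementation: the rule table, the rule name plot, the file names, and the helpers match and resolve are assumptions made for illustration, and ambiguity between rules is ignored by simply taking the first rule that matches.

```python
import re

# Hypothetical rules: name -> (output pattern, list of input patterns).
RULES = {
    "convert_to_pdf": ("{prefix}.pdf", ["{prefix}.svg"]),
    "plot": ("{prefix}.svg", ["{prefix}.csv"]),
}


def match(pattern, filename):
    """Return the wildcard values if filename matches pattern, else None.

    Simplification: each wildcard name may occur only once per pattern.
    """
    regex, pos = "", 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        regex += re.escape(pattern[pos:m.start()]) + f"(?P<{m.group(1)}>.+)"
        pos = m.end()
    regex += re.escape(pattern[pos:])
    found = re.fullmatch(regex, filename)
    return found.groupdict() if found else None


def resolve(target, existing, jobs=None):
    """Collect the jobs needed to create target, dependencies first."""
    jobs = [] if jobs is None else jobs
    if target in existing:  # already present in storage, nothing to do
        return jobs
    for name, (output, inputs) in RULES.items():
        wildcards = match(output, target)
        if wildcards is None:
            continue
        # Replace wildcards with concrete values, then recurse on inputs.
        for pattern in inputs:
            resolve(pattern.format(**wildcards), existing, jobs)
        if (name, target) not in jobs:
            jobs.append((name, target))
        return jobs
    raise ValueError(f"no rule can generate {target!r}")


print(resolve("plots/de.pdf", existing={"plots/de.csv"}))
# [('plot', 'plots/de.svg'), ('convert_to_pdf', 'plots/de.pdf')]
```

Requesting plots/de.pdf yields two jobs in execution order, because the required SVG file is neither present nor requested explicitly but can be generated by another rule.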
Readability
The workflow definition language of Snakemake is designed to allow maximum readability, which is crucial for transparency and adaptability. For natural language readability, the occurrence of known words is important. For example, the Dale-Chall readability formula derives a score from the fraction of potentially unknown words (which do not occur in a list of common words) among all words in a text \cite{chall_readability_1995}. For workflow definition languages, one has to additionally consider whether punctuation and operator usage are intuitively understandable. When analyzing the above example workflow (Fig. \ref{924627}a) under these aspects, code statements fall into seven categories:
- a natural language word, followed by a colon (e.g. input: and output:),
- the word "rule", followed by a name and a colon (e.g. rule convert_to_pdf:),
- a quoted filename pattern (e.g. "{prefix}.pdf"),
- a quoted shell command,
- a quoted wrapper identifier,
- a quoted container URL,
- a Python statement.
In addition, for each line, we can judge whether it needs
- domain knowledge (from the field analyzed in the given workflow),
- technical knowledge (e.g. about Unix-style shell commands or Python),
- Snakemake knowledge,
- general education (i.e. the line should be understandable for everybody).
In Fig. \ref{924627}, we hypothesize the required knowledge for readability of each code line. Most statements are understandable with either general education, domain, or technical knowledge. In particular, only five lines need specific Snakemake knowledge (Fig. \ref{924627}d). Below, we list the rationale of our assessment for each category:
- The natural language word is either understandable with general education (e.g. input: and output:) or technical knowledge (container: or conda:). The colon straightforwardly shows that the content follows next. Only for the wrapper directive (wrapper:) one needs to have the Snakemake specific knowledge that it is possible to refer to publicly available tool wrappers.
- The word rule is understandable with general education, and when carefully choosing rule names, at most domain knowledge is needed for understanding such statements.
- Filename patterns can mostly be understood with domain knowledge, since the file extensions should tell the expert what kind of content will be used or created. We hypothesize that wildcard definitions (e.g. {country}) are straightforwardly understandable as placeholders.
- Shell commands will usually need domain and technical knowledge to be understood.
- Wrapper identifiers can be understood with Snakemake knowledge only, since one needs to know about the central tool wrapper repository of Snakemake. Nevertheless, with only domain knowledge one can at least conclude that the mentioned tool (last part of the wrapper ID) will be used in the wrapper.
- A container URL will usually be understandable with technical knowledge.
- Python statements will need either technical knowledge or Snakemake knowledge (the latter when using the Snakemake API, as is the case here with the expand command, which allows aggregating over a combination of wildcard values).
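To illustrate what aggregating over a combination of wildcard values means, the following is a simplified stand-in written for this purpose. It is not Snakemake's own expand implementation, and the function name expand_sketch, the pattern, and the wildcard values are hypothetical.

```python
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Fill pattern with every combination of the given wildcard values."""
    names = list(wildcards)
    combinations = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combinations]


print(expand_sketch("plots/{country}.{ext}", country=["de", "fr"], ext=["pdf", "svg"]))
# ['plots/de.pdf', 'plots/de.svg', 'plots/fr.pdf', 'plots/fr.svg']
```

A single pattern thus stands for a whole list of concrete files, which is what allows one rule to collect the results of many jobs.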
While this example is obviously not as complex as real-world data analyses, the ratio between lines that require Snakemake knowledge and lines that are readable with general education, domain or technical knowledge can be expected to stay roughly the same. Since Snakemake supports modularization of workflow definitions, it is moreover possible to hide away more technical parts of the workflow definition (e.g. helper functions or variables), so that the reader is not distracted from understanding the main steps of the data analysis.
Since dependencies between jobs are implicitly encoded via matching filename patterns, we hypothesize that in general no specific technical knowledge is necessary to understand the connections between the rules. Instead, it should be quite intuitive to conclude that an input of one rule that reoccurs as an output of another reflects a dependency.