Chapter 5 Target Rules
Learning Goals:
- How to build several parallel tasks at once
5.1 Where are we now?
If you still have the hello_world rule in your Snakefile, now is a good moment to remove it.
Then, your Snakefile should look something like this:
## Snakemake - MRW Replication
##
## @yourname

# --- OLS Rules --- #

rule solow_intermediate:
    input:
        script = "src/analysis/estimate_ols_model.R",
        data = "out/data/mrw_complete.csv",
        model = "src/model-specs/model_solow.json",
        subset = "src/data-specs/subset_intermediate.json"
    output:
        estimate = "out/analysis/model_solow_subset_intermediate.rds",
    shell:
        "Rscript {input.script} \
            --data {input.data} \
            --model {input.model} \
            --subset {input.subset} \
            --out {output.estimate}"

rule solow_nonoil:
    input:
        script = "src/analysis/estimate_ols_model.R",
        data = "out/data/mrw_complete.csv",
        model = "src/model-specs/model_solow.json",
        subset = "src/data-specs/subset_nonoil.json"
    output:
        estimate = "out/analysis/model_solow_subset_nonoil.rds",
    shell:
        "Rscript {input.script} \
            --data {input.data} \
            --model {input.model} \
            --subset {input.subset} \
            --out {output.estimate}"

rule solow_oecd:
    input:
        script = "src/analysis/estimate_ols_model.R",
        data = "out/data/mrw_complete.csv",
        model = "src/model-specs/model_solow.json",
        subset = "src/data-specs/subset_oecd.json"
    output:
        estimate = "out/analysis/model_solow_subset_oecd.rds",
    shell:
        "Rscript {input.script} \
            --data {input.data} \
            --model {input.model} \
            --subset {input.subset} \
            --out {output.estimate}"

# --- Data Management --- #

rule gen_regression_vars:
    input:
        script = "src/data-management/gen_reg_vars.R",
        data = "out/data/mrw_renamed.csv"
    output:
        data = "out/data/mrw_complete.csv"
    shell:
        "Rscript {input.script} \
            --data {input.data} \
            --out {output.data}"

rule rename_vars:
    input:
        script = "src/data-management/rename_variables.R",
        data = "src/data/mrw.dta"
    output:
        data = "out/data/mrw_renamed.csv"
    shell:
        "Rscript {input.script} \
            --data {input.data} \
            --out {output.data}"

# --- Clean Rules --- #

rule clean:
    shell:
        "rm -rf out/*"

rule clean_data:
    shell:
        "rm -rf out/data/*"

rule clean_analysis:
    shell:
        "rm -rf out/analysis/*"
5.2 How Snakemake determines the build order when multiple rules are present
Now that we have worked with multiple rules and seen how executing one rule can trigger another, let us try to understand the principle behind this.
As we know, we can execute any rule in Snakefile explicitly by calling it by name:
snakemake --cores 1 solow_intermediate
When no rule name is explicitly given, Snakemake will execute the first rule it encounters at the top of Snakefile:
snakemake --cores 1
If you followed the order of rules in the solution to the previous exercise, solow_intermediate is the first rule in Snakefile.
In this case, both the explicit and implicit commands are equivalent.
You can verify this by cleaning your project before executing either or by adding --summary to either command.
The rule we ask Snakemake to execute, either explicitly or implicitly, is called a target rule.
Snakemake focuses on executing this rule.
When all necessary inputs to build the target rule exist, Snakemake will simply execute the rule and build the defined outputs.
When Snakemake recognizes that a necessary input to execute the target rule is missing, Snakemake will try to build it through the other rules present in Snakefile.
To illustrate this, let us stick to our present project and assume the target rule is solow_intermediate.
The following graph shows all rules which lead up to the target rule:
rename_vars -> gen_regression_vars -> solow_intermediate
mrw_complete.csv is a necessary input to solow_intermediate.
If the file does not exist, Snakemake will search through Snakefile for another rule which has mrw_complete.csv as its output, in our case gen_regression_vars.
In this case, Snakemake will first execute gen_regression_vars before executing the target rule solow_intermediate.
If necessary inputs are missing to execute gen_regression_vars, Snakemake will search for other rules to produce them and so forth…
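The backward search described above can be sketched in a few lines of Python. This is only a toy model under simplifying assumptions, not Snakemake's actual implementation: the names RULES and build_order are made up for illustration, paths are shortened to file names, and only the data files are listed as inputs (scripts and specification files are left out).

```python
# Toy model of the backward search; NOT Snakemake's real algorithm.
# RULES and build_order are hypothetical names used for illustration only.
RULES = {
    "rename_vars": {"inputs": ["mrw.dta"], "outputs": ["mrw_renamed.csv"]},
    "gen_regression_vars": {
        "inputs": ["mrw_renamed.csv"],
        "outputs": ["mrw_complete.csv"],
    },
    "solow_intermediate": {
        "inputs": ["mrw_complete.csv"],
        "outputs": ["model_solow_subset_intermediate.rds"],
    },
}


def build_order(target, existing_files, rules=RULES):
    """Return the rules to run, in order, so that `target` can be executed."""
    order = []

    def visit(rule_name):
        for needed in rules[rule_name]["inputs"]:
            if needed in existing_files:
                continue  # the input is already there, nothing to build
            # Look for a rule that lists the missing file as one of its outputs.
            producers = [
                name for name, spec in rules.items() if needed in spec["outputs"]
            ]
            if not producers:
                raise FileNotFoundError(f"no rule produces {needed}")
            if producers[0] not in order:
                visit(producers[0])
        order.append(rule_name)

    visit(target)
    return order


# Only the raw data exists: every upstream rule has to run first.
print(build_order("solow_intermediate", {"mrw.dta"}))
# mrw_complete.csv already exists: the target rule can run on its own.
print(build_order("solow_intermediate", {"mrw.dta", "mrw_complete.csv"}))
```

The first call lists rename_vars, gen_regression_vars and solow_intermediate in that order; the second lists only solow_intermediate, because nothing is missing.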
To allow Snakemake to work properly, the input and output relationships in a project need to follow a directed acyclic graph (DAG). Directed here means that every dependency has a direction: it points from the rule which creates a file to the rule which uses that file as an input. Acyclic implies that there can be no two rules which require output from one another as inputs, either directly or in a larger loop.
This has a few implications for how we should define our intermediate input and output files. The following best practices help to prevent problems:
- No rule should have the same file as input and output. It is common practice to load a dataset, perform operations on it and overwrite the original file. While there are good reasons to never do this in any workflow, this behavior can lead to more severe problems when using Snakemake. Snakemake only searches for rules which can create input files which do not exist. If a rule overwrites an already existing input file, Snakemake would not recognize it as a dependency and simply ignore it. This implies that a project might run through cleanly and produce incorrect results without this being detected. In our workflow, we prevented such behavior by making sure the gen_regression_vars rule writes its results to a new file.
- No two rules should have the same output. When two rules produce the same file, it is ambiguous which of them Snakemake should run to create a missing input. To make sure our input and output relationships are explicit and reproducible, we do not want the order of the rules in a Snakefile to determine which rule is actually executed.
- A project should have a clear direction. In practice this typically starts with data manipulations which are necessary to perform analysis, which in turn will be used to create plots and tables, which finally end up in a paper or set of slides. In such a typical workflow, directedness implies for example that a regression table is not an input in the data manipulation steps.
- A project cannot be circular. Any circularity would let Snakemake search for rules in an infinite loop.
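As a rough illustration, the practices above can be turned into a small checker. The sketch below is plain Python and not part of Snakemake; the function name check_rules and the dictionary layout (rule name mapped to lists of inputs and outputs) are assumptions made for this example only.

```python
# Hypothetical checker for the best practices above; not a Snakemake feature.
def check_rules(rules):
    """Return a list of human-readable problems found in `rules`."""
    problems = []
    producer = {}  # output file -> name of the first rule that creates it

    for name, spec in rules.items():
        # A rule must not read and write the same file.
        for path in sorted(set(spec["inputs"]) & set(spec["outputs"])):
            problems.append(f"{name} uses {path} as input and output")
        # Every output file should belong to exactly one rule.
        for path in spec["outputs"]:
            if path in producer:
                problems.append(f"{producer[path]} and {name} both create {path}")
            else:
                producer[path] = name

    # Following inputs back to their producers must never lead back to the start.
    def reaches(start, current, seen):
        for path in rules[current]["inputs"]:
            upstream = producer.get(path)
            if upstream == start:
                return True
            if upstream is not None and upstream not in seen:
                seen.add(upstream)
                if reaches(start, upstream, seen):
                    return True
        return False

    for name in rules:
        if reaches(name, name, set()):
            problems.append(f"{name} is part of a circular dependency")
    return problems


workflow = {
    "rename_vars": {"inputs": ["mrw.dta"], "outputs": ["mrw_renamed.csv"]},
    "gen_regression_vars": {
        "inputs": ["mrw_renamed.csv"],
        "outputs": ["mrw_complete.csv"],
    },
}
print(check_rules(workflow))  # an empty list: no problems found

# A variant of gen_regression_vars that overwrites its own input.
overwriting = {
    "gen_regression_vars": {
        "inputs": ["mrw_renamed.csv"],
        "outputs": ["mrw_renamed.csv"],
    },
}
for problem in check_rules(overwriting):
    print(problem)
```

Note that the overwriting rule is reported twice, once for reusing the file and once as a circular dependency, since a rule which depends on its own output is the smallest possible loop.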
5.3 Dedicated target rules to execute multiple rules
Currently, our project has three rules which perform OLS regressions parallel to one another at the top of Snakefile.
As only one rule can be the target, it would require us to execute Snakemake three times, once for every OLS regression:
snakemake --cores 1 solow_intermediate
snakemake --cores 1 solow_nonoil
snakemake --cores 1 solow_oecd
While each command simplifies our lives a bit by also running all the necessary data manipulations automatically, it is still a bit silly to run Snakemake repeatedly. After all, it is our goal to make our workflow reproducible via a single line of code.
Now that we understand the concept of target rules, we will try to use them to our advantage.
As we know, Snakemake will execute any rule which can produce an output which a target rule requires as its input. Adding a new dedicated target rule which requires the outputs of our three Solow models will allow us to execute all three rules via a single line of code.
We add this rule at the top of Snakefile and name it solow_target.
The rule only has three inputs, one for the output of each Solow model, like so:
rule solow_target:
    input:
        intermediate = "out/analysis/model_solow_subset_intermediate.rds",
        nonoil = "out/analysis/model_solow_subset_nonoil.rds",
        oecd = "out/analysis/model_solow_subset_oecd.rds"
After saving Snakefile, we can inspect how Snakemake perceives the state of our project with the summary option:
snakemake --summary
output_file                                       date                      rule                 version  log-file(s)  status   plan
out/analysis/model_solow_subset_intermediate.rds  Thu Feb 18 14:18:20 2021  solow_intermediate   -                     ok       no update
out/analysis/model_solow_subset_nonoil.rds        -                         -                    -        -            missing  update pending
out/analysis/model_solow_subset_oecd.rds          -                         -                    -        -            missing  update pending
out/data/mrw_complete.csv                         Thu Feb 11 16:29:24 2021  gen_regression_vars  -                     ok       no update
out/data/mrw_renamed.csv                          Tue Feb 9 15:46:38 2021   rename_vars          -                     ok       no update
What does this mean?
Snakemake sees the three model_solow_* outputs as files it needs to create for the solow_target rule.
Out of these three files, the first, model_solow_subset_intermediate.rds, already exists and is up to date after our last run above.
The other two files, model_solow_subset_nonoil.rds and model_solow_subset_oecd.rds, do not yet exist and must be created.
The rule column presents the rule Snakemake found to create the respective output file.
The output therefore shows that a single run of solow_target would create all files in our project.
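The three inputs of solow_target differ only in the name of the data subset. Since a Snakefile may contain ordinary Python code, such a list of paths could be generated instead of typed out by hand. The following is a minimal sketch in plain Python; SUBSETS and solow_outputs are hypothetical names that do not appear in the project, while the path pattern is the one used by the OLS rules.

```python
# Hypothetical helper names; only the path pattern comes from the OLS rules.
SUBSETS = ["intermediate", "nonoil", "oecd"]


def solow_outputs(subsets):
    """Build the path of the estimate file for each data subset."""
    return [f"out/analysis/model_solow_subset_{name}.rds" for name in subsets]


for path in solow_outputs(SUBSETS):
    print(path)
```

A list like this could take the place of the three hand-written paths in the input part of a target rule. Whether that is worth it for three files is a matter of taste; it mainly pays off once the number of subsets grows.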
In the following exercise, you will practice executing the project with the help of our new target rule.
Exercise: Using target rules
It is time to put our new and shiny target rule to work and execute our full workflow with a single line of code.
- Delete the content of the out/ folder with the help of the clean rule.
- Verify that the output folder is empty.
- Execute the solow_target rule and build all outputs in one go.
Solution
- Delete the content of the out/ folder with the help of the clean rule.
snakemake --cores 1 clean
- Verify that the output folder is empty, for example by listing its contents:
ls out/
If nothing is listed, the output folder is indeed cleaned properly.
- Execute the solow_target rule and build all outputs in one go.
snakemake --cores 1 solow_target
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1       gen_regression_vars
    1       rename_vars
    1       solow_intermediate
    1       solow_nonoil
    1       solow_oecd
    1       solow_target
    6
[...]
[Thu Feb 18 14:38:11 2021]
localrule solow_target:
    input: out/analysis/model_solow_subset_intermediate.rds, out/analysis/model_solow_subset_nonoil.rds, out/analysis/model_solow_subset_oecd.rds
    jobid: 0
[Thu Feb 18 14:38:11 2021]
Finished job 0.
6 of 6 steps (100%) done
The output which Snakemake prints to the screen starts with the plan Snakemake develops to generate the input files of solow_target.
We see that it needs to run each of the other rules once.
Snakemake then plans to execute solow_target at the end.
The execution of the whole chain of rules sums up to six different tasks that Snakemake will execute.
The middle part of the output, which we omit here as [...], contains the messages which would be printed to the screen for each of the rules which are executed.
This contains information about Snakemake’s execution as well as the console output that R would print if we executed each of the R scripts in an appropriate IDE such as RStudio.
Finally, the last part prints Snakemake’s report about the final solow_target rule.
In accordance with the rule itself, it only features input files.
The bottom message contains information about the successful execution of all six rules with a completion rate of 100%. Snakemake is done.
5.4 Target rules can do more for us
As we know, the last execution of our target rule did nothing substantial, as the rule only includes the input part of the rule.
In practice, we can also use the target rule to perform other small tasks for us with the finished outputs.
- When compiling LaTeX files, it is often easier to have all LaTeX inputs in a separate folder. LaTeX is typically not very good with relative paths and likes to create many temporary files which we probably do not want to keep. To keep things tidy, we can therefore let the target rule clean up unwanted LaTeX temporary files and copy the output PDF to the main directory of our project for convenience.
- Some of us also use the target rule to copy the output PDF to a shared folder where colleagues can access it. When you share your work in a Dropbox folder, it can save a lot of time not to have to copy the output PDF there manually after each update of the draft.
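The housekeeping tasks above could be written as a small Python function, which a rule might call from a run: block instead of a shell: command. The sketch below is only one possible way to do this. Everything specific in it is an assumption made for illustration: the function name tidy_latex, the list of temporary file extensions, and the folder and file names in the demo do not come from the project.

```python
import shutil
import tempfile
from pathlib import Path

# Assumed set of LaTeX by-products; adjust to what your setup leaves behind.
TEMP_SUFFIXES = {".aux", ".log", ".out", ".toc"}


def tidy_latex(build_dir, pdf_name, destinations):
    """Delete LaTeX temporary files and copy the compiled PDF elsewhere."""
    build_dir = Path(build_dir)
    # Take a snapshot of the folder first, since we delete files while looping.
    for path in list(build_dir.iterdir()):
        if path.is_file() and path.suffix in TEMP_SUFFIXES:
            path.unlink()
    copied = []
    for destination in destinations:
        destination = Path(destination)
        destination.mkdir(parents=True, exist_ok=True)
        # shutil.copy returns the path of the newly created copy.
        copied.append(Path(shutil.copy(build_dir / pdf_name, destination)))
    return copied


# Demo in a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    build = Path(tmp) / "paper"
    build.mkdir()
    for name in ["paper.pdf", "paper.aux", "paper.log", "paper.tex"]:
        (build / name).write_text("placeholder")
    tidy_latex(build, "paper.pdf", [Path(tmp) / "shared"])
    print(sorted(p.name for p in build.iterdir()))  # temporary files are gone
```

Passing several destinations, for instance the project root and a shared folder, would cover both use cases from the list above in a single call.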