Chapter 5 Target Rules

Learning Goals:

  • How to build several parallel tasks at once

5.1 Where are we now?

If you still have the hello_world rule in your Snakefile, now is a good moment to remove it. Then, your Snakefile should look something like this:

## Snakemake - MRW Replication
##
## @yourname


# --- OLS Rules --- #

rule solow_intermediate:
    input:
        script   = "src/analysis/estimate_ols_model.R",
        data     = "out/data/mrw_complete.csv",
        model    = "src/model-specs/model_solow.json",
        subset   = "src/data-specs/subset_intermediate.json"
    output:
        estimate = "out/analysis/model_solow_subset_intermediate.rds",
    shell:
        "Rscript {input.script} \
            --data {input.data} \
            --model {input.model} \
            --subset {input.subset} \
            --out {output.estimate}"

rule solow_nonoil:
    input:
        script   = "src/analysis/estimate_ols_model.R",
        data     = "out/data/mrw_complete.csv",
        model    = "src/model-specs/model_solow.json",
        subset   = "src/data-specs/subset_nonoil.json"
    output:
        estimate = "out/analysis/model_solow_subset_nonoil.rds",
    shell:
        "Rscript {input.script} \
            --data {input.data} \
            --model {input.model} \
            --subset {input.subset} \
            --out {output.estimate}"

rule solow_oecd:
    input:
        script   = "src/analysis/estimate_ols_model.R",
        data     = "out/data/mrw_complete.csv",
        model    = "src/model-specs/model_solow.json",
        subset   = "src/data-specs/subset_oecd.json"
    output:
        estimate = "out/analysis/model_solow_subset_oecd.rds",
    shell:
        "Rscript {input.script} \
            --data {input.data} \
            --model {input.model} \
            --subset {input.subset} \
            --out {output.estimate}"


# --- Data Management --- #

rule gen_regression_vars:
    input:
        script = "src/data-management/gen_reg_vars.R",
        data   = "out/data/mrw_renamed.csv"
    output:
        data   = "out/data/mrw_complete.csv"
    shell:
        "Rscript {input.script} \
            --data {input.data} \
            --out {output.data}"

rule rename_vars:
    input:
        script = "src/data-management/rename_variables.R",
        data   = "src/data/mrw.dta"
    output:
        data = "out/data/mrw_renamed.csv"
    shell:
        "Rscript {input.script} \
            --data {input.data} \
            --out {output.data}"


# --- Clean Rules --- #

rule clean:
    shell:
        "rm -rf out/*"

rule clean_data:
    shell:
        "rm -rf out/data/*"

rule clean_analysis:
    shell:
        "rm -rf out/analysis/*"

5.2 How Snakemake determines the build order when multiple rules are present

Now that we have worked with multiple rules and seen how executing one rule can trigger another, let us try to understand the principle behind this.

As we know, we can execute any rule in Snakefile explicitly by calling it by name:

snakemake --cores 1 solow_intermediate

When no rule name is explicitly given, Snakemake will execute the first rule it encounters at the top of Snakefile:

snakemake --cores 1

If you followed the order of rules in the solution to the previous exercise, solow_intermediate is the first rule in Snakefile. In this case, both the explicit and implicit commands are equivalent. You can verify this by cleaning your project before executing either or by adding --summary to either command.

The rule we ask Snakemake to execute either explicitly or implicitly is called a target rule. Snakemake focusses on executing this rule.

When all necessary inputs to build the target rule exist, Snakemake will simply execute the rule and build the defined outputs.

When Snakemake recognizes that a necessary input to execute the target rule is missing, Snakemake will try to build it through the other rules present in Snakefile.

To illustrate this, let us stick to our present project and assume the target rule is solow_intermediate. The following graph shows all rules which lead up to the target rule:

mrw_complete.csv is a necessary input to solow_intermediate. If the file does not exist, Snakemake will search through Snakefile for another rule which has mrw_complete.csv as its output – in our case gen_regression_vars. In this case, Snakemake will first execute gen_regression_vars before executing the target rule solow_intermediate.

If inputs necessary to execute gen_regression_vars are missing, Snakemake will search for other rules to produce them, and so forth.
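The backward search described above can be sketched in a few lines of Python. This is a toy model, not Snakemake's actual implementation: the RULES dictionary and the plan function are hypothetical names, scripts and spec files are left out, and timestamps are ignored (real Snakemake also rebuilds outputs that are older than their inputs).

```python
# Toy model of three rules from this chapter: rule name -> (inputs, outputs).
# Assumption: only data files are tracked; scripts and JSON specs are omitted.
RULES = {
    "solow_intermediate": (
        ["out/data/mrw_complete.csv"],
        ["out/analysis/model_solow_subset_intermediate.rds"],
    ),
    "gen_regression_vars": (
        ["out/data/mrw_renamed.csv"],
        ["out/data/mrw_complete.csv"],
    ),
    "rename_vars": (
        ["src/data/mrw.dta"],
        ["out/data/mrw_renamed.csv"],
    ),
}


def plan(target, existing_files, rules=RULES):
    """Return the rules to run, in order, so that `target` can be executed."""
    order = []

    def visit(rule):
        inputs, _ = rules[rule]
        for path in inputs:
            if path in existing_files:
                continue  # the input is already there, nothing to do
            # Look for a rule that lists the missing file as an output.
            producers = [r for r, (_, outs) in rules.items() if path in outs]
            if not producers:
                raise FileNotFoundError(f"no rule produces {path}")
            visit(producers[0])
        if rule not in order:
            order.append(rule)

    visit(target)
    return order


# Only the raw data exists: the whole chain has to run before the target.
print(plan("solow_intermediate", {"src/data/mrw.dta"}))
# mrw_complete.csv already exists: the target rule runs on its own.
print(plan("solow_intermediate", {"src/data/mrw.dta", "out/data/mrw_complete.csv"}))
```

The first call lists rename_vars, then gen_regression_vars, then the target; the second lists the target only.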

To allow Snakemake to work properly, the input and output relationships in a project need to form a directed acyclic graph (DAG). Directed means that every dependency points in one direction, from the files a rule reads to the files it writes. Acyclic means that no two rules may require each other's output as input, either directly or through a longer loop.

This has a few implications for how we should define our intermediate input and output files. The following best practices help to prevent problems:

  • No rule should have the same file as input and output. It is common practice to load a dataset, perform operations on it and overwrite the original file. While there are good reasons never to do this in any workflow, it can cause more severe problems with Snakemake. Snakemake only searches for rules that can create input files which do not exist. If a rule overwrites an already existing input file, Snakemake does not recognize it as a dependency and simply ignores it. A project might therefore run through cleanly and produce incorrect results without anyone noticing. In our workflow, we prevented this by making sure the gen_regression_vars rule writes its results to a new file.
  • No two rules should have the same output. When two rules can create the same missing input file, Snakemake cannot tell which one we mean and will normally stop with an ambiguous-rule error unless we state a preference explicitly. To keep our input and output relationships explicit and reproducible, every file should be produced by exactly one rule, so that the order of the rules in the Snakefile never determines which rule is actually executed.
  • A project should have a clear direction. In practice, a project typically starts with the data manipulations that are necessary to perform the analysis, whose results are in turn used to create plots and tables, which finally end up in a paper or a set of slides. In such a workflow, directedness implies, for example, that a regression table is not an input to the data manipulation steps.
  • A project cannot be circular. Any circularity would let Snakemake search for rules in an infinite loop.
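The best practices above can be checked mechanically. The following Python sketch is only an illustration under simplifying assumptions: check_rules is a hypothetical helper, rules are given as a plain dictionary of input and output file lists, and the rules a and b in the second example are made up to show a cycle.

```python
def check_rules(rules):
    """Return a list of problems found in `rules` (rule -> (inputs, outputs))."""
    problems = []

    # 1. No rule should read and write the same file.
    for name, (inputs, outputs) in rules.items():
        for path in sorted(set(inputs) & set(outputs)):
            problems.append(f"{name}: {path} is both input and output")

    # 2. No two rules should have the same output.
    producer = {}
    for name, (_, outputs) in rules.items():
        for path in outputs:
            if path in producer:
                problems.append(f"{path} is produced by {producer[path]} and {name}")
            else:
                producer[path] = name

    # 3. No cycles: follow inputs back to the rules that produce them.
    def has_cycle(name, trail):
        if name in trail:
            return True  # we came back to a rule we already passed through
        inputs, _ = rules[name]
        upstream = {producer[p] for p in inputs if p in producer}
        return any(has_cycle(u, trail | {name}) for u in upstream)

    for name in rules:
        if has_cycle(name, frozenset()):
            problems.append(f"{name}: is part of or depends on a cycle")

    return problems


# The data management rules of this chapter pass all three checks.
good = {
    "gen_regression_vars": (["out/data/mrw_renamed.csv"], ["out/data/mrw_complete.csv"]),
    "rename_vars": (["src/data/mrw.dta"], ["out/data/mrw_renamed.csv"]),
}
# Two made-up rules that need each other's output form a cycle.
bad = {
    "a": (["y.csv"], ["x.csv"]),
    "b": (["x.csv"], ["y.csv"]),
}
print(check_rules(good))
print(check_rules(bad))
```

The first call returns an empty list; the second reports both made-up rules as belonging to a cycle.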

5.3 Dedicated target rules to execute multiple rules

Currently, our project has three rules which perform OLS regressions parallel to one another at the top of Snakefile. As only one rule can be the target, it would require us to execute Snakemake three times – once for every OLS regression:

snakemake --cores 1 solow_intermediate
snakemake --cores 1 solow_nonoil
snakemake --cores 1 solow_oecd

While each of these commands simplifies our lives a bit by also running all the necessary data manipulations automatically, it is still a bit silly to run Snakemake repeatedly. After all, our goal is to make our workflow reproducible via a single line of code.

Now that we understand the concept of target rules, we will use them to our advantage.

As we know, Snakemake will execute any rule which can produce an output which a target rule requires as its input. Adding a new dedicated target rule which requires the outputs of our three Solow models will allow us to execute all three rules via a single line of code.

We add this rule at the top of Snakefile and name it solow_target. The rule has only three inputs, one for the output of each Solow model, like so:

rule solow_target:
    input:
        intermediate = "out/analysis/model_solow_subset_intermediate.rds",
        nonoil       = "out/analysis/model_solow_subset_nonoil.rds",
        oecd         = "out/analysis/model_solow_subset_oecd.rds"

After saving Snakefile, we can inspect how Snakemake perceives the state of our project with the --summary option:

snakemake --summary
output_file date    rule    version log-file(s) status  plan
out/analysis/model_solow_subset_intermediate.rds    Thu Feb 18 14:18:20 2021    solow_intermediate  -       ok  no update
out/analysis/model_solow_subset_nonoil.rds  -   -   -   -   missing update pending
out/analysis/model_solow_subset_oecd.rds    -   -   -   -   missing update pending
out/data/mrw_complete.csv   Thu Feb 11 16:29:24 2021    gen_regression_vars -       ok  no update
out/data/mrw_renamed.csv    Tue Feb  9 15:46:38 2021    rename_vars -       ok  no update

What does this mean? Snakemake sees the three model_solow_* outputs as files it needs to create for the solow_target rule. Of these three files, the first, model_solow_subset_intermediate.rds, already exists and is up to date after our last run above. The other two files, model_solow_subset_nonoil.rds and model_solow_subset_oecd.rds, do not yet exist and must be created. For files that already exist, the rule column shows the rule that created the respective output file.
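The reasoning behind the status and plan columns can be mimicked with a small Python sketch. It is a simplification under stated assumptions: PRODUCED_BY and summarize are hypothetical names, only the existence of files is checked (Snakemake also compares timestamps), and unlike the real summary the sketch names the responsible rule even for missing files.

```python
# Output file -> rule that creates it, taken from the Snakefile above.
PRODUCED_BY = {
    "out/analysis/model_solow_subset_intermediate.rds": "solow_intermediate",
    "out/analysis/model_solow_subset_nonoil.rds": "solow_nonoil",
    "out/analysis/model_solow_subset_oecd.rds": "solow_oecd",
}


def summarize(target_inputs, existing_files):
    """Return (file, rule, status, plan) rows for the inputs of a target rule."""
    rows = []
    for path in target_inputs:
        if path in existing_files:
            rows.append((path, PRODUCED_BY[path], "ok", "no update"))
        else:
            rows.append((path, PRODUCED_BY[path], "missing", "update pending"))
    return rows


# State of the project as in the summary above: only the first model exists.
existing = {"out/analysis/model_solow_subset_intermediate.rds"}
rows = summarize(list(PRODUCED_BY), existing)
for row in rows:
    print("\t".join(row))
```

With this state, the first file is reported as ok and the other two as missing with an update pending, matching the three model rows of the summary.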

The output therefore shows that a single run of solow_target would create all files in our project. In the following exercise, you will practice executing the project with the help of our new target rule.

Exercise: Using target rules

It is time to put our new and shiny target rule to work and execute our full workflow with a single line of code.

  1. Delete the content of the out/ folder with the help of the clean rule.
  2. Verify that the output folder is empty.
  3. Execute the solow_target rule and build all outputs in a single swoop.

Solution

  1. Delete the content of the out/ folder with the help of the clean rule.
snakemake --cores 1 clean
  2. Verify that the output folder is empty.
ls out
Nothing to show here

This shows that the output folder is indeed cleaned properly.

  3. Execute the solow_target rule and build all outputs in a single swoop.
snakemake --cores 1 solow_target
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1   gen_regression_vars
    1   rename_vars
    1   solow_intermediate
    1   solow_nonoil
    1   solow_oecd
    1   solow_target
    6

[...]

[Thu Feb 18 14:38:11 2021]
localrule solow_target:
    input: out/analysis/model_solow_subset_intermediate.rds, out/analysis/model_solow_subset_nonoil.rds, out/analysis/model_solow_subset_oecd.rds
    jobid: 0

[Thu Feb 18 14:38:11 2021]
Finished job 0.
6 of 6 steps (100%) done

The output which Snakemake prints to the screen starts with the plan Snakemake develops to generate the input files of solow_target. We see that it needs to run each of the other rules once. Snakemake then plans to execute solow_target at the end. The whole chain of rules adds up to six different tasks that Snakemake will execute.

The middle part of the output, which we omit here as [...], contains the messages which are printed to the screen for each of the rules that are executed. These include information about Snakemake’s execution as well as the console output that R would print if we executed each of the R scripts in an appropriate IDE such as RStudio.

Finally, the last part shows Snakemake’s report on the final solow_target rule. In accordance with the rule itself, it only features input files.

The bottom message contains information about the successful execution of all six rules with a completion rate of 100%. Snakemake is done.

5.4 Target rules can do more for us

As we know, the last execution of our target rule did nothing substantial by itself, as the rule consists only of an input section.

In practice, we can also let the target rule perform small housekeeping tasks once the desired outputs exist.

  • When compiling LaTeX files, it is often easier to keep all LaTeX inputs in a separate folder. LaTeX is typically not very good with relative paths and likes to create many temporary files which we probably do not want to keep. To keep things tidy, the target rule can therefore clean up unwanted LaTeX temporary files and copy the output PDF to the main directory of our project for convenience.
  • Some of us also use the target rule to copy the output PDF to a shared folder where colleagues can access it. When you share your work in a Dropbox folder, not having to copy the output PDF there manually after each update of the draft can be a big time saver.
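As a rough illustration of such housekeeping, here is a Python sketch. Everything in it is an assumption made for the example: the helper name tidy_up, the list of temporary file endings, and the paths out/paper, paper.pdf and shared do not come from the project. Since Snakemake rules can run Python code in a run: section, a helper along these lines could be called from a target rule, but a couple of shell commands would serve equally well.

```python
import shutil
import tempfile
from pathlib import Path

# Assumption: file endings of LaTeX by-products that are safe to delete.
TEMP_SUFFIXES = {".aux", ".log", ".out", ".toc"}


def tidy_up(build_dir, pdf_name, destinations):
    """Copy the compiled PDF to each destination, then delete LaTeX by-products."""
    build_dir = Path(build_dir)
    for dest in destinations:
        dest = Path(dest)
        dest.mkdir(parents=True, exist_ok=True)
        shutil.copy2(build_dir / pdf_name, dest / pdf_name)
    # Take a snapshot of the folder first, as we delete files while looping.
    for path in list(build_dir.iterdir()):
        if path.suffix in TEMP_SUFFIXES:
            path.unlink()


# Demo in a throwaway directory so that no real project files are touched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build = root / "out" / "paper"
    build.mkdir(parents=True)
    for name in ("paper.pdf", "paper.aux", "paper.log"):
        (build / name).write_text("placeholder")

    # Copy to the project root and to a hypothetical shared folder.
    tidy_up(build, "paper.pdf", [root, root / "shared"])

    remaining = sorted(p.name for p in build.iterdir())
    copied = (root / "paper.pdf").exists() and (root / "shared" / "paper.pdf").exists()
    print(remaining, copied)
```

After the call, only paper.pdf is left in the build folder and a copy exists in both destinations.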