Chapter 6 Wildcards

Learning Goals:

  • How to minimize repetition in the code

Adding a target rule to execute all regressions at the same time is good progress to simplify the execution of our project. In this chapter we want to make our life easier to write and maintain our project.

If we look at the solow_ rules we see that there is quite a lot of replication of code between them. All three rules use the same script, data, and model inputs. They only differ in the data subsetting found under {input.subset} and the output file define under {output.estimate}.

Ideally, we want Snakefile to feature the minimum amount of duplication possible.

This has a few advantages:

  • Limiting redundancy allows us to make changes to the code only once rather than multiple times (imagine changing the {data.input} file).
  • Any additional copy of code introduces extra opportunities of typos or other errors and makes the build file generally harder to read and navigate.
  • And if we’re completely honest, it’s also a just a bit ugly…

While maintaining duplicate code for three rules is still bearable, redundancy becomes a larger problem when we have many duplicate rule executions. Imagine managing 15 OLS specifications instead of three.

In this chapter we will learn how to unify such redundancies and collapse the three rules solow_nonoil, solow_oecd, and solow_intermediate into a single rule. To do so, we create a new variable, iSubset, that can iterate through the three .json files that contain the subset filters. In Snakemake, these variables are called wildcards.

6.1 The starting point of our Snakefile

The top of our Snakefile at the moment looks something like this:


# --- Build Rules --- #

rule solow_target:
    input:
        intermediate = "out/analysis/model_solow_subset_intermediate.rds",
        nonoil       = "out/analysis/model_solow_subset_nonoil.rds",
        oecd         = "out/analysis/model_solow_subset_oecd.rds"

rule solow_intermediate:
    input:
        script   = "src/analysis/estimate_ols_model.R",
        data     = "out/data/mrw_complete.csv",
        model    = "src/model-specs/model_solow.json",
        subset   = "src/data-specs/subset_intermediate.json"
    output:
        estimate = "out/analysis/model_solow_subset_intermediate.rds",
    shell:
        "Rscript {input.script} \
            --data {input.data} \
            --model {input.model} \
            --subset {input.subset} \
            --out {output.estimate}"

rule solow_nonoil:
    input:
        script   = "src/analysis/estimate_ols_model.R",
        data     = "out/data/mrw_complete.csv",
        model    = "src/model-specs/model_solow.json",
        subset   = "src/data-specs/subset_nonoil.json"
    output:
        estimate = "out/analysis/model_solow_subset_nonoil.rds",
    shell:
        "Rscript {input.script} \
            --data {input.data} \
            --model {input.model} \
            --subset {input.subset} \
            --out {output.estimate}"

rule solow_oecd:
    input:
        script   = "src/analysis/estimate_ols_model.R",
        data     = "out/data/mrw_complete.csv",
        model    = "src/model-specs/model_solow.json",
        subset   = "src/data-specs/subset_oecd.json"
    output:
        estimate = "out/analysis/model_solow_subset_oecd.rds",
    shell:
        "Rscript {input.script} \
            --data {input.data} \
            --model {input.model} \
            --subset {input.subset} \
            --out {output.estimate}"

[...]

where [...] stands for the bottom part of our Snakefile that we will not modify in this chapter.

6.2 Using wildcards to tidy up our code

Let us now replace the three solow_ rules by a single rule which we call solow_model where the unique parts of the {input.subset} and {output.estimate} are replaced with our wildcard variable {iSubset}.

The relevant part now looks something like this:


rule solow_target:
    input:
        intermediate = "out/analysis/model_solow_subset_intermediate.rds",
        nonoil       = "out/analysis/model_solow_subset_nonoil.rds",
        oecd         = "out/analysis/model_solow_subset_oecd.rds"

rule solow_model:
    input:
        script = "src/analysis/estimate_ols_model.R",
        data   = "out/data/mrw_complete.csv",
        model  = "src/model-specs/model_solow.json",
        subset = "src/data-specs/subset_{iSubset}.json"
    output:
        estimate = "out/analysis/model_solow_subset_{iSubset}.rds",
    shell:
        "Rscript {input.script} \
            --data {input.data} \
            --model {input.model} \
            --subset {input.subset} \
            --out {output.estimate}"

[...]

We notice right away how much more concise the relevant part of our script has become. With a bit of practice it is also much easier to read and understand what these rules do.

We wrapped our wildcard iSubset in curly parentheses so that Snakemake knows that this part is a variable which we want to substitute with the name of one of the subsets. This is conceptually similar to what we have done in our shell commands.

Let us clean our output folder again via

snakemake --cores 1 clean

and try to execute solow_model as a target rule:

$ snakemake --cores 1 solow_model
Building DAG of jobs...
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards.

What has happened? Snakemake will not execute a rule that contains wildcards because it does not know what values to substitute into iSubset.

Gladly, we already have a target rule which does not contain wildcards and explicitly specifies the input files we want to create, our trusty run_solow rule.

Let us therefore try to execute our target rule instead:

$ snakemake --cores 1 solow_target
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1   gen_regression_vars
    1   rename_vars
    3   solow_model
    1   solow_target
    6

[...]

[Thu Feb 18 19:50:04 2021]
localrule solow_target:
    input: out/analysis/model_solow_subset_intermediate.rds, out/analysis/model_solow_subset_nonoil.rds, out/analysis/model_solow_subset_oecd.rds
    jobid: 0

[Thu Feb 18 19:50:04 2021]
Finished job 0.
6 of 6 steps (100%) done

Nice! Our error is gone and we have produced the same output as before an a much more manageable manner.

We see one difference to our execution from the previous chapter in the printed job counts at the beginning. Instead of running all solow_* rules once, Snakemake sees that it needs to execute the solo_model rule instead three times — once for every subset.

We might ask ourselves how the wildcard approach has worked. Essentially, Snakemake knows it needs to create the three outputs model_solow_subset_intermediate.rds, model_solow_subset_nonoil.rds, and model_solow_subset_oecd.rds". To so so it looks for other rules which can produce these files. Snakemake finds the solow_model rule which can produce these files as its output when {iSubject} is replaced with intermediate, nonoil, and oecd respectively.

However, for each execution of the rule it needs to fill {iSubject} with the same value in both the output and input part. Snakemake therefore then checks whether it can find the appropriate input files when filling {iSubject} with the corresponding value. When these files exist for the same wildcard value, it will take the corresponding input file to execute the rule.