Chapter 4 Working With Multiple Rules

Overview

Questions

How does Snakemake link rules together?
How does Snakemake know in which order to execute rules?
How can we delete project output and build our project from scratch?

Objectives

Explain how Snakemake rules depend on each other
Practice writing Snakemake rules
Explain what a clean rule is and why we use use them

Snakefile Status

At the end of the previous Chapter, our Snakefile has the following contents:

rule rename_vars:
    input:
        script = "src/data-management/rename_variables.R",
        data   = "src/data/mrw.dta"
    output:
        data = "out/data/mrw_renamed.csv"
    shell:
        "Rscript {input.script} \
            --data {input.data} \
            --out {output.data}"

At the beginning of each chapter, we will have a copy of the Snakefile contents of the previous chapter. This will be our starting point.

In the last chapter we learned how to write and execute a single Snakemake rule. Real world research workflow typically have more than one step, and require many scripts to be executed sequentially. Next, we will learn how to work with several rules and understand how rules interact with each other.

4.1 Creating a Second Rule

The second step in our data management is to create additional variables we will use in our regression analysis. The script gen_reg_vars.R in the src/data-management folder does this for us. We are going to build a rule called gen_regression_vars to run this script using Snakemake. To investigate what the inputs and outputs might be in this new rule, let’s look at the help output from the gen_reg_vars.R script:

$ Rscript src/data-management/gen_reg_vars.R --help

The following is returned to screen:

Options:
    -d CHARACTER, --data=CHARACTER
        a csv file name

    -p NUMERIC, --param=NUMERIC
        Solow`s Constant, as a numeric [default = 0.05]

    -o CHARACTER, --out=CHARACTER
        output file name [default = out.csv]

    -h, --help
        Show this help message and exit

We see that script expects three arguments:

--data, the file path of the input file, which is suggested to be a csv.
--out, the file path of the output file.
--param, Solow’s constant which defaults to 0.05, the value used in Mankiw, Romer, and Weil (1992). We can safely ignore this for now.

Let’s construct a new rule, called gen_regression_vars that can run this script. We place it above the rule rename_vars

rule gen_regression_vars:
    input:
        script = "src/data-management/gen_reg_vars.R",
        data   = "out/data/mrw_renamed.csv"
    output:
        data   = "out/data/mrw_complete.csv"
    shell:
        "Rscript {input.script} \
            --data {input.data} \
            --out {output.data}"

Note that when the --help suggests there is a default, and we want to use that default we do not have to use this option in our Rscript call.

To find out whether Snakemake wants to run any of the rules, let’s use the --summary option:

$ snakemake --summary

which yields the following console output:

Building DAG of jobs...
output_file                    date                     rule                  version   log-file(s) status     plan
out/data/mrw_complete.csv   -                           gen_regression_vars     -                   missing update pending
out/data/mrw_renamed.csv    Fri Jan 11 13:40:07 2019    rename_vars             -                   ok          no update

Snakemake knows that the output of the rename_vars rule is up to date. We can see this by looking at the status of the output file out/data/mrw_renamed.csv which is ok. The status is ok because none of the inputs to rename_vars has changed since we last ran Snakemake. Further, we can see that plan takes the value no update. This means Snakemake won’t build this file the next time it runs.

Looking at the first line, which displays the information for our new rule gen_regression_vars we see that the status of it’s output out/data/mrw_complete.csv is missing. This means that Snakemake does not see this file in the output directory, which makes sense as we have not run this rule before. We also see that plan takes the value update pending. Because the output is missing, Snakemake will execute this rule the next time it runs. This will create the file out/data/mrw_complete.csv, which updates it (in this case the update is going from not there to being there).

Let’s run Snakemake to build our new file:

$ snakemake --cores 1

When we look at our output directory

$ ls out/data/

we see that the second output file has been created.

mrw_complete.csv  mrw_renamed.csv

Now if we run the --summary command again, we see:

$ snakemake --summary

which yields:

Building DAG of jobs...
output_file                    date                     rule                  version   log-file(s) status     plan
out/data/mrw_complete.csv   Fri Jan 11 13:45:07 2019    gen_regression_vars     -                   ok          no update
out/data/mrw_renamed.csv    Fri Jan 11 13:40:07 2019    rename_vars             -                   ok          no update

Now we see that both the output files are ok and have no update in the plan column. This means that if we were to run Snakemake again, it will not execute any of the rules.

To see that this is true:

$ snakemake --cores 1

which yields the following console output:

Building DAG of jobs...
Nothing to be done.

as expected.

4.2 Rule Order Matters

In the text above we asked you to put the rule gen_regression_vars above the rule rename_vars. What would have happened if we put it underneath?

If you try it, what you would find when you run snakemake --summary is

Building DAG of jobs...
output_file                    date                     rule                  version   log-file(s) status     plan
out/data/mrw_renamed.csv    Fri Jan 11 13:40:07 2019    rename_vars             -                   ok          no update

which has no information about the new rule gen_regression vars. The question is why?

When you run a Snakemake command without adding any information about the rule you are interested in will look for the first rule it finds in the Snakefile. If our rules are in opposite order, Snakemake finds rename_vars first (it looks from top to bottom) and thus reports information about this rule. To find out information about the second rule in the file, we would need to run the summary command as:

$ snakemake --summary gen_regression_vars

And we would get information about the output of gen_regression_vars and rename_vars. If we wanted to run the rule gen_regression_vars, we would then execute Snakemake as:

$ snakemake --cores 1 gen_regression_vars

The main takeaway here is that it is good practice to put rules that are further along the analysis pipeline higher up the Snakefile. Make sure before you proceed to the next section, gen_regression_vars is above rename_vars so you get the same output as the text.

4.3 Dependencies Between Rules

Our two rules have a dependent relationship. To be precise, gen_regression_vars depends on the rule rename_vars. This is because the output of rename_vars, out/data/mrw_renamed.csv is an input to the rule gen_regression vars. Now that we have two rules and we as researchers know there’s a dependency between them, we want to know whether Snakemake is also aware of the dependency.

How can we check this? The following argument should hold: If when we update an input of rename_vars, this would potentially change it’s output file mrw_renamed.csv. This means that Snakemake should want to run the rule rename_vars on it’s next run. mrw_renamed.csv is an input to the rule gen_regression_vars. If an input to this rule changes, so might it’s output. Then Snakemake should also want to run gen_regression vars the next time it is run too.

Let’s see whether all of this is in fact true. Using our trick from the last chapter, let’s update the time-stamp of src/data-management/rename_variables.R:

$ touch src/data-management/rename_variables.R

The input file is now newer than the output. This is implicitly the same as having new content. Now let’s check out the summary of Snakemake:

$ snakemake --summary

which prints the following to screen:

Building DAG of jobs...
output_file                    date                     rule                  version   log-file(s) status                plan
out/data/mrw_complete.csv   Fri Jan 11 13:45:07 2019    gen_regression_vars     -                   ok                    update pending
out/data/mrw_renamed.csv    Fri Jan 11 13:40:07 2019    rename_vars             -                   updated input files   update pending

We see that for the rule rename_vars the status has changed to updated input files. This makes sense because it is exactly what happened. As a result, plan changes to update pending - meaning Snakemake will execute this rule the next time it runs. Next, we look at the rule gen_regression_vars. status is still ok, but we see that plan is also update_pending so this rule will also be executed next time Snakemake runs.

This is exactly what we had hoped - Snakemake detected the relationship between the two rules. It also uses this relationship when deciding its build plan, tracing out what rules deeper down the workflow are impacted by a change in an earlier step. This is a big advantage to us as researchers - now we don’t have to manually track the depedencies between workflows. We, as authors and as researchers, feel that this benefit cannot be understated.

Let’s run Snakemake once more to get the new outputs:

$ snakemake --cores 1

which prints the following to screen:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       gen_regression_vars
        1       rename_vars
        2

...

Finished job 0.
2 of 2 steps (100%) done

and updates our output files.

Exercise: Creating Rules

Mankiw, Romer, and Weil (1992) estimate the Solow model for three subsets of data. The R script src/analysis/estimate_ols.R will estimate a OLS model for a given dataset when you provide the necessary inputs.

The script expects an OLS specification. You can point to the basic OLS specification from Mankiw, Romer, and Weil (1992) in src/model-specs/model_solow.json.
You will find the necessary data subsetting conditions in src/data-specs.

The goal is to construct one rule for each subset of the data.

What inputs do you need to provide?
What outputs will be produced?
Write Snakemake rules to estimate the solow model for each subset of data.

Solution

What inputs do you need to provide?

--data the file path of a csv dataset as an input. We use the output of our last rule: out/data/mrw_complete.csv.
--model the file path of a regression model. We follow the hint and use src/model-specs/model_solow.json.
--subset the file path of a sub setting condition. We will use one of the three files in src/data-specs for each rule respectively.
--out the file path for the output file.

What outputs will be produced?

One .rds file for each ols model. This is a R data type appropriate for saving R objects. The rule therefore outputs one R object containing the regression results for each rule. Loading the .rds file in another script will allow us later to use the regression results to create regression tables and graphs.

Write Snakemake rules to estimate the solow model for each subset of data.

Incorporating everything will give us three rules similar to these:

rule solow_intermediate:
    input:
        script   = "src/analysis/estimate_ols_model.R",
        data     = "out/data/mrw_complete.csv",
        model    = "src/model-specs/model_solow.json",
        subset   = "src/data-specs/subset_intermediate.json"
    output:
        estimate = "out/analysis/model_solow_subset_intermediate.rds",
    shell:
        "Rscript {input.script} \
            --data {input.data} \
            --model {input.model} \
            --subset {input.subset} \
            --out {output.estimate}"

rule solow_nonoil:
    input:
        script   = "src/analysis/estimate_ols_model.R",
        data     = "out/data/mrw_complete.csv",
        model    = "src/model-specs/model_solow.json",
        subset   = "src/data-specs/subset_nonoil.json"
    output:
        estimate = "out/analysis/model_solow_subset_nonoil.rds",
    shell:
        "Rscript {input.script} \
            --data {input.data} \
            --model {input.model} \
            --subset {input.subset} \
            --out {output.estimate}"

rule solow_oecd:
    input:
        script   = "src/analysis/estimate_ols_model.R",
        data     = "out/data/mrw_complete.csv",
        model    = "src/model-specs/model_solow.json",
        subset   = "src/data-specs/subset_oecd.json"
    output:
        estimate = "out/analysis/model_solow_subset_oecd.rds",
    shell:
        "Rscript {input.script} \
            --data {input.data} \
            --model {input.model} \
            --subset {input.subset} \
            --out {output.estimate}"

If you want to run each of them, you need to run Snakemake three times:

$ snakemake --cores 1 solow_nonoil
$ snakemake --cores 1 solow_intermediate
$ snakemake --cores 1 solow_oecd

Do you remember why?

4.4 Clean Rules

We want to end this chapter with an extra rule which will make our life easier down the road.

So far we have built rules to move our replication of Mankiw, Romer, and Weil (1992) forward. As we continue to extend our Snakefile in the coming chapters we might want to be able to delete all of the produced outputs and make sure our projects builds cleanly from start to end.

The manual way to do this is to go to our terminal window and enter the following command each time

$ rm -rf out/*

which deletes all contents of the out folder.

To make this task a bit more comfortable, we write a “clean” rule which deletes all contents of the output folder.

We can create a rule called clean at the bottom of our Snakefile that stores the shell command from above:

rule clean:
    shell:
        "rm -rf out/*"

Note that this rule has no inputs or outputs.

To use this rule, we enter the following into our terminal:

$ snakemake --cores 1 clean

Notice that to use the clean rule we had to call the rule name, clean, explicitly. This is necessary as the rule is at the bottom of our Snakefile.

Now if we look out the output of running the summary call with snakemake, we see the following output:

snakemake --summary

Building DAG of jobs...
output_file                                        date   rule  version log-file(s)     status  plan
out/analysis/model_solow_subset_intermediate.rds    -     inter        -                missing update pending
out/analysis/model_solow_subset_nonoil.rds          -     nonoil       -                missing update pending
out/analysis/model_solow_subset_oecd.rds            -     oecd         -                missing update pending
out/data/mrw_complete.csv                           -     gen_regression_vars           missing update pending
out/data/mrw_renamed.csv                            -     rename_vars                   missing update pending

Which reveals Snakemake’s plan the next time its run will be to build all outputs.

Exercise: Creating Cleaning Rules

The cleaning rule we just created deletes everything in the out/ directory. Construct rules that would separately clean the out/data/ and out/analysis subdirectories. Why might we want to do this?

Solution

We only need to change the folders to delete in the shell command. A solution might look something like this:

rule clean_data:
    shell:
        "rm -rf out/data/*"

rule clean_analysis:
    shell:
        "rm -rf out/analysis/*"

Splitting the clean rule gives us a bit more control on what we want to refresh. In practice, data cleaning and wrangling can be computationally taxing and take a while to run through. Later in a project’s life cycle these task’s output will seldomly change. It can therefore be beneficial and sufficient to only reset only the analysis via a clean rule.

4.5 Our Workflow in Graphs

In this chapter we created four additional steps, creating the following rules:

gen_regression_vars
solow_nonoil
solow_oecd
solow_intermediate

The rules rename_vars and gen_regression vars are dependencies of the three solow rules. But each of the solow rules are currently separate worklows. This means we have three parallel workflows, as illustrated below:

Each of these workflows, that produce one regression each are very similar. We can see this by looking at the files that each rule uses:

The only difference is the model-specs file and the name of the output file. We will look to exploit this similarity in the next chapter to make our Snakefile look simpler.

B References

Mankiw, N. Gregory, David Romer, and David N. Weil. 1992. “A Contribution to the Empirics of Economic Growth*.” The Quarterly Journal of Economics 107 (2): 407–37. https://doi.org/10.2307/2118477.