Chapter 10 One rule to rule them all

We have now officially replicated all the data analysis in Mankiw, Romer, and Weil (1992). Our analysis pipeline runs via two simple commands,

  1. snakemake make_tables – to construct the regression tables,
  2. snakemake make_figures – to construct the figures

and also executes all necessary upstream steps:

  • two data management steps,
  • a total of 24 different regressions for different subsets of the data and regression specifications.

We will now create one last target rule that unifies our whole analysis: a single rule to rule them all.

10.1 The State of the Snakefile


##############################################
# DICTIONARIES
##############################################

MODELS = glob_wildcards("src/model-specs/{fname}.json").fname 

# Note: this filter is only needed because we are running an older version of the src files; it will be updated soon
DATA_SUBSET = glob_wildcards("src/data-specs/{fname}.json").fname
DATA_SUBSET = list(filter(lambda x: x.startswith("subset"), DATA_SUBSET))

PLOTS = glob_wildcards("src/figures/{fname}.R").fname

TABLES = glob_wildcards("src/table-specs/{fname}.json").fname


##############################################
# TARGETS
##############################################

rule make_tables:
    input:
        expand("out/tables/{iTable}.tex",
                iTable = TABLES)

rule run_models:
    input:
        expand("out/analysis/{iModel}.{iSubset}.rds",
                    iSubset = DATA_SUBSET,
                    iModel = MODELS)

rule make_figures:
    input:
        expand("out/figures/{iFigure}.pdf",
                    iFigure = PLOTS)


##############################################
# INTERMEDIATE RULES
##############################################

rule table:
    input:
        script = "src/tables/regression_table.R",
        spec   = "src/table-specs/{iTable}.json",
        models = expand("out/analysis/{iModel}.{iSubset}.rds",
                        iModel = MODELS,
                        iSubset = DATA_SUBSET),
    output:
        table = "out/tables/{iTable}.tex"
    shell:
        "Rscript {input.script} \
            --spec {input.spec} \
            --out {output.table}"

rule model:
    input:
        script = "src/analysis/estimate_ols_model.R",
        data   = "out/data/mrw_complete.csv",
        model  = "src/model-specs/{iModel}.json",
        subset = "src/data-specs/{iSubset}.json"
    output:
        estimate = "out/analysis/{iModel}.{iSubset}.rds"
    shell:
        "Rscript {input.script} \
            --data {input.data} \
            --model {input.model} \
            --subset {input.subset} \
            --out {output.estimate}"

rule figure:
    input:
        script = "src/figures/{iFigure}.R",
        data   = "out/data/mrw_complete.csv",
        subset = "src/data-specs/subset_intermediate.json"
    output:
        fig = "out/figures/{iFigure}.pdf"
    shell:
        "Rscript {input.script} \
            --data {input.data} \
            --subset {input.subset} \
            --out {output.fig}"

rule gen_regression_vars:
    input:
        script = "src/data-management/gen_reg_vars.R",
        data   = "out/data/mrw_renamed.csv",
        param  = "src/data-specs/param_solow.json"
    output:
        data   = "out/data/mrw_complete.csv"
    shell:
        "Rscript {input.script} \
            --data {input.data} \
            --param {input.param} \
            --out {output.data}"

rule rename_vars:
    input:
        script = "src/data-management/rename_variables.R",
        data   = "src/data/mrw.dta"
    output:
        data = "out/data/mrw_renamed.csv"
    shell:
        "Rscript {input.script} \
            --data {input.data} \
            --out {output.data}"


##############################################
# CLEANING RULES
##############################################

rule clean:
    shell:
        "rm -rf out/*"

rule clean_data:
    shell:
        "rm -rf out/data/*"

rule clean_analysis:
    shell:
        "rm -rf out/analysis/*"

10.2 Creating an all target rule

Our all rule unifies the make_figures and make_tables target rules.

Let’s take a look at the structure of each of these rules so that we get a sense of what we might need to do.

Here is the make_figures rule:

rule make_figures:
    input:
        expand("out/figures/{iFigure}.pdf",
                iFigure = PLOTS)

And the make_tables rule:

rule make_tables:
    input:
        expand("out/tables/{iTable}.tex",
                iTable = TABLES)

Both rules only have expanded file lists as inputs. To unite both under a single rule, it is therefore enough to list all figure and table inputs of these rules as inputs of our new rule.

This will give us something like this:

rule all:
    input:
        expand("out/figures/{iFigure}.pdf",
                iFigure = PLOTS),
        expand("out/tables/{iTable}.tex",
                iTable = TABLES)

It is best to make this rule the first rule in the Snakefile, because Snakemake builds the first rule by default when no target is named. The top of the TARGETS section above is a good place for it.
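To get a feel for what the two expand() calls in the all rule hand to Snakemake, here is a minimal Python sketch. The function expand_sketch is a hypothetical, simplified stand-in written for this illustration, not Snakemake's own implementation, and the file stems are the ones that appear in the Snakemake log of this chapter.

```python
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Simplified stand-in for Snakemake's expand(): fill the pattern
    with every combination of the supplied wildcard values."""
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


# File stems as they appear in the Snakemake log of this chapter.
PLOTS = [
    "aug_conditional_convergence",
    "conditional_convergence",
    "unconditional_convergence",
]
TABLES = [f"table_{i:02d}" for i in range(1, 7)]

# The all rule simply requests the union of both expanded lists.
all_inputs = (
    expand_sketch("out/figures/{iFigure}.pdf", iFigure=PLOTS)
    + expand_sketch("out/tables/{iTable}.tex", iTable=TABLES)
)

for path in all_inputs:
    print(path)
```

The resulting list contains three figure paths and six table paths. Snakemake then works backwards from these files to find the rules that can produce them.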

Before continuing, let us clean our project folder once more via

snakemake --cores 1 clean

Having the all rule as the first rule in the Snakefile allows us to execute our entire project implicitly, without naming a target:

snakemake --cores 1
Building DAG of jobs...
Job counts:
    count   jobs
    1   all
    3   figure
    1   gen_regression_vars
    24  model
    1   rename_vars
    6   table
    36

[...]

[Sun Feb 21 22:16:41 2021]
localrule all:
    input: out/figures/aug_conditional_convergence.pdf,
    out/figures/conditional_convergence.pdf, out/figures/unconditional_convergence.pdf,
    out/tables/table_06.tex, out/tables/table_01.tex, out/tables/table_02.tex,
    out/tables/table_03.tex, out/tables/table_04.tex, out/tables/table_05.tex
    jobid: 0

[Sun Feb 21 22:16:41 2021]  
Finished job 0.  
36 of 36 steps (100%) done

And that’s it! Our rule has executed all 36 steps necessary to replicate the entire analysis of Mankiw, Romer, and Weil (1992).

If we examine the full Snakemake output which we omitted here, we see the order in which Snakemake executed the jobs:

  1. First the data management steps are executed to prepare the data
    • This is what we would expect, as we cannot do any analysis without cleaned data
  2. Then all figures are produced
    • Figures are the first input to the all rule and therefore produced first.
  3. Next all tables are produced
    • They are the final inputs to the all rule
  4. Last the all rule is executed
    • As a good target rule, this does nothing substantial.

We note three more things:

  1. The make_figures and make_tables target rules are not executed. The all rule sits parallel to both (has the same inputs as them) and does not require their output (they do not produce any, after all).
  2. The order of jobs in the count statistic at the beginning of the output does not reflect the actual build order.
  3. Some R console output is mixed in between the Snakemake messages. Essentially, this happens because our terminal application is too slow at printing messages to keep up with all the computational steps running in the background. The interleaving can be safely ignored: it does not reflect the actual execution of the steps, which is of course properly ordered.
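The build order described above follows from the dependencies between the rules in our Snakefile. As a rough illustration, the following Python sketch writes down those rule-level dependencies by hand and sorts them with the standard library's graphlib module (Python 3.9 or later). This is not how Snakemake computes its DAG internally; it only shows the ordering constraints that any valid execution has to respect.

```python
from graphlib import TopologicalSorter

# Each rule maps to the rules whose outputs it needs,
# read off the input/output sections of the Snakefile above.
DEPENDS_ON = {
    "rename_vars": set(),
    "gen_regression_vars": {"rename_vars"},
    "model": {"gen_regression_vars"},
    "figure": {"gen_regression_vars"},
    "table": {"model"},
    "all": {"figure", "table"},
}

# static_order() yields one ordering in which every rule
# appears after all of the rules it depends on.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(order)
```

The data management rules always come first and all always comes last. The relative position of rules that do not depend on each other, such as figure and model, is not fixed by the dependencies alone.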

10.3 Concluding Remarks

We have now created a fully reproducible replication workflow. With a single line of code we can execute the whole analysis of Mankiw, Romer, and Weil (1992).

It is now also a breeze to make modifications to the analysis. Snakemake and the all rule allow us to simply re-run all modified steps and all downstream operations that depend on them so that all results are updated accordingly. We can even add additional models, regression tables or figures without having to remember to account for them in our build script. Using glob_wildcards will always ensure that all additional analysis is performed.
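To see why newly added files are picked up automatically, consider the following Python sketch. It mimics the idea behind glob_wildcards with a hypothetical helper, discover_stems, in a temporary folder; the file names model_a, model_b and model_c are made up for this illustration and are not part of our project.

```python
import tempfile
from pathlib import Path


def discover_stems(folder, suffix):
    """Return the sorted file stems in `folder` that end in `suffix`.
    A simplified stand-in for what glob_wildcards does for one wildcard."""
    return sorted(p.name[: -len(suffix)] for p in Path(folder).glob(f"*{suffix}"))


with tempfile.TemporaryDirectory() as tmp:
    specs = Path(tmp) / "model-specs"
    specs.mkdir()

    # Two hypothetical model specifications exist at first.
    for stem in ("model_a", "model_b"):
        (specs / f"{stem}.json").write_text("{}")
    before = discover_stems(specs, ".json")

    # Adding a third file is all it takes to extend the list.
    (specs / "model_c.json").write_text("{}")
    after = discover_stems(specs, ".json")

print(before)
print(after)
```

Because the list is rebuilt from the folder contents every time the workflow is evaluated, a new specification file becomes part of the targets without any change to the Snakefile.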

This state of our project is a natural end point for any researcher or research assistant who performs the data analysis in a project with colleagues who do not use Snakemake themselves. In such a workflow the data analysis and drafting stages of slides or a research paper are often separate processes.

For a single researcher, or a group of researchers who all work with Snakemake, we have created the next chapter. There we will also automate the generation of a research paper draft and presentation slides, in a single workflow together with the data analysis.

B References

Mankiw, N. Gregory, David Romer, and David N. Weil. 1992. “A Contribution to the Empirics of Economic Growth.” The Quarterly Journal of Economics 107 (2): 407–37. https://doi.org/10.2307/2118477.