Chapter 10 One rule to rule them all
We have now officially replicated all the data analysis in Mankiw, Romer, and Weil (1992). Our analysis pipeline runs via two simple commands,
- snakemake make_tables – to construct the regression tables, and
- snakemake make_figures – to construct the figures.
Each of these commands also executes all necessary upstream steps:
- two data management steps, and
- a total of 24 different regressions across different subsets of the data and regression specifications.
We will now create one last target rule that unifies all of our analysis under a single rule to rule them all.
10.1 The State of the Snakefile
##############################################
# DICTIONARIES
##############################################
MODELS = glob_wildcards("src/model-specs/{fname}.json").fname
# note: this filter is only needed because we are running an older version of the src files; it will be updated soon
DATA_SUBSET = glob_wildcards("src/data-specs/{fname}.json").fname
DATA_SUBSET = list(filter(lambda x: x.startswith("subset"), DATA_SUBSET))
PLOTS = glob_wildcards("src/figures/{fname}.R").fname
TABLES = glob_wildcards("src/table-specs/{fname}.json").fname
##############################################
# TARGETS
##############################################
rule make_tables:
    input:
        expand("out/tables/{iTable}.tex",
               iTable = TABLES)

rule run_models:
    input:
        expand("out/analysis/{iModel}.{iSubset}.rds",
               iSubset = DATA_SUBSET,
               iModel = MODELS)

rule make_figures:
    input:
        expand("out/figures/{iFigure}.pdf",
               iFigure = PLOTS)
##############################################
# INTERMEDIATE RULES
##############################################
rule table:
    input:
        script = "src/tables/regression_table.R",
        spec = "src/table-specs/{iTable}.json",
        models = expand("out/analysis/{iModel}.{iSubset}.rds",
                        iModel = MODELS,
                        iSubset = DATA_SUBSET),
    output:
        table = "out/tables/{iTable}.tex"
    shell:
        "Rscript {input.script} \
            --spec {input.spec} \
            --out {output.table}"

rule model:
    input:
        script = "src/analysis/estimate_ols_model.R",
        data = "out/data/mrw_complete.csv",
        model = "src/model-specs/{iModel}.json",
        subset = "src/data-specs/{iSubset}.json"
    output:
        estimate = "out/analysis/{iModel}.{iSubset}.rds"
    shell:
        "Rscript {input.script} \
            --data {input.data} \
            --model {input.model} \
            --subset {input.subset} \
            --out {output.estimate}"

rule figure:
    input:
        script = "src/figures/{iFigure}.R",
        data = "out/data/mrw_complete.csv",
        subset = "src/data-specs/subset_intermediate.json"
    output:
        fig = "out/figures/{iFigure}.pdf"
    shell:
        "Rscript {input.script} \
            --data {input.data} \
            --subset {input.subset} \
            --out {output.fig}"

rule gen_regression_vars:
    input:
        script = "src/data-management/gen_reg_vars.R",
        data = "out/data/mrw_renamed.csv",
        param = "src/data-specs/param_solow.json"
    output:
        data = "out/data/mrw_complete.csv"
    shell:
        "Rscript {input.script} \
            --data {input.data} \
            --param {input.param} \
            --out {output.data}"

rule rename_vars:
    input:
        script = "src/data-management/rename_variables.R",
        data = "src/data/mrw.dta"
    output:
        data = "out/data/mrw_renamed.csv"
    shell:
        "Rscript {input.script} \
            --data {input.data} \
            --out {output.data}"
##############################################
# CLEANING RULES
##############################################
rule clean:
    shell:
        "rm -rf out/*"

rule clean_data:
    shell:
        "rm -rf out/data/*"

rule clean_analysis:
    shell:
        "rm -rf out/analysis/*"
10.2 Creating an all target rule
Our all rule unifies the make_figures and make_tables target rules.
Let’s take a look at the structure of each of these rules so that we get a sense of what we might need to do.
Here is the make_figures rule:
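rule make_figures:
    input:
        expand("out/figures/{iFigure}.pdf",
               iFigure = PLOTS)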
And the make_tables rule:
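rule make_tables:
    input:
        expand("out/tables/{iTable}.tex",
               iTable = TABLES)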
Both rules have only expanded files as inputs. To unite them under a single rule, it is therefore enough to list all the figure and table inputs of these rules as the inputs of our new rule.
This will give us something like this:
rule all:
    input:
        expand("out/figures/{iFigure}.pdf",
               iFigure = PLOTS),
        expand("out/tables/{iTable}.tex",
               iTable = TABLES)
It is best to place this rule as the first rule in the Snakefile. Making it the first rule in the TARGETS section above is a good place for it.
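With this placement, the top of the TARGETS section now reads as follows (the remaining target rules stay where they are):
##############################################
# TARGETS
##############################################

rule all:
    input:
        expand("out/figures/{iFigure}.pdf",
               iFigure = PLOTS),
        expand("out/tables/{iTable}.tex",
               iTable = TABLES)

# ... followed by the existing make_tables, run_models, and make_figures rules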
Before continuing, let us clean our project folder once more via:
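snakemake clean    # newer Snakemake releases may also require a cores flag, e.g. snakemake --cores 1 clean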
Having the all rule as the first rule in the Snakefile allows us to execute our entire project implicitly via:
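snakemake    # no target needed: Snakemake builds the first rule, all, by default (newer releases may also require e.g. --cores 1)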
Building DAG of jobs...
Job counts:
count jobs
1 all
3 figure
1 gen_regression_vars
24 model
1 rename_vars
6 table
36
[...]
[Sun Feb 21 22:16:41 2021]
localrule all:
input: out/figures/aug_conditional_convergence.pdf,
out/figures/conditional_convergence.pdf, out/figures/unconditional_convergence.pdf,
out/tables/table_06.tex, out/tables/table_01.tex, out/tables/table_02.tex,
out/tables/table_03.tex, out/tables/table_04.tex, out/tables/table_05.tex
jobid: 0
[Sun Feb 21 22:16:41 2021]
Finished job 0.
36 of 36 steps (100%) done
And that’s it! Our rule has executed all 36 steps necessary to replicate the entire analysis of Mankiw, Romer, and Weil (1992).
If we examine the full Snakemake output, which we omitted here, we see the order in which Snakemake executed the jobs:
- First, the data management steps are executed to prepare the data.
  - This is what we would expect, as we cannot do any analysis without cleaned data.
- Then all figures are produced.
  - Figures are the first input to the all rule and are therefore produced first.
- Next, all tables are produced.
  - They are the final inputs to the all rule.
- Last, the all rule itself is executed.
  - As a good target rule, it does nothing substantial.
We note three more things:
- The make_figures and make_tables target rules are not executed. The all rule sits parallel to both (it has the same inputs as they do) and does not require their output (they do not produce any, after all).
- The order of jobs in the job count statistic at the beginning of the output does not reflect the actual build order.
- Some R console output is mixed in between the Snakemake messages. This is simply because our terminal prints messages more slowly than the computational steps complete in the background. It does not reflect the actual order of the steps, which is of course correct, and it can be safely ignored.
10.3 Concluding Remarks
We have now created a fully reproducible replication workflow. With a single line of code we can execute the whole analysis of Mankiw, Romer, and Weil (1992).
It is now also a breeze to make modifications to the analysis.
Snakemake and the all rule allow us to re-run only the modified steps and all downstream operations that depend on them, so that all results are updated accordingly.
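For example, suppose we tweak one of the model specifications (the file name below is illustrative; touching the file stands in for editing it). Re-running Snakemake then rebuilds only the affected estimates and the tables that use them, while the figures are left untouched:
touch src/model-specs/model_solow.json   # hypothetical spec file; any edit updates its timestamp
snakemake                                # re-runs the affected model jobs and the downstream table jobs only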
We can even add additional models, regression tables, or figures without having to remember to account for them in our build script.
Using glob_wildcards ensures that any additional analysis is always picked up and performed.
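For example, dropping a new plotting script into src/figures/ is all it takes (the file name here is hypothetical); glob_wildcards recomputes PLOTS the next time the Snakefile is parsed:
# new script saved as src/figures/residual_plot.R (hypothetical name)
snakemake    # PLOTS now includes residual_plot, so out/figures/residual_plot.pdf is built as well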
This state of our project is a natural end point for any researcher or research assistant who performs the data analysis in a project with colleagues who do not use Snakemake themselves. In such a workflow the data analysis and drafting stages of slides or a research paper are often separate processes.
For a single researcher, or a group of researchers who all work jointly with Snakemake, we have written the next chapter. There we will also automate the generation of a research paper draft and presentation slides in a single workflow, together with the data analysis.
References
Mankiw, N. Gregory, David Romer, and David N. Weil. 1992. “A Contribution to the Empirics of Economic Growth.” The Quarterly Journal of Economics 107 (2): 407–37. https://doi.org/10.2307/2118477.