Chapter 8 Automating List Construction for Wildcard Expansion
8.1 Learning Objectives
TODO!
8.2 State of the Snakefile
TODO: Make this a box + code fold
To proceed with this chapter, we expect you Snakefile to look as follows:
# --- Dictionaries --- #
MODELS = [
"model_solow",
"model_aug_solow"
]
DATA_SUBSET = [
"subset_oecd",
"subset_intermediate",
"subset_nonoil"
]
PLOTS = [
"aug_conditional_convergence",
"conditional_convergence",
"unconditional_convergence"
]
# --- Build Rules --- #
rule run_models:
input:
expand("out/analysis/{iModel}.{iSubset}.rds",
iSubset = DATA_SUBSET,
iModel = MODELS)
rule model:
input:
script = "src/analysis/estimate_ols_model.R",
data = "out/data/mrw_complete.csv",
model = "src/model-specs/{iModel}.json",
subset = "src/data-specs/{iSubset}.json"
output:
estimate = "out/analysis/{iModel}.{iSubset}.rds"
shell:
"Rscript {input.script} \
--data {input.data} \
--model {input.model} \
--subset {input.subset} \
--out {output.estimate}"
rule make_figures:
input:
expand("out/figures/{iFigure}.pdf",
iFigure = PLOTS)
rule figure:
input:
script = "src/figures/{iFigure}.R",
data = "out/data/mrw_complete.csv",
subset = "src/data-specs/subset_intermediate.json"
output:
fig = "out/figures/{iFigure}.pdf"
shell:
"Rscript {input.script} \
--data {input.data} \
--subset {input.subset} \
--out {output.fig}"
rule gen_regression_vars:
input:
script = "src/data-management/gen_reg_vars.R",
data = "out/data/mrw_renamed.csv",
param = "src/data-specs/param_solow.json"
output:
data = "out/data/mrw_complete.csv"
shell:
"Rscript {input.script} \
--data {input.data} \
--param {input.param} \
--out {output.data}"
rule rename_vars:
input:
script = "src/data-management/rename_variables.R",
data = "src/data/mrw.dta"
output:
data = "out/data/mrw_renamed.csv"
shell:
"Rscript {input.script} \
--data {input.data} \
--out {output.data}"
rule clean:
shell:
"rm -rf out/*"
rule clean_data:
shell:
"rm -rf out/data/*"
rule clean_analysis:
shell:
"rm -rf out/analysis/*"
In the previous chapter when we wanted to expand a wildcard we manually specified the valued we wanted them to take by constructing a list. By doing so, the beginning of our Snakefile looks like this:
MODELS = [
"model_solow",
"model_aug_solow"
]
DATA_SUBSET = [
"subset_oecd",
"subset_intermediate",
"subset_nonoil"
]
FIGURES = [
"aug_conditional_convergence",
"conditional_convergence",
"unconditional_convergence"
]
This is not too problematic when we only have a few values that we want the wildcard to take,
but manually specifying long lists can get tedious and is prone to error.
Snakemake has a built in function, glob_wildcards
that will help us to remove the manual listing of values that we had to do so far.
8.3 The glob_wildcards
Function
Let’s start by trying to replace the MODELS
list that we manually specified with a more automated approach.
The glob_wildcards()
function takes one input - the path of the files that we want to search combined with the part of the file name we want to extract written as a wildcard.
It then looks to find anything that along the specidied path, collecting what it finds along the way.
Let’s look at glob_wildcards
in action to get a sense of what will happen.
Replace our manually specified MODELS
list with the following code:
This says look inside the src/model-specs/
folder, and return any files we find there.
{fname}
is the name of the wildcard we use to tell snakemake we want the filenames from inside the folder.
The name fname
itself doesn’t matter, we could use whatever we liked, but we decided {fname}
is suggestive we want filenames so it helps us to understand what the code is doing.
Next, we want to look at what is now stored in MODELS
.
This is a litle tricker than it should be, but if we add a print(MODELS)
command in our Snakefile we can get the value printed to screen.
Let’s do that, and then do a dry run.
then:
What we are interested in is the first printed lines (in white text):
Wildcards(fname=['.gitkeep', 'model_solow.json', 'model_aug_cc_restr.json',
'model_solow_restr.json', 'model_cc.json', 'model_ucc.json',
'model_aug_solow_restr.json', 'model_aug_cc.json', 'model_aug_solow.json'])
Here we see that all files are returned.
Compared to our original MODELS
list we see three differences
- There are more .json files, because there is potentially more specifications to run
- There is a .gitkeep file
- Each ‘fname’ ends with .json, which we did’t have earlier.
- is not a problem, it reflects that there is more potential analysis files that we haven’t manually specified.
But (2) and (3) are problematic.
We can remove the .gitkeep file and the .json file endings from the list with one step:
telling the
glob_wildcards()
function to only return the part of the filename that comes before the .json.
We do this by updating our glob_wildcards()
call as follows;
This says store in the list MODELS
the part of the filename that comes before .json
for all files in src/model-specs
.
The .gitkeep will also not be not returned, because this file does not have a .json ending:
Now let’s try the dry run again:
Now the first line is:
Wildcards(fname=['model_solow', 'model_aug_cc_restr',
'model_solow_restr', 'model_cc', 'model_ucc',
'model_aug_solow_restr', 'model_aug_cc', 'model_aug_solow'])
That’s definitely an improvement.
We note that the dry-run command is still giving us an error though, complaining about missing input files.
That is because currently MODELS
is not yet the the list that is was when we manually specified it.14
The list we want is inside the Wildcards()
class, and we can see it has the name fname
which is the name of the wildcard we assigned ourselves.
Our final step is to extract this list so that we can use it like our old MODELS
list.
We do this as follows:
Read this as “get the list called fname
from inside the Wildcards() object & assign it to the name MODELS
”.
Now if we do a dry-run:
We are returned the following lines at the beginning of the output:
['model_solow', 'model_aug_cc_restr', 'model_solow_restr',
'model_cc', 'model_ucc', 'model_aug_solow_restr',
'model_aug_cc', 'model_aug_solow']
...
which looks like the list we had manually entered, but with more elements.
The dry-run output also showed us that there are many steps it wants to run next time it is executed.
This is because it will want to obtain the output from each of 8 models estimated on all three of the data subsets in DATA_SUBSET
.
Recall that in Chapter XX we estimated two models, aug_solow
and solow
.
This means there are six models left to run, each across 3 data sets, for a total of 18 steps.
If we look through the dry run output we see that is indeed what is going to happen:
Building DAG of jobs...
Job counts:
count jobs
18 model
1 run_models
19
[Fri Feb 19 21:46:25 2021]
rule model:
input: src/analysis/estimate_ols_model.R, out/data/mrw_complete.csv, src/model-specs/model_aug_solow_restr.json, src/data-specs/subset_oecd.json
output: out/analysis/model_aug_solow_restr.subset_oecd.rds
jobid: 8
wildcards: iModel=model_aug_solow_restr, iSubset=subset_oecd
[Fri Feb 19 21:46:25 2021]
rule model:
input: src/analysis/estimate_ols_model.R, out/data/mrw_complete.csv, src/model-specs/model_solow_restr.json, src/data-specs/subset_nonoil.json
output: out/analysis/model_solow_restr.subset_nonoil.rds
jobid: 17
wildcards: iModel=model_solow_restr, iSubset=subset_nonoil
[Fri Feb 19 21:46:25 2021]
rule model:
input: src/analysis/estimate_ols_model.R, out/data/mrw_complete.csv, src/model-specs/model_aug_cc_restr.json, src/data-specs/subset_intermediate.json
output: out/analysis/model_aug_cc_restr.subset_intermediate.rds
jobid: 12
wildcards: iModel=model_aug_cc_restr, iSubset=subset_intermediate
[Fri Feb 19 21:46:25 2021]
rule model:
input: src/analysis/estimate_ols_model.R, out/data/mrw_complete.csv, src/model-specs/model_aug_solow_restr.json, src/data-specs/subset_nonoil.json
output: out/analysis/model_aug_solow_restr.subset_nonoil.rds
jobid: 24
wildcards: iModel=model_aug_solow_restr, iSubset=subset_nonoil
[Fri Feb 19 21:46:25 2021]
rule model:
input: src/analysis/estimate_ols_model.R, out/data/mrw_complete.csv, src/model-specs/model_ucc.json, src/data-specs/subset_oecd.json
output: out/analysis/model_ucc.subset_oecd.rds
jobid: 7
wildcards: iModel=model_ucc, iSubset=subset_oecd
[Fri Feb 19 21:46:25 2021]
rule model:
input: src/analysis/estimate_ols_model.R, out/data/mrw_complete.csv, src/model-specs/model_aug_cc.json, src/data-specs/subset_intermediate.json
output: out/analysis/model_aug_cc.subset_intermediate.rds
jobid: 10
wildcards: iModel=model_aug_cc, iSubset=subset_intermediate
[Fri Feb 19 21:46:25 2021]
rule model:
input: src/analysis/estimate_ols_model.R, out/data/mrw_complete.csv, src/model-specs/model_aug_cc.json, src/data-specs/subset_nonoil.json
output: out/analysis/model_aug_cc.subset_nonoil.rds
jobid: 18
wildcards: iModel=model_aug_cc, iSubset=subset_nonoil
[Fri Feb 19 21:46:25 2021]
rule model:
input: src/analysis/estimate_ols_model.R, out/data/mrw_complete.csv, src/model-specs/model_aug_cc_restr.json, src/data-specs/subset_nonoil.json
output: out/analysis/model_aug_cc_restr.subset_nonoil.rds
jobid: 20
wildcards: iModel=model_aug_cc_restr, iSubset=subset_nonoil
[Fri Feb 19 21:46:25 2021]
rule model:
input: src/analysis/estimate_ols_model.R, out/data/mrw_complete.csv, src/model-specs/model_cc.json, src/data-specs/subset_oecd.json
output: out/analysis/model_cc.subset_oecd.rds
jobid: 3
wildcards: iModel=model_cc, iSubset=subset_oecd
[Fri Feb 19 21:46:25 2021]
rule model:
input: src/analysis/estimate_ols_model.R, out/data/mrw_complete.csv, src/model-specs/model_solow_restr.json, src/data-specs/subset_intermediate.json
output: out/analysis/model_solow_restr.subset_intermediate.rds
jobid: 9
wildcards: iModel=model_solow_restr, iSubset=subset_intermediate
[Fri Feb 19 21:46:25 2021]
rule model:
input: src/analysis/estimate_ols_model.R, out/data/mrw_complete.csv, src/model-specs/model_aug_cc.json, src/data-specs/subset_oecd.json
output: out/analysis/model_aug_cc.subset_oecd.rds
jobid: 2
wildcards: iModel=model_aug_cc, iSubset=subset_oecd
[Fri Feb 19 21:46:25 2021]
rule model:
input: src/analysis/estimate_ols_model.R, out/data/mrw_complete.csv, src/model-specs/model_ucc.json, src/data-specs/subset_intermediate.json
output: out/analysis/model_ucc.subset_intermediate.rds
jobid: 15
wildcards: iModel=model_ucc, iSubset=subset_intermediate
[Fri Feb 19 21:46:25 2021]
rule model:
input: src/analysis/estimate_ols_model.R, out/data/mrw_complete.csv, src/model-specs/model_aug_cc_restr.json, src/data-specs/subset_oecd.json
output: out/analysis/model_aug_cc_restr.subset_oecd.rds
jobid: 4
wildcards: iModel=model_aug_cc_restr, iSubset=subset_oecd
[Fri Feb 19 21:46:25 2021]
rule model:
input: src/analysis/estimate_ols_model.R, out/data/mrw_complete.csv, src/model-specs/model_cc.json, src/data-specs/subset_nonoil.json
output: out/analysis/model_cc.subset_nonoil.rds
jobid: 19
wildcards: iModel=model_cc, iSubset=subset_nonoil
[Fri Feb 19 21:46:25 2021]
rule model:
input: src/analysis/estimate_ols_model.R, out/data/mrw_complete.csv, src/model-specs/model_aug_solow_restr.json, src/data-specs/subset_intermediate.json
output: out/analysis/model_aug_solow_restr.subset_intermediate.rds
jobid: 16
wildcards: iModel=model_aug_solow_restr, iSubset=subset_intermediate
[Fri Feb 19 21:46:25 2021]
rule model:
input: src/analysis/estimate_ols_model.R, out/data/mrw_complete.csv, src/model-specs/model_solow_restr.json, src/data-specs/subset_oecd.json
output: out/analysis/model_solow_restr.subset_oecd.rds
jobid: 1
wildcards: iModel=model_solow_restr, iSubset=subset_oecd
[Fri Feb 19 21:46:25 2021]
rule model:
input: src/analysis/estimate_ols_model.R, out/data/mrw_complete.csv, src/model-specs/model_cc.json, src/data-specs/subset_intermediate.json
output: out/analysis/model_cc.subset_intermediate.rds
jobid: 11
wildcards: iModel=model_cc, iSubset=subset_intermediate
[Fri Feb 19 21:46:25 2021]
rule model:
input: src/analysis/estimate_ols_model.R, out/data/mrw_complete.csv, src/model-specs/model_ucc.json, src/data-specs/subset_nonoil.json
output: out/analysis/model_ucc.subset_nonoil.rds
jobid: 23
wildcards: iModel=model_ucc, iSubset=subset_nonoil
[Fri Feb 19 21:46:25 2021]
localrule run_models:
input: out/analysis/model_solow_restr.subset_oecd.rds, out/analysis/model_aug_cc.subset_oecd.rds, out/analysis/model_cc.subset_oecd.rds, out/analysis/model_aug_cc_restr.subset_oecd.rds, out/analysis/model_aug_solow.subset_oecd.rds, out/analysis/model_solow.subset_oecd.rds, out/analysis/model_ucc.subset_oecd.rds, out/analysis/model_aug_solow_restr.subset_oecd.rds, out/analysis/model_solow_restr.subset_intermediate.rds, out/analysis/model_aug_cc.subset_intermediate.rds, out/analysis/model_cc.subset_intermediate.rds, out/analysis/model_aug_cc_restr.subset_intermediate.rds, out/analysis/model_aug_solow.subset_intermediate.rds, out/analysis/model_solow.subset_intermediate.rds, out/analysis/model_ucc.subset_intermediate.rds, out/analysis/model_aug_solow_restr.subset_intermediate.rds, out/analysis/model_solow_restr.subset_nonoil.rds, out/analysis/model_aug_cc.subset_nonoil.rds, out/analysis/model_cc.subset_nonoil.rds, out/analysis/model_aug_cc_restr.subset_nonoil.rds, out/analysis/model_aug_solow.subset_nonoil.rds, out/analysis/model_solow.subset_nonoil.rds, out/analysis/model_ucc.subset_nonoil.rds, out/analysis/model_aug_solow_restr.subset_nonoil.rds
jobid: 0
Job counts:
count jobs
18 model
1 run_models
19
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
Let’s run Snakemake to get all of the estimates.
The power of using glob_wildcards()
is clear.
It’s a relatively easy way to construct a list to iterate over as part of a wildcard expansion based on the names of files.
8.4 Looking at the DAG (redux)
With the addition of all of these new model estimates, Figure XX shows that the project’s DAG has expanded in complexity.
The DAG again makes clear what our workflow is doing. Reading from the bottom of the figure upwards:
- Snakemake wants to build the inputs listed in
run_models
. This is the pairwise combination of allMODELS
that were found insrc/model-specs
and allDATA_SUBSETS
found insrc/data-specs
- It can build those inputs by running the
model
rule \(8 \times 3 = 24\) times when it iterates over the wildcards - To run each model, it needs the clean data, so it needs to run the two data cleaning steps first
!!! REMARK !!!
Notice (again) how the figures aren’t produced automatically.
Do you remember why?
If not, revisit Chapter XX.
You can build them with snakemake --cores 1 make_figures
.
Chapter YY will show us how to get all the outputs build with one run of Snakemake.
Exercise: Exploring glob_wildcards()
The Snakemake file still has two manually specified lists, DATA_SUBSET
and PLOTS
.
- Use the
glob_wildcards()
function to automate the the list construction. - Clean the output directory
- Run Snakemake so that all models and all figures have been produced.
8.4.1 Solution
Will need to run Snakemake twice snakemake --cores 1 run_models
and snakemake --cores 1 make_figures
.
# --- Dictionaries --- #
MODELS = glob_wildcards("src/model-specs/{fname}.json").fname
DATA_SUBSET = glob_wildcards("src/data-specs/{fname}.json").fname
PLOTS = glob_wildcards("src/figures/{fname}.R").fname
# --- Build Rules --- #
rule run_models:
input:
expand("out/analysis/{iModel}.{iSubset}.rds",
iSubset = DATA_SUBSET,
iModel = MODELS)
rule model:
input:
script = "src/analysis/estimate_ols_model.R",
data = "out/data/mrw_complete.csv",
model = "src/model-specs/{iModel}.json",
subset = "src/data-specs/{iSubset}.json"
output:
estimate = "out/analysis/{iModel}.{iSubset}.rds"
shell:
"Rscript {input.script} \
--data {input.data} \
--model {input.model} \
--subset {input.subset} \
--out {output.estimate}"
rule make_figures:
input:
expand("out/figures/{iFigure}.pdf",
iFigure = PLOTS)
rule figure:
input:
script = "src/figures/{iFigure}.R",
data = "out/data/mrw_complete.csv",
subset = "src/data-specs/subset_intermediate.json"
output:
fig = "out/figures/{iFigure}.pdf"
shell:
"Rscript {input.script} \
--data {input.data} \
--subset {input.subset} \
--out {output.fig}"
rule gen_regression_vars:
input:
script = "src/data-management/gen_reg_vars.R",
data = "out/data/mrw_renamed.csv",
param = "src/data-specs/param_solow.json"
output:
data = "out/data/mrw_complete.csv"
shell:
"Rscript {input.script} \
--data {input.data} \
--param {input.param} \
--out {output.data}"
rule rename_vars:
input:
script = "src/data-management/rename_variables.R",
data = "src/data/mrw.dta"
output:
data = "out/data/mrw_renamed.csv"
shell:
"Rscript {input.script} \
--data {input.data} \
--out {output.data}"
rule clean:
shell:
"rm -rf out/*"
rule clean_data:
shell:
"rm -rf out/data/*"
rule clean_analysis:
shell:
"rm -rf out/analysis/*"
If one replaces
print(MODELS)
withprint(type(MODELS))
we learn thatMODELS
is currently aclass
object rather than a list. It remains unclear to us as authors why this is what the Snakemake designers made this decision, but we have learned to embrace it. We hope you do too.↩︎