Reproducible Data Analytic Workflows for Economics
An Introduction to Snakemake
This Version: 2021-03-14 – Draft Under Active Development
Preface
Why we wrote this tutorial
Empirical research workflows in economics and other social sciences typically take the form of a set of steps that need to be performed in a given order. For example, one may need to run a set of data cleaning and merging scripts one after the other before creating a set of summary statistics and figures, and then running some empirical modelling such as a series of linear regressions. These linear regression results need to be compiled into tables, and these tables need to be included in a research paper and a slide deck.
The number of steps involved in completing an entire project from initial data preparation, through to analysis and compiling the final results in a document quickly becomes large and the inter-relationships between steps complex.
Our own experience, and discussions with colleagues suggest that such a workflow quickly becomes unwieldy. It is difficult to remember the relationships between steps and hard to write them down in a way that co-authors and our future selves can access without blood pressures & stress levels rising. We found ourselves frustrated with the headaches manual management of our workflow created and searched for better ways of managing our research workflow.
This tutorial introduces the Snakemake workflow management system as a way to execute research workflows and manage the dependencies between successive steps that make up our workflow.
Snakemake is a Python 3 package that is available both via the conda
and pip
package management systems available to all Python users.
We have found the Snakemake system to be the best solution to our various workflow needs across our own research in business, economics and social science due to a combination of its’ readabilty and the ease of scaling up as the project becomes increasingly complex.
Intuitively you can think of Snakemake workflows as a set of of human readable steps designed to create a replicable analyses. We the researcher will need to write down each step in the form of a ‘recipe’
- what goes in,
- what comes out, and
- how do we combine the inputs to produce an output.
Once we have written these steps, Snakemake will execute them for us and track the dependencies between steps. This means if we update step \(x\), for example by changing how we merge data, then Snakemake will then look for other steps in our workflow that would use this updated data and re-execute them and any steps further down our analysis automatically.1
Our hope is that this tutorial provides a gentle introduction to Snakemake and replicable, automated workflow management. The tutorial assumes no knowledge of Snakemake or other workflow management software. Each chapter aims to build up knowledge through guided examples combined with exercises for which we provide solutions. Our aim is to build up a working understanding of how to use the software so that at the end of the tutorial the reader could get a new or existing project up and running using the Snakemake paradigm with minimal additional overhead.
While this tutorial is targeted at researchers who are considering adopting Snakemake for the first time, we hope it can also serve as a useful reference for people who have some working knowledge of Snakemake. We often come back to the material ourselves to find snippets of code to include and adapt into our own research workflows and anticipate others may do the same.
Pre-requisite Knowledge
To successfully work through this tutorial you will need some programming background with R, python and the command line. We have found that familiarity with the following concepts puts you in good stead to understand what follows:
- Working Knowledge of
R
- Including writing command line programs, and Dynamic Reports with
knitr
andrmarkdown
- For a quick review see Software Carpentry’s ‘Programming with R’ Lesson
- Including writing command line programs, and Dynamic Reports with
- Basic Knowledge of
Python
(Snakemake is written in Python 3)- Data Types, variable creation, lists and dictionaries
- For a quick review see sections 1, 5 and 12 of of Software Carpentry’s ‘Programming with Python’ Lesson
- Basic Working Knowledge of a
bash
terminal- File system structure, Changing directories, Making directories, Removing Files and Directories, Running a script from command line
- For a quick review see sections 1, 2, 3 and 6 of Software Carpentry’s ‘Unix Shell’ Lesson
What Needs to be Downloaded & Installed for this Tutorial
Template Scripts and Documents
The backbone of our tutorial is a dataset and a set of R
scripts that need to be executed in a particular order for the data analysis.
Two additional Rmarkdown
documents include templates of a paper and presentation slides which we will build alongside the analysis.
You will need to download these to work through the tutorial and build up your own Snakemake workflow.
We provide two ways to download these files:
Download a zip archive here. After downloading, proceed to unzip the folder and move it to a location in your computer that you feel comfortable to work with.2
Download the template scripts using Git:
When we have taught this tutorial live in person in the past, we have created a folder on each learner’s computer which we tell them to think of as a “safe space” where they can’t accidentally move, delete or otherwise modify any files on their computer they deem important by accident.
For example, if learners have a ‘coursework’ folder, we ask them to create a new sub-folder called “20YY-MM-programming class”, where YY and MM are the year and month we are giving the class.
We then ask them to download this folder or clone the repository inside this space.
Software Installation
In addition to the R
scripts and data, you will need to have some software installed on your computer.
You need to have the following installed:
- Access to a
bash
style terminal - Python 3 (Python 3.6 or higher)
- Snakemake
- R (version 4.0.x)
- Some R packages for additional functionality
We provide installation instructions in the Appendix.
What We Can’t Cover
Our goal of providing a gentle introduction to workflow management via Snakemake means that we have deliberately sacrificed details.
In particular, we do not cover:
- Basics of Python, R, or shells scripts
- A detailed discussion about the inner workings of Snakemake
- A wide-ranging introduction to all of Snakemake’s features (which continue to grow)
If you are interested in either of these, we recommend you look at the links to materials provided above, the Snakemake documentation (LINK) and the supporting technical publication (LINK). We have found the latter two of these references extremely useful – but quite dense and difficult to process for new-comers.
Acknowledgements
First and foremost, we would like to thank the Department of Economics at the University of Zurich for giving us the opportunity to
The idea of
This tutorial is a extension of the idea and implementation of reproducible workflow management that we taught during these classes.
We are indebted to the 2016 - 2020 cohorts of the PhD and research assistants who
Their questions, confused looks and XX have all improved
We would like to thank our colleagues and friends who co-taught the Programming Practices classes at
Lachlan and Julian were introduced to the concept of workflow management software by Hans-Martin von Gaudecker during a short course on Effective Programming Practices in 20XX. The introduction to Hans-Martin’s Templates for “Reproducible Research Projects in Economics” using Waf and (now) pytask provided much inspiration for the structure of our own template and this tutorial.
License
Instructional material
All instructional material made available under the tutorial titled “Reproducible Data Analytic Workflows with Snakemake and R
”
and authored by Ulrich Bergmann, Lachlan Deer and Julian Langer is
made available under the Creative Commons Attribution
license. The following is a human-readable summary of
(and not a substitute for) the full legal text of the CC BY 4.0
license.
You are free:
- to Share—copy and redistribute the material in any medium or format
- to Adapt—remix, transform, and build upon the material
for any purpose, even commercially.
The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
- Attribution—You must give appropriate credit (mentioning that your work is derived from this tutorial and, where practical, linking to the material http://lachlandeer.github.io/snakemake-econ-r-tutorial), provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
No additional restrictions—You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits. With the understanding that:
Notices:
- You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation.
- No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.
Software
Except where otherwise noted, the example programs and other software provided by Ulrich Bergmann, Lachlan Deer and Julian Langer are made available under the OSI-approved MIT license.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Citation
Please cite as:
Bergmann, U, Deer, L and Langer, J. (2021, February).
“Reproducible Data Analytic Workflows with Snakemake and R
: An Extended Tutorial for Researchers in Business, Economics and the Social Sciences”.
February 2021 (Version v2021.1.0)
https://lachlandeer.github.io/snakemake-econ-r-tutorial
@misc{bdl2020snakemakeecon,
title={Reproducible Data Analytic Workflows with Snakemake and `R`: An Extended Tutorial for Researchers in Business, Economics and the Social Sciences},
author={Ulrich Bergmann and Lachlan Deer and Julian Langer},
year={2021},
url = "https://lachlandeer.github.io/snakemake-econ-r-tutorial"
}
Learning Objectives
- Here is one
- and another
mmm
- Here is one
- and another
Title Title
Heres an exercise
Warning
warning warning
warning warning
Tip
tip tip
tip tip