This vignette describes the ideal data analytical workflow used by the Schola Empirica team.
Objective
To do analysis and create great reports reproducibly and efficiently.
To make our lives easier.
To make life easier for our future selves.
Principles
1. Reproducible projects
Project structure should make it obvious what happens where, and the whole project should be rerunnable quickly, perhaps on new data. Ideally, the structure also facilitates efficiency, i.e. during analysis, steps that only need to be done once are not rerun.
2. Reproducible reports
Everything we submit to a client/partner/stakeholder should be based on data and code that together reproducibly create the report: no hacks, no manual edits if at all possible, no copy-pasting images from one place to another.
Version control helps us track when each report was created, using what code and what data, and can help us go back and fix things if needed. It also helps us avoid mostly duplicate versions of everything lying around.
Code and data are real.
3. Consistent style of outputs
Reports and charts are based on well-designed styles and templates which we use consistently.
Building blocks
On reproducibility frameworks
The workflow described here does not rely on a particular framework that would enforce a project structure and a way of orchestrating all the bits together.
It is more lightweight: it suggests a structure and a way of working, and provides suggested integrators (such as build.R, shared.R and the 00_ data load scripts). It relies on a simple order of execution and on the analyst putting the right bits of code in the right places.
You are free to change any of this, but also responsible for making sure that the system you create actually works.
This lightweight approach also does not provide any optimisation with regard to build time, like a make- or drake-based workflow would (e.g. by only rebuilding outputs when the code has changed).
If you would like a more rigid framework or need one that optimizes for computing time, look through your options in the Resources section below.
1. Project initiation via the project template
The reschola
package provides an RStudio project
template that (a) takes care of setting up your project on Github (if
you let it) and (b) creates a default project structure, incorporating
key parameters that you give it in setup.
2. Default project structure
Feel free to adapt this in any way that works and remains understandable to someone who is not you.
- shared.R for variables and perhaps functions shared by several scripts. By default it contains the GDrive URL and project title, if provided during setup.
- 001_retrieve-data.R helps you download files from your GDrive folder, if set. You can also use it to store code for retrieving other data. This should only hold things which you expect to run once, or refresh rarely - particularly things that take time or put a load on other servers.
- 002_read-data.R should hold code that reads the data and does any transformations immediately tied to data reading, e.g. setting data types or basic filtering. Again, this is code that you don't expect to change as you work on the actual analysis. You may want to save the result into rds files in data-intermediate (or data-input if it is simply an rds mirror of the input data saved for quick access); see the sketch after this list.
- 003_check-and-process-data.R (if you plan to run it often, you may wish to signal this by numbering it 01_* so that it reruns with the rest of the analysis, and you may also turn it into an Rmd file if that is more convenient). This should process data in data-input and save its outputs in data-processed.
- [NN]_*.Rmd, where NN is 01-98, is the actual analysis - it may be an exploratory script, a partial analysis, or a report. These are expected to be run in the order of their numbering, but ideally key components should work off data saved in data-input or data-intermediate.
- data-input should contain only unaltered input data as downloaded.
- data-processed should contain processed data files.
- data-output should contain data that are an output of your project and that you expect to share externally.
- charts-output and reports-output for the obvious.
- 99_reproducibility.R by default contains a description of the system and environment used to run the analysis. Use it to store any other information useful for reproducing the analysis (but not passwords etc.).
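As a small illustration of the intermediate-data idea, here is a minimal sketch of how a 002_read-data.R script might end; the file names are made up for the example:

```r
# a sketch for the end of 002_read-data.R: save the freshly read data as rds
# so later scripts can load it quickly ("responses" is an illustrative name)
dta <- readr::read_csv("data-input/responses.csv")
readr::write_rds(dta, "data-intermediate/responses.rds")
```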
You should feel free to move code between the analysis and the
00*
scripts as you discover data transformations that
should be made earlier on in the workflow.
Use build.R
to tie these together - when run, it should
rebuild the whole project from scratch, except perhaps downloading data.
You may want to build different versions of build_*.R
as
helpers for running different parts of the workflow while you work. If
you deal with lots of build scripts, use the wonderful (pun intended) {buildr} package.
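To make the order of execution concrete, here is a minimal sketch of what a build.R might look like; the script names are the defaults from the project template and the report file pattern is an assumption:

```r
# build.R - a minimal sketch: rebuild the whole project from scratch
source("shared.R")                     # shared variables and helpers
# source("001_retrieve-data.R")        # data download; usually run only rarely
source("002_read-data.R")              # read raw data, save intermediate files
source("003_check-and-process-data.R") # checks and processing

# render the numbered analysis/report Rmd files in order (pattern is an assumption)
for (rmd in sort(list.files(pattern = "^0[1-9].*\\.Rmd$"))) {
  rmarkdown::render(rmd)
}
```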
3. Document templates
There are two templates in the package: the Schola PDF report (preferred) and the Schola Word report.
The PDF report uses LaTeX to create a typographically correct, fully vector-graphics document.
The Word report is simple: it creates a Word document with some nice custom defaults and styles.
4. ggplot2 theme
reschola
offers a ggplot2 theme,
theme_schola()
, which provides some sensible aesthetic
defaults, including font choice, to make charts beautiful and
consistent.
The desired approach is to use this theme, alter its parameters if
needed, and then if necessary make other changes using another
theme()
call.
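For instance, a minimal sketch of this approach; the data and the extra tweak are purely illustrative:

```r
library(ggplot2)
library(reschola)

# use theme_schola(), then make any further tweaks in a separate theme() call
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  labs(title = "Fuel economy vs weight") +
  theme_schola() +
  theme(plot.title = element_text(face = "bold"))
```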
There are also a small number of other plotting utilities.
See the Making charts vignette for details on everything graphics related.
Step by step
1. Start a new project
In RStudio, go to
File > New Project > New directory > Standard Schola Empirica Project
.
Ideally, start from a clean RStudio session with no project open.
Fill in the fields (only the directory name is mandatory), switch the Git
menu to get a (reschola/your) Github repo if you wish, select other
options if needed, check “Open in new session”, and click
Create
.
Other ways are possible but this gives you a good starting point and takes care of a lot of the setup hassle for you.
Not committing sensitive data to git
The data-input and data-processed directories have .gitignore files in them to stop you from committing sensitive data to git. Commit these .gitignore files to git, but only alter them if you are sure you know what you are doing.
Google Authentication
If you haven’t used the googledrive
package before, the
package may ask for authorisation to access Google Drive. This is
legitimate and you should grant access. This happens in the browser and
on some machines may cause project initiation to freeze or stop. If
that happens, run googledrive::drive_auth()
, delete the
directory created by the previous project creation attempt, and try
creating the project again.
renv: managing package dependencies
One of the most annoying barriers to reproducibility is when packages on which your code depends change over time and, as a result, your code breaks or behaves differently.
The most convenient and sophisticated way to handle this is to use the renv dependency management system.
In short, it makes sure that your project holds a complete record of the exact package versions you are using when creating it.
When to use it:
Always, really, but especially when:
- you start collaborating on a project with someone else
- you are putting a project aside for a while
- you want it to run on a remote machine, e.g. to build and publish a website through Travis CI or Github Actions
What it does:
- creates a project-specific library
- installs the packages on which your code depends into it
- records the exact versions of these packages in something called a lockfile - a small text file that you keep and commit into git alongside your code
- sets up the project such that the project library is rebuilt with the minimum amount of hassle when you open it up in the future or someone else clones it.
To avoid wasting time, disk space and download bandwidth, renv keeps a copy of all the package versions used in your projects in a shared per-computer cache. The project libraries only contain links to that cache. That way you are not committing the package code into your project, nor do the package files sit in your project directory, and they only get downloaded once even if multiple projects' libraries use the same version of a package.
All you need to do is call renv::init()
at the beginning
and then renv::snapshot()
anytime you install new packages
or commit code.
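A minimal sketch of that routine; renv::restore() is included for the case when you or a collaborator open the project on another machine:

```r
renv::init()      # once, when setting up the project
# ...install packages and write code as usual...
renv::snapshot()  # record exact package versions in renv.lock; commit that file
renv::restore()   # later, or on another machine: rebuild the library from renv.lock
```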
See https://rstudio.github.io/renv/ for an intro to the package and https://environments.rstudio.com/ for a broader intro.
Renv is not part of the standard project setup in reschola so as not to increase the complexity of project initiation, but using it is strongly recommended.
2. Download the data
Use 001_retrieve-data.R
; if you have other data
retrieval, ideally the code for it should live here.
Do not edit the data by hand.
See tips for some packages that can help you retrieve data from public sources or other systems.
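As an illustration, here is a sketch of what the download code in 001_retrieve-data.R might look like, using the googledrive package directly and assuming the folder URL comes from shared.R; the URL below is a placeholder:

```r
library(googledrive)

gdrive_url <- "https://drive.google.com/drive/folders/your-folder-id"  # placeholder
files <- drive_ls(as_id(gdrive_url))

# download every file in the project folder into data-input/
for (i in seq_len(nrow(files))) {
  drive_download(files[i, ],
                 path = file.path("data-input", files$name[i]),
                 overwrite = TRUE)
}
```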
3. Read in, check and process the data
Use 002_read-data.R
. Add any other data reading that is
needed.
See readr::locale()
for handling encoding, decimal marks
and separators in CSVs. You might also need
readr::read_csv2()
.
Use readr::guess_encoding()
if the text comes in
garbled.
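For example, a sketch of reading a semicolon-separated CSV with a decimal comma and a Central European encoding; the file name and encoding are assumptions:

```r
library(readr)

guess_encoding("data-input/survey.csv")   # check first if the text comes in garbled

dta <- read_csv2(
  "data-input/survey.csv",
  locale = locale(encoding = "windows-1250", decimal_mark = ",")
)
```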
Use 003_check-and-process-data.R
. You may need to move
this into an RMarkdown document.
See tips for packages that can help you set up a structured data checking pipeline.
4. Explore/Analyse the data
I suggest you keep your data exploration in a separate script from your report; often the EDA will happen in the report as you go, but a better process is perhaps to develop bits of your analysis in one script/Rmd and only move into the report Rmd the code that is essential for building the report.
An RMarkdown Notebook might be an appropriate format for this.
See tips.html#data-exploration-1 for a list of appropriate tools for data exploration.
Approaches
There is a real trade-off here: one way to do it is to work through the analysis in the report script, perhaps hiding most of it through chunk options (include = FALSE) and outputting only the relevant parts in the final format.
That way you get a sense of the thought process, but also a bloated and circuitous script. Another way is to do the analysis in one or more files and only move into the report the bits that are needed there.
That way you get a tight report script, but one at times disconnected from the analytical process.
One way to lighten the load is to hive off some work into partial RMarkdown files, typically named _something.Rmd, and then “insert” them into the main document via a chunk with the child option:

```{r, child = "_something.Rmd"}
```
See RMarkdown Cookbook on child documents.
5. Write reports using a template
Use draft_*
to quickly create a draft using the required
template.
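For example, RMarkdown's generic draft mechanism can be used as below; note that the template directory name "schola_word" and the file name are assumptions here, so check the package reference for the exact draft_* helpers and template names:

```r
# a sketch: create a new report from the package's Word template
rmarkdown::draft("05_report.Rmd", template = "schola_word", package = "reschola")
```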
The Word output of the two templates is aesthetically equivalent, but
the _word
template (and output format as set in the YAML
header) can do more sophisticated handling of e.g. cross-references.
Note the schola_word
format is based on the
bookdown::word_document2
format. This means it can be
customised like other
bookdown documents and even strung into a whole book.
Footnotes and cross-references
You can create a cross-reference to any section, e.g. link to section
Methods using [Methods]
or
[the methods section][Methods]
. This will show up as a link
in Word and HTML.
Create a footnote by using
Text^[This is a footnote]
.
You can also refer to tables, figures and equations. This only works
in the schola_word
output format (template).
Do it like this (note that @
is escaped with
\
):
- “See figure \@ref(fig:graf6).” to refer to a figure in a chunk named graf6
- “As table \@ref(tab:tab3) shows...” to refer to a table in a chunk named tab3
Note that these chunks need to have the fig.cap option set to a non-empty string, and they need to have a chunk name without underscores or any special characters (camelCase style is recommended). Yihui Xie says:
Try to avoid spaces, periods (.), and underscores (_) in chunk labels and paths. If you need separators, you are recommended to use hyphens (-) instead. For example, setup-options is a good label, whereas setup.options and chunk 1 are bad; fig.path = ‘figures/mcmc-’ is a good path for figure output, and fig.path = ‘markov chain/monte carlo’ is bad.
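Putting this together, a minimal chunk that could be cross-referenced with \@ref(fig:responses-by-wave) might look like this; the label, caption and data are illustrative:

```{r responses-by-wave, fig.cap="Responses by survey wave"}
# assumes ggplot2/reschola are loaded in the setup chunk and that `dta`
# (an illustrative data frame with a `wave` column) was created earlier
ggplot(dta, aes(wave)) +
  geom_bar() +
  theme_schola()
```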
Edit _bookdown.yml
to change the words used for “Figure”
and “Table” in captions (doesn’t apply for PDF).
See more on cross-references in the bookdown guide.
Citations
Knitr and Rmarkdown incorporate a system for managing citations and bibliographies, which can take reference lists from a number of citation managers. For the basics, see the RMarkdown site, details are in the RMarkdown Cookbook.
Web output
In principle, if you want an HTML file you can just switch the format to html_document and it should work fine, though some details might differ slightly. See tips on how to get that online.
Parameterising
If you expect your report to be rerun in some time with different data or a different parameter, like a changed date or name of something, you can make your report parameterised. See this brief guide or a longer explanation.
This is also useful if you are running the same report for a number of units of something, e.g. for different waves of research or different geographical units - see how the Urban Institute does it.
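For instance, a parameterised report can be rendered for a given unit from a build script; the report name, the wave parameter and the output path below are illustrative and assume the report's YAML header declares a matching params entry:

```r
# a sketch: render the same report for wave 2 of a survey
rmarkdown::render(
  "04_report.Rmd",
  params = list(wave = 2),
  output_file = "reports-output/report-wave-2.docx"
)
```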
Visualising
Charts should be created using ggplot2 as far as possible. Use the theme_schola() theme.
6. Iterate steps 3-6
None of this is a linear process. The only requirement is that from an external point of view (and that includes you in three months or two years), the process of rebuilding the report(s) and the entire project is linear.
But as you work, you will find bits of code that belong somewhere else; you will make data transformations in your report that you will then realize you can move to your data transformation script. You will load new data in a script and then move that loading code to an earlier script. That is fine - it will happen gradually through iteration, but the iteration should also move you towards more organised code.
The logic described by Emily Riederer in her RMarkdown driven development approach may be helpful here.
In the end, the scripts should follow these principles:
- each should be able to run separately, in the sense that it doesn’t fail, that is
- it reads its own data (possibly written by a previous script)
- it should write its data if another script is expected to use them (though ideally this would all be done by a data-transformation script early on)
- it should load its libraries, shared variables and functions
Don’t forget to update the README.md and other documentation as you
go, as well as build.R
and any other build*.R
scripts you may have.
Feel free to use git to go back and forth. Version control is your friend here. Something broke? You can go back to when it worked.
See workflow guidance for a primer on git and Github, which should be a core part of your process.
When a draft report goes out e.g. to stakeholders for feedback, it might be useful to create a git tag:
You can also designate certain snapshots as special with a tag, which is a name of your choosing. In a software project, it is typical to tag a release with its version, e.g., “v1.0.3”. For a manuscript or analytical project, you might tag the version submitted to a journal or transmitted to external collaborators. Figure 20.1 shows a tag, “draft-01”, associated with the last commit.
7. Finalise report
Run reschola::manage_docx_header_logos()
to replace
default Schola logo or add a client/funder logo.
8. Prepare project for reuse
Really just make sure that you have followed the steps. If you have, then:
- files should be named properly and in the correct order
- each Rmd file should run without error
- build.R should contain all scripts needed to run the whole thing in the right order; so should any other build-type script you may have created
- README.md should contain workable instructions
Additionally, you should use renv
and snapshot the state
of the project library using renv::snapshot()
.
Tools for implementing good practices
Tidyverse approach and tidy data
See R for Data Science by Hadley Wickham and Garrett Grolemund.
The rest mostly draws on What They Forgot to teach you about R, which seems to have the remedy to many common pains of working with R.
Blank slates
See the RStudio part in setup for the options to set for this.
Safe paths
Use the here
package instead of setwd()
to
make sure paths just
work.
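A minimal sketch; the file path is illustrative:

```r
library(here)

# paths are built from the project root, so the script works regardless of
# the working directory it is run from
dta <- readr::read_rds(here("data-processed", "survey.rds"))
```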
File naming conventions
See Naming things by Jenny Bryan.
- machine readable: ASCII, no spaces, sensible separators (_ between parts, - between words)
- human readable: descriptive words in the title, consistent logic across files
- plays well with default ordering: zero-padded numbers (01, not 1, so that it sorts correctly next to 11), sensible separators, all lowercase, YYYY-MM-DD dates
Safe storage of secret and confidential information
- use .Renviron for passwords (or look at keyring), never hard code them (usethis::edit_r_environ()); see the sketch after this list
- store individual data on team GDrive and only download for analysis; do not commit to git
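A minimal sketch of that pattern, where MY_SERVICE_PASSWORD is a hypothetical variable name you would define in .Renviron (via usethis::edit_r_environ()):

```r
# read the secret from the environment instead of hard-coding it in the script
pwd <- Sys.getenv("MY_SERVICE_PASSWORD")
if (pwd == "") stop("MY_SERVICE_PASSWORD is not set - add it to .Renviron")
```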
Ignore files in git
You can add files you do not wish to commit to git’s ‘.gitignore’ file. That way, git will not even show those as new/changed. This works separately for each repo.
The easiest way to do this is to run
e.g. usethis::use_git_ignore("secret_file.R")
.
You need to commit the .gitignore
file.
Other bits of good practice to follow:
In R code:
- Document why you wrote that code (not necessarily just what it does, which should be obvious)
- don’t use T and F for TRUE and FALSE
- use only trustworthy packages, ideally CRAN-based unless there is a good reason to do otherwise.
See the goodpractice package for a list of good practices and automated checks for them.
With version control
- Write informative commit messages
- don’t overwrite history (force push) unless absolutely necessary
- commit often, push carefully
- pushed commits should contain complete changes such that the code will run without errors.
- always work in UTF-8: save code in RStudio as UTF-8, save input CSV as UTF-8, and save any R text output as UTF-8.
Resources for building reproducible workflows
- Sharla Gelfand on reproducible reporting with RMarkdown at rstudio::conf(2020)
- Emily Riederer on RMarkdown driven development
- Wilson et al. 2016, “Good Enough Practices in Scientific Computing”
- rrtools for research compendia