Good project layout helps ensure the
Do not manually edit raw data
Keep a clean pipeline of data processing from raw to analytical.
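As a rough illustration of such a pipeline, the sketch below writes a made-up raw file, cleans it in code, and saves the result to a separate folder, leaving the raw file untouched. The temporary directory, the toy values and the complete-cases cleaning rule are all assumptions for the example, not part of the workshop materials.

```r
# Sketch only: a temporary directory stands in for the project root
root <- file.path(tempdir(), "pipeline-demo")
raw_dir <- file.path(root, "data", "raw")
clean_dir <- file.path(root, "data", "clean")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(clean_dir, recursive = TRUE, showWarnings = FALSE)

# Pretend this file arrived from the data provider (made-up values)
raw_path <- file.path(raw_dir, "gapminder.csv")
write.csv(data.frame(country = c("A", "B", NA), pop = c(100, NA, 50)),
          raw_path, row.names = FALSE)
raw_md5 <- tools::md5sum(raw_path)

# Every cleaning step happens in code, never by hand in the raw file
raw <- read.csv(raw_path)
clean <- raw[complete.cases(raw), ]

# The analytical data go to a separate folder
clean_path <- file.path(clean_dir, "gapminder.csv")
write.csv(clean, clean_path, row.names = FALSE)

# The raw file is byte-for-byte unchanged after processing
identical(unname(tools::md5sum(raw_path)), unname(raw_md5))
```

Because every change lives in the script, the clean file can be regenerated from the raw file at any time.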
When your project is new and shiny, the script file usually contains many lines of directly executed code.
As it matures, reusable chunks get pulled into their own functions.
The actual analysis scripts then become relatively short, and use the functions defined in separate R scripts.
here
Use the function here::here("path", "to", "file") to create robust paths relative to the project root, e.g.
here::here("path", "to", "file")
## [1] "/Users/Anna/Documents/workflows/workshops/ACCE/path/to/file"
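A sketch of how this is typically used in a script: build the path from its components and hand it to a reader function. The data/raw/gapminder.csv location is taken from the project tree used later in this material; the file.exists() guard is an addition so the snippet runs even where that file is absent.

```r
# Build a path relative to the project root rather than the working directory
raw_path <- here::here("data", "raw", "gapminder.csv")

# Read the file only if it is actually there (guard added for illustration)
if (file.exists(raw_path)) {
  gapminder_raw <- read.csv(raw_path)
  head(gapminder_raw)
}
```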
Just pick a sensible / standardised file system structure and stick to it
Here I will show the approach advocated in Marwick B, Boettiger C, Mullen L. (2018) Packaging data analytical work reproducibly using R (and friends) PeerJ Preprints 6:e3192v2 https://doi.org/10.7287/peerj.preprints.3192v2
Abstract Computers are a central tool in the research process, enabling complex and large scale data analysis. As computer-based research has increased in complexity, so have the challenges of ensuring that this research is reproducible. To address this challenge, we review the concept of the research compendium as a solution for providing a standard and easily recognisable way for organising the digital materials of a research project to enable other researchers to inspect, reproduce, and extend the research. We investigate how the structure and tooling of software packages of the R programming language are being used to produce research compendia in a variety of disciplines. We also describe how software engineering tools and services are being used by researchers to streamline working with research compendia. Using real-world examples, we show how researchers can improve the reproducibility of their work using research compendia based on R packages and related tools.
Proposal for a reproducible research compendium:
Wickham, H. (2017) Research compendia. Note prepared for the 2017 rOpenSci Unconf
A scripts/ directory that contains R scripts (.R), notebooks (.Rmd), and intermediate data.
A DESCRIPTION file that provides metadata about the compendium. Most importantly, it would list the packages needed to run the analysis. It would also contain a field to indicate that this is an analysis, not a package.
An R/ directory which contains R files that provide high-stakes functions.
A data/ directory which contains high-stakes data.
A tests/ directory that contains unit tests for the code and data.
A vignettes/ directory that contains high-stakes reports.
A build system which executes the contents of scripts/ and vignettes/, producing final data and reports.
A .Rbuildignore file that ignores the scripts/ directory.
A man/ directory which contains roxygen2-generated documentation for the reusable functions and data.
It’s like agreeing that we will all drive on the left or the right. A hallmark of civilization is following conventions that constrain your behavior a little, in the name of public safety.
Jenny Bryan on Project-oriented workflows
A place for everything, everything in its place.
Benjamin Franklin
You will be able to find your way around any of your projects
You will be able to find your way around any project by others following the same convention
You will be able to find your way around any R package on GitHub!
R/ dir
Create new function scripts using usethis::use_r("functionality-name")
usethis::use_r("gapminder_process")
.
├── R
│   └── gapminder_process.R
├── data
│   ├── clean
│   ├── metadata
│   │   ├── gapminder_meta.csv
│   │   └── gapminder_meta_shell.csv
│   └── raw
│       └── gapminder.csv
├── gapminderRR.Rproj
└── scripts
    └── helper01_create-metadata.R
rel_pop_get: takes a single-country subset of the gapminder dataframe and calculates relative population compared to the population in that country at the base_year.
rel_pop_get <- function(df, base_year){
  df$pop_rel <- df$pop/df$pop[df$year == base_year]
  df
}
gapmider_rel_pop: does a few checks with respect to base_year and, if all is good, applies rel_pop_get to the subset data of each country.
gapmider_rel_pop <- function(gapminder_df, base_year = 1952) {
  # check that the base year is a valid year.
  valid_b_y <- base_year %in% unique(gapminder_df$year) & is.numeric(base_year)
  if (!valid_b_y) {
    stop("base_year ", base_year,
         " not a valid year in gapminder dataset. \nValid years are: ",
         paste(unique(gapminder_df$year), collapse = ","))
  }
  # group by country and calculate population relative to baseline year
  gapminder_df %>%
    dplyr::filter(year >= base_year) %>%
    dplyr::group_by(country) %>% dplyr::do(rel_pop_get(., base_year)) %>%
    dplyr::ungroup()
}
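To make the group-by step less opaque, here is a base R sketch of the same split-apply-combine idea on a made-up two-country data frame. It deliberately swaps dplyr::group_by() and dplyr::do() for split(), lapply() and rbind(), so it illustrates the logic rather than reproducing the function above; the toy values are invented.

```r
# Helper from above, repeated so the snippet runs on its own
rel_pop_get <- function(df, base_year) {
  df$pop_rel <- df$pop / df$pop[df$year == base_year]
  df
}

# Made-up data: two countries, two years each
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(1952, 1957, 1952, 1957),
  pop     = c(100, 150, 10, 30)
)

# Split by country, apply the helper to each piece, then bind the pieces together
pieces <- lapply(split(toy, toy$country), rel_pop_get, base_year = 1952)
toy_rel <- do.call(rbind, pieces)

toy_rel$pop_rel
# [1] 1.0 1.5 1.0 3.0
```

Each country is scaled by its own 1952 population, which is why both countries start at 1.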
Roxygen2 allows you to write specially-structured comments preceding each function definition. These are processed automatically to produce .Rd help files for your functions and to control which functions are exported to the namespace.
See Karl Broman’s blogpost on writing Roxygen2 documentation
install.packages("roxygen2")
In RStudio, you can insert a roxygen skeleton by placing the cursor anywhere in the definition of a function, then clicking:
Code > Insert Roxygen Skeleton
#' Title
#'
#' @param gapminder_df
#' @param base_year
#'
#' @return
#' @export
#'
#' @examples
gapmider_rel_pop <- function(gapminder_df, base_year = 1952) {
  # check that the base year is a valid year.
  valid_b_y <- base_year %in% unique(gapminder_df$year) & is.numeric(base_year)
  if (!valid_b_y) {
    stop("base_year ", base_year,
         " not a valid year in gapminder dataset. \nValid years are: ",
         paste(unique(gapminder_df$year), collapse = ","))
  }
  # group by country and calculate population relative to baseline year
  gapminder_df %>%
    dplyr::filter(year >= base_year) %>%
    dplyr::group_by(country) %>% dplyr::do(rel_pop_get(., base_year)) %>%
    dplyr::ungroup()
}
roxygen notation is indicated by beginning a line with #'.
The first line will be the title for the function.
After the title, include a blank #' line and then write a longer description.
@param argument_name: description of the argument.
@return: description of what the function returns.
@export: tells Roxygen2 to add this function to the NAMESPACE file, so that it will be accessible to users.
@importFrom pkgname functionname: imports a function directly from another package, allowing us to use it without specifying the namespace (i.e. without using ::). Here, by using @importFrom dplyr %>% I’m importing the pipe so I can use it directly in the function.
#' Calculate population relative to a baseline year
#'
#' Calculate population at different timepoints, for each country in the gapminder
#' dataset relative to a baseline year. Append result to the dataframe as a new column.
#' @param gapminder_df dataframe of gapminder data
#' @param base_year numeric. Year to be taken as baseline for the calculation
#' of relative population change
#'
#' @return gapminder_df with relative population added as column pop_rel
#' @export
#'
#' @importFrom dplyr %>%
#' @examples
gapmider_rel_pop <- function(gapminder_df, base_year = 1952) {
  # check that the base year is a valid year.
  valid_b_y <- base_year %in% unique(gapminder_df$year) & is.numeric(base_year)
  if (!valid_b_y) {
    stop("base_year ", base_year,
         " not a valid year in gapminder dataset. \nValid years are: ",
         paste(unique(gapminder_df$year), collapse = ","))
  }
  # group by country and calculate population relative to baseline year
  gapminder_df %>%
    dplyr::filter(year >= base_year) %>%
    dplyr::group_by(country) %>% dplyr::do(rel_pop_get(., base_year)) %>%
    dplyr::ungroup()
}
I’m not exporting this function, so full documentation is not necessary.
# calculate population relative to a baseline year from gapminder single country
# subset
rel_pop_get <- function(df, base_year) {
  df$pop_rel <- df$pop/df$pop[df$year == base_year]
  df
}
Create a new .R script using usethis::use_r().
Copy the two functions I created above (including the Roxygen documentation).
Write an additional function that takes gapminder_df after relative population has been calculated, calculates the % change compared to the baseline population in different years, adds it as a new column pop_perc to gapminder_df and returns the whole dataframe.
(Hint: a relative population of 1 represents 0% change.)
#' Calculate population percentage change from baseline
#'
#' Calculate population percentage change from baseline from relative population
#' @param gapminder_df dataframe of gapminder data. Must contain pop_rel column.
#'
#' @return gapminder_df with population percentage change from baseline appended
#' as column pop_perc
#' @export
#'
pop_perc_get <- function(gapminder_df){
  gapminder_df$pop_perc <- (gapminder_df$pop_rel - 1) * 100
  gapminder_df
}
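A quick check of the hint on made-up values: a relative population of 1 should map to 0% change, 1.5 to a 50% increase and 0.5 to a 50% decrease. The toy data frame is an assumption for illustration, and the function is repeated so the snippet runs on its own.

```r
# Function from above, repeated for a self-contained example
pop_perc_get <- function(gapminder_df) {
  gapminder_df$pop_perc <- (gapminder_df$pop_rel - 1) * 100
  gapminder_df
}

# Made-up relative populations: no change, +50%, -50%
toy <- data.frame(pop_rel = c(1, 1.5, 0.5))

pop_perc_get(toy)$pop_perc
# [1]   0  50 -50
```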
DESCRIPTION file
The job of the DESCRIPTION file is to store important metadata about our analysis package.
usethis::use_description()
.
├── DESCRIPTION
├── R
│   └── gapminder_process.R
├── data
│   ├── clean
│   ├── metadata
│   │   ├── gapminder_meta.csv
│   │   └── gapminder_meta_shell.csv
│   └── raw
│       └── gapminder.csv
├── gapminderRR.Rproj
└── scripts
    └── helper01_create-metadata.R
DESCRIPTION file skeleton
Package: gapminderRR
Version: 0.0.0.9000
Title: What the Package Does (One Line, Title Case)
Description: What the package does (one paragraph).
Authors@R: person("First", "Last", email = "first.last@example.com", role = c("aut", "cre"))
License: What license is it under?
Encoding: UTF-8
LazyData: true
ByteCompile: true
DESCRIPTION file complete
Package: gapminderRR
Version: 0.0.0.9000
Title: Analysing gapminder data to teach reproducible research in R
Description: Analysing gapminder data to teach reproducible research in R. In
    particular, we cover using R package development tools and conventions using usethis,
    devtools and rrtools, literate programming using rmarkdown and version control through
    git and github.
Authors@R: person("Anna", "Krystalli", email = "annakrystalli@googlemail.com", role = c("aut", "cre"))
License: What license is it under?
Encoding: UTF-8
LazyData: true
ByteCompile: true
RoxygenNote: 6.0.1
Restart RStudio for the package Build tab to appear: Session > New Session.
In the Build tab, click More > Clean and Rebuild.
.
├── DESCRIPTION
├── NAMESPACE
├── R
│ └── gapminder_process.R
├── data
│ ├── clean
│ │ └── gapminder.csv
│ ├── metadata
│ │ ├── gapminder_meta.csv
│ │ └── gapminder_meta_shell.csv
│ └── raw
│ └── gapminder.csv
├── gapminderRR.Rproj
├── man
│ └── gapmider_rel_pop.Rd
└── scripts
└── helper01_create-metadata.R
Try ?gapmider_rel_pop !!
Or
gapmider_rel_pop(gapminder::gapminder)
Click on the Check button (📋 ✅)
So far we have one dependency, package dplyr. How can we make sure it is available when someone wants to reproduce our work and use our functions? Already our package build checks are complaining at us!
Using package pacman
Using a DESCRIPTION file (the approach we will use)
Using package packrat
pacman
Nice package that can install (if required) and load specified packages.
Here’s a snippet of code that can do this:
# install and load dependencies through pkg "pacman"
pkgs <- c("dplyr")
if (!require("pacman")) { install.packages("pacman") }
pacman::p_load(pkgs, character.only = TRUE)
Can’t specify the exact version of the package installed (defaults to the latest version on CRAN).
packrat
Creates a local library of packages used in an analysis within the project.
Isolated: Installing a new or updated package for one project won’t break your other projects, and vice versa. That’s because packrat gives each project its own private package library.
Portable: Easily transport your projects from one computer to another, even across different platforms. Packrat makes it easy to install the packages your project depends on.
Reproducible: Packrat records the exact package versions you depend on, and ensures those exact versions are the ones that get installed wherever you go.
Packrat is definitely worth consideration (see the package documentation and the managing dependencies section in the Programming chapter of the BES Guide to Reproducible Code in Ecology and Evolution), BUT it can significantly increase the size of projects and leave functionality siloed.
DESCRIPTION
We can add dependencies to our DESCRIPTION file using the function usethis::use_package()
usethis::use_package("dplyr")
We can specify minimum version requirements but not the exact version of a package.
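In DESCRIPTION, a minimum version is written in the form pkg (>= x.y.z). The same kind of constraint can also be checked at run time; the sketch below uses base R only. The helper has_min_version() is hypothetical (it is not part of usethis or devtools), and the stats package stands in for a real dependency so the example runs on any R installation.

```r
# Hypothetical helper: TRUE if `pkg` is installed at `min_version` or newer
has_min_version <- function(pkg, min_version) {
  requireNamespace(pkg, quietly = TRUE) &&
    utils::packageVersion(pkg) >= min_version
}

# stats ships with R, so its version matches the running R version
has_min_version("stats", "3.0.0")    # an old minimum, easily satisfied
has_min_version("stats", "999.0.0")  # a minimum no installation can meet
```

Note that this only expresses a lower bound, in line with the point above: nothing here pins an exact version.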
dplyr has now been added to Imports:
Package: gapminderRR
Version: 0.0.0.9000
Title: Analysing gapminder data to teach reproducible research in R
Description: Analysing gapminder data to teach reproducible research in R. In
    particular, we cover using R package development tools and conventions using usethis,
    devtools and rrtools, literate programming using rmarkdown and version control through
    git and github.
Authors@R: person("Anna", "Krystalli", email = "annakrystalli@googlemail.com", role = c("aut", "cre"))
License: What license is it under?
Encoding: UTF-8
LazyData: true
ByteCompile: true
RoxygenNote: 6.0.1
Imports:
    dplyr
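Because DESCRIPTION is a plain DCF file, the declared dependencies can also be read back programmatically, for example to check which ones are installed on a collaborator's machine. A minimal sketch using base R's read.dcf(); the temporary file and its trimmed-down contents are stand-ins for the real DESCRIPTION, and version constraints in parentheses are not handled.

```r
# Stand-in for the project's DESCRIPTION file (trimmed to the relevant fields)
desc_path <- tempfile("DESCRIPTION")
writeLines(c(
  "Package: gapminderRR",
  "Version: 0.0.0.9000",
  "Imports:",
  "    dplyr"
), desc_path)

# Read the Imports field and split it into individual package names
desc <- read.dcf(desc_path, fields = "Imports")
imports <- trimws(strsplit(desc[1, "Imports"], ",")[[1]])

# Report which declared dependencies are currently installed
installed <- vapply(imports, requireNamespace, logical(1), quietly = TRUE)
installed
```

In a real project, desc_path would simply point at the DESCRIPTION file in the project root.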
Try running the checks again now.
If this confused you too much:
Sorrrrrryyy!!!!
Check out this excellent blogpost by Rich Fitzjohn on Designing projects, which is a simpler version of what we just tried to do (minus the package functionality).