Tidy data

Hadley Wickham’s paper on Tidy Data

http://www.jstatsoft.org/v59/i10/paper

Principles of tidy data

tidy data structure

One variable per column.
One row per observation.

why?

Tidy datasets are easy to manipulate, model and visualise

While one can do the exact same analyses with tidy and messy datasets/tools, the tidy approach will generally require much less code, and hence be faster to write, easier to debug, and easier to modify/maintain.

Tidy datasets play well with the tidyverse packages!

It is best to record data in tidy format from the start, but the tidyr package provides functions to tidy untidy data.

Install tidyr

tidyr is installed as part of the tidyverse:

install.packages("tidyverse")

If you are asked about installing into a personal library, just type ‘y’ (yes).

library(tidyr)
library(dplyr)

Examples of messy data

Column headers are values, not variable names.
Multiple variables are stored in one column.
Variables are stored in both rows and columns.
Multiple types of observational units are stored in the same row.

Most messy datasets can be tidied with a small set of tools:

gathering, separating and spreading.

1 - wide data

one variable over many columns

In this experiment three (rather unusually named) people were given two different drugs (a and b) and their heart rate was recorded:

messy

##      name  a  b
## 1  Wilbur 67 56
## 2 Petunia 80 90
## 3 Gregory 64 50

How many variables have we got? A sensible model we might want to fit:

heart rate ~ drug

How can we supply the data to a modelling function? (e.g. lm())

We use the function gather() in the tidyr package to reshape the data frame from wide to long format.

gather(data, key, value, ..., na.rm = FALSE, convert = FALSE)

data : a data frame
key : name of the new column that will hold the names of the gathered columns
value : name of the new column that will hold the values of the gathered columns
... : the columns to gather
- use : to select all columns between two columns.
- use - to exclude a column.
- use bare variable names (i.e. quotes are not required).

tidy  <- gather(data = messy, key = drug, value = heartrate, a:b)
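To run this step on its own, the wide data frame can be typed in from the printout above. The following is a minimal sketch assuming tidyr is installed. The pivot_longer() call is an addition not in the original slides: it is the equivalent function in tidyr 1.0.0 and later, where gather() still works but is superseded.

```r
library(tidyr)

# Rebuild the wide data frame from the printout (values copied from the slide)
messy <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory"),
  a = c(67, 80, 64),
  b = c(56, 90, 50)
)

# Stack columns a and b: the old column names go into `drug`,
# the cell values go into `heartrate`
tidy <- gather(data = messy, key = drug, value = heartrate, a:b)

# Equivalent reshaping with pivot_longer() (requires tidyr >= 1.0.0)
tidy_longer <- pivot_longer(messy, cols = a:b,
                            names_to = "drug", values_to = "heartrate")
```

Both results contain the same six observations. pivot_longer() returns a tibble and orders the rows by person rather than by drug.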

tidy

##      name drug heartrate
## 1  Wilbur    a        67
## 2 Petunia    a        80
## 3 Gregory    a        64
## 4  Wilbur    b        56
## 5 Petunia    b        90
## 6 Gregory    b        50

heart rate ~ drug
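With one column per variable, the model formula can refer to the columns by name. A minimal sketch using base R only; the data frame is typed in from the tidy printout above, and lm() is used simply because the slides name it as an example modelling function.

```r
# The tidy data frame, typed in from the printout
tidy <- data.frame(
  name = rep(c("Wilbur", "Petunia", "Gregory"), times = 2),
  drug = rep(c("a", "b"), each = 3),
  heartrate = c(67, 80, 64, 56, 90, 50)
)

# heartrate and drug are now columns, so the formula names them directly
fit <- lm(heartrate ~ drug, data = tidy)
coef(fit)
```

The drugb coefficient is the difference between the mean heart rate under drug b and under drug a, which is -5 for these values.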

2 - clumped data

more than one variable in a single column

In this study, parasite counts were made on samples taken from 3 elephant faecal boluses, at two locations within each bolus (centre and outer).

head(messy)

##   id   spl.id counts
## 1  1 1-centre      2
## 2  1  1-outer      1
## 3  1 2-centre      3
## 4  1  2-outer      1
## 5  1 3-centre      2
## 6  1  3-outer      1

We use separate() to split spl.id into bolus and location, using a regular expression to describe the character that separates them.

separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ...)

data : a data frame
col : name of the column to split
into : names of the new variables to create (as a character vector)
sep : separator between the new columns
- a character string is interpreted as a regular expression; the default matches any sequence of non-alphanumeric characters
- a numeric value is interpreted as the position(s) at which to split
- a numeric sep should have length one less than into

tidy <- separate(data = messy, into = c("bolus", "location"), col = spl.id, sep= "-")
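A self-contained version of this call, as a sketch: only the six rows shown by head(messy) are typed in, so the result covers id 1 only. convert = TRUE is an optional addition, not used in the slide, that turns the new bolus column into an integer instead of leaving it as text.

```r
library(tidyr)

# First six rows of the messy data, copied from the head(messy) printout
messy <- data.frame(
  id = rep(1, 6),
  spl.id = c("1-centre", "1-outer", "2-centre",
             "2-outer", "3-centre", "3-outer"),
  counts = c(2, 1, 3, 1, 2, 1),
  stringsAsFactors = FALSE
)

# Split spl.id at the hyphen into two new columns;
# convert = TRUE stores bolus as an integer, location stays character
tidy <- separate(data = messy, col = spl.id,
                 into = c("bolus", "location"), sep = "-", convert = TRUE)
str(tidy)
```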

tidy

##    id bolus location counts
## 1   1     1   centre      2
## 2   1     1    outer      1
## 3   1     2   centre      3
## 4   1     2    outer      1
## 5   1     3   centre      2
## 6   1     3    outer      1
## 7   2     1   centre      1
## 8   2     1    outer      5
## 9   2     2   centre      1
## 10  2     2    outer      3
## 11  2     3   centre      3
## 12  2     3    outer      2
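The argument notes for separate() mention that a numeric sep gives a split position instead of a pattern. A small sketch of that case, using made-up labels purely for illustration (they are not part of the elephant data):

```r
library(tidyr)

# Made-up labels, for illustration only
labels <- data.frame(code = c("T1", "T2"), stringsAsFactors = FALSE)

# sep = 1 splits after the first character;
# `into` names two columns, so `sep` has one value (one less than `into`)
split_labels <- separate(labels, col = code,
                         into = c("prefix", "number"), sep = 1)
split_labels
```

Both new columns are character vectors here because convert is left at its default of FALSE.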

3 - Combination of messy data

In this study, the time people spent on their phones was measured at two locations (work and home) and at two time points.

messy

##   id       trt   work.T1   home.T1   work.T2    home.T2
## 1  1 treatment 0.5956623 0.4228583 0.2716590 0.08547135
## 2  2   control 0.4743013 0.1499061 0.5287771 0.91268492
## 3  3 treatment 0.4792561 0.7288986 0.4275077 0.22465316
## 4  4   control 0.7397282 0.7099326 0.2761709 0.74067841

First we use gather() to stack the columns work.T1, home.T1, work.T2 and home.T2 into a key-value pair of columns named key and time.

tidier <- gather(data = messy, key = key, value = time, -id, -trt)
tidier

##    id       trt     key       time
## 1   1 treatment work.T1 0.59566231
## 2   2   control work.T1 0.47430135
## 3   3 treatment work.T1 0.47925610
## 4   4   control work.T1 0.73972824
## 5   1 treatment home.T1 0.42285826
## 6   2   control home.T1 0.14990608
## 7   3 treatment home.T1 0.72889865
## 8   4   control home.T1 0.70993262
## 9   1 treatment work.T2 0.27165902
## 10  2   control work.T2 0.52877712
## 11  3 treatment work.T2 0.42750768
## 12  4   control work.T2 0.27617091
## 13  1 treatment home.T2 0.08547135
## 14  2   control home.T2 0.91268492
## 15  3 treatment home.T2 0.22465316
## 16  4   control home.T2 0.74067841

Next we use separate() to split key into location and timepoint, using a regular expression to describe the character that separates them.

tidy <- separate(data = tidier, col = key, into = c("location", "timepoint"), sep = "\\.") 

tidy

##    id       trt location timepoint       time
## 1   1 treatment     work        T1 0.59566231
## 2   2   control     work        T1 0.47430135
## 3   3 treatment     work        T1 0.47925610
## 4   4   control     work        T1 0.73972824
## 5   1 treatment     home        T1 0.42285826
## 6   2   control     home        T1 0.14990608
## 7   3 treatment     home        T1 0.72889865
## 8   4   control     home        T1 0.70993262
## 9   1 treatment     work        T2 0.27165902
## 10  2   control     work        T2 0.52877712
## 11  3 treatment     work        T2 0.42750768
## 12  4   control     work        T2 0.27617091
## 13  1 treatment     home        T2 0.08547135
## 14  2   control     home        T2 0.91268492
## 15  3 treatment     home        T2 0.22465316
## 16  4   control     home        T2 0.74067841
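The two steps can also be chained into one pipeline. The sketch below assumes tidyr and dplyr are installed and types the data in from the printout, so the values are rounded as displayed. The final spread() call is an addition not shown in the slides: it illustrates the third tool named earlier (spreading) by moving the two locations back into side-by-side columns.

```r
library(tidyr)
library(dplyr)  # provides the %>% pipe

# Wide data typed in from the printout (values as displayed on the slide)
messy <- data.frame(
  id = 1:4,
  trt = c("treatment", "control", "treatment", "control"),
  work.T1 = c(0.5956623, 0.4743013, 0.4792561, 0.7397282),
  home.T1 = c(0.4228583, 0.1499061, 0.7288986, 0.7099326),
  work.T2 = c(0.2716590, 0.5287771, 0.4275077, 0.2761709),
  home.T2 = c(0.08547135, 0.91268492, 0.22465316, 0.74067841)
)

# Gather every column except id and trt, then split the key at the dot
tidy <- messy %>%
  gather(key = key, value = time, -id, -trt) %>%
  separate(col = key, into = c("location", "timepoint"), sep = "\\.")

# Spreading: one column per location, one row per person and timepoint
wide_by_location <- spread(tidy, key = location, value = time)
```

spread() is the inverse of gather(). In tidyr 1.0.0 and later, pivot_wider() covers the same job.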

exercise: tidy 3 example datasets

Instructions & data are in this OSF project

best to get organised

create a new project (File > New Project > New Directory > New Project > tidydata)
create a folder named data/ in your project
- get the data from the OSF project
create a new script, e.g. tidy.R
tidy the data!

RStudio Data wrangling cheat sheet

ACCE Research Data and Project Management

01-02 May 2018, TUoS
