Hadley Wickham’s paper on Tidy Data
http://www.jstatsoft.org/v59/i10/paper
While one can do the exact same analyses with tidy and messy datasets/tools, the tidy approach will generally require much less code, and hence be faster to write, easier to debug, and easier to modify/maintain.
It is best to record data in tidy format, but the tidyr package provides functions to tidy untidy data.
install.packages("tidyverse")
If you are asked about storing to a personal library, just type ‘y’ (yes).
library(tidyr)
library(dplyr)
In this experiment three (rather unusually named) people were given two different drugs (a and b) and their heart rate was recorded:
messy
## name a b
## 1 Wilbur 67 56
## 2 Petunia 80 90
## 3 Gregory 64 50
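To follow along, the data frame printed above can be retyped by hand. This is a minimal sketch: the values come from the printed output, but how the original `messy` object was created is not stated, so the construction below is an assumption.

```r
# Recreate the example data from the printed output above
# (typed in by hand; the original way `messy` was built is an assumption)
messy <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory"),
  a = c(67, 80, 64),
  b = c(56, 90, 50)
)

# One row per person, one column per drug: the "wide" (messy) layout
messy
```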
How many variables have we got? A sensible model we might want to fit:
heart rate ~ drug
How can we supply the data to a modelling function (e.g. lm())?
We use the function gather() in the tidyr package to reshape the data frame from wide to long format:
gather(data, key, value, ..., na.rm = FALSE, convert = FALSE)
data: a data frame
key: name for the identifier of the columns to gather
value: name for the new variable being created
...: select the columns to be stacked; use : to select all variables between two columns.
tidy <- gather(data = messy, key = drug, value = heartrate, a:b)
tidy
## name drug heartrate
## 1 Wilbur a 67
## 2 Petunia a 80
## 3 Gregory a 64
## 4 Wilbur b 56
## 5 Petunia b 90
## 6 Gregory b 50
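In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(), although gather() still works. As a hedged sketch, the same reshaping might be written with the newer function as follows. The data frame is retyped from the printed output so the example is self-contained, and `tidy_long` is just an illustrative name.

```r
library(tidyr)

# Example data retyped from the printed output (an assumption about how it was built)
messy <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory"),
  a = c(67, 80, 64),
  b = c(56, 90, 50)
)

# names_to plays the role of `key`, values_to the role of `value`
tidy_long <- pivot_longer(messy, cols = a:b,
                          names_to = "drug", values_to = "heartrate")

# Rows come out person by person rather than drug by drug,
# but the contents match the gather() result
tidy_long
```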
heart rate ~ drug
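With the data in long format, the formula maps directly onto column names. The sketch below fits the model exactly as written, only to show how the data are supplied to lm(); it ignores the fact that each person received both drugs, and the data frame is retyped from the printed output.

```r
library(tidyr)

# Example data retyped from the printed output (an assumption about how it was built)
messy <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory"),
  a = c(67, 80, 64),
  b = c(56, 90, 50)
)
tidy <- gather(data = messy, key = drug, value = heartrate, a:b)

# One column per variable, so the formula can refer to them by name
fit <- lm(heartrate ~ drug, data = tidy)
coef(fit)
```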
In this study, parasite counts were made on samples taken from two locations (centre and outer) within each of three elephant faecal boluses.
head(messy)
## id spl.id counts
## 1 1 1-centre 2
## 2 1 1-outer 1
## 3 1 2-centre 3
## 4 1 2-outer 1
## 5 1 3-centre 2
## 6 1 3-outer 1
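head() shows only the first six rows. The sketch below rebuilds the data set from the values that appear in the printed output in this section, assuming that the twelve rows shown there are the whole data set.

```r
# Rebuild the parasite data from the printed values
# (assumes the data set has exactly the twelve rows shown in this section)
messy <- data.frame(
  id = rep(1:2, each = 6),
  spl.id = rep(c("1-centre", "1-outer", "2-centre",
                 "2-outer", "3-centre", "3-outer"), times = 2),
  counts = c(2, 1, 3, 1, 2, 1, 1, 5, 1, 3, 3, 2)
)

# spl.id packs two variables (bolus and location) into one column
head(messy)
```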
We use separate() to split spl.id into bolus and location, using a regular expression to describe the character that separates them.
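separate(data, col, into, sep, remove = TRUE, convert = FALSE, extra = "error", ...)
data: a data frame
col: name of the column to split
into: names for the new variables being created (as a character vector)
sep: separator between the new columns; a character value is treated as a regular expression (the default matches any run of non-alphanumeric characters), while a numeric value gives the positions to split at
If sep is numeric, its length should be one less than the length of into.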
tidy <- separate(data = messy, into = c("bolus", "location"), col = spl.id, sep= "-")
tidy
## id bolus location counts
## 1 1 1 centre 2
## 2 1 1 outer 1
## 3 1 2 centre 3
## 4 1 2 outer 1
## 5 1 3 centre 2
## 6 1 3 outer 1
## 7 2 1 centre 1
## 8 2 1 outer 5
## 9 2 2 centre 1
## 10 2 2 outer 3
## 11 2 3 centre 3
## 12 2 3 outer 2
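By default separate() leaves the new columns as character vectors, so bolus above is text even though it looks numeric. The convert argument from the signature can change that. A minimal sketch, using a few rows in the same shape as the example data (the object names `as_text` and `converted` are illustrative):

```r
library(tidyr)

# A few rows in the same shape as the example data
messy <- data.frame(
  id = c(1, 1, 1, 1),
  spl.id = c("1-centre", "1-outer", "2-centre", "2-outer"),
  counts = c(2, 1, 3, 1)
)

# Default (convert = FALSE): both new columns are character
as_text <- separate(messy, col = spl.id,
                    into = c("bolus", "location"), sep = "-")
class(as_text$bolus)

# convert = TRUE: bolus becomes integer, location stays character
converted <- separate(messy, col = spl.id,
                      into = c("bolus", "location"), sep = "-",
                      convert = TRUE)
class(converted$bolus)
```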
In this study, the time people spent on their phones was measured at two locations (work and home) and at two time points.
messy
## id trt work.T1 home.T1 work.T2 home.T2
## 1 1 treatment 0.5956623 0.4228583 0.2716590 0.08547135
## 2 2 control 0.4743013 0.1499061 0.5287771 0.91268492
## 3 3 treatment 0.4792561 0.7288986 0.4275077 0.22465316
## 4 4 control 0.7397282 0.7099326 0.2761709 0.74067841
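To reproduce the steps that follow, the data frame can be retyped from the printed output. This is a sketch under the assumption that the printed values are close enough: they are rounded for display, so later output may differ from the original in the last digit.

```r
# Retype the phone-use data from the printed output
# (values are rounded as displayed, so they only approximate the original data)
messy <- data.frame(
  id = 1:4,
  trt = c("treatment", "control", "treatment", "control"),
  work.T1 = c(0.5956623, 0.4743013, 0.4792561, 0.7397282),
  home.T1 = c(0.4228583, 0.1499061, 0.7288986, 0.7099326),
  work.T2 = c(0.2716590, 0.5287771, 0.4275077, 0.2761709),
  home.T2 = c(0.08547135, 0.91268492, 0.22465316, 0.74067841)
)

# Each measurement column name encodes two variables: location and time point
messy
```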
First we use gather() to turn the columns work.T1, home.T1, work.T2 and home.T2 into a key-value pair of key and time.
tidier <- gather(data = messy, key = key, value = time, -id, -trt)
tidier
## id trt key time
## 1 1 treatment work.T1 0.59566231
## 2 2 control work.T1 0.47430135
## 3 3 treatment work.T1 0.47925610
## 4 4 control work.T1 0.73972824
## 5 1 treatment home.T1 0.42285826
## 6 2 control home.T1 0.14990608
## 7 3 treatment home.T1 0.72889865
## 8 4 control home.T1 0.70993262
## 9 1 treatment work.T2 0.27165902
## 10 2 control work.T2 0.52877712
## 11 3 treatment work.T2 0.42750768
## 12 4 control work.T2 0.27617091
## 13 1 treatment home.T2 0.08547135
## 14 2 control home.T2 0.91268492
## 15 3 treatment home.T2 0.22465316
## 16 4 control home.T2 0.74067841
Next we use separate() to split the key into location and timepoint, using a regular expression to describe the character that separates them.
tidy <- separate(data = tidier, col = key, into = c("location", "timepoint"), sep = "\\.")
tidy
## id trt location timepoint time
## 1 1 treatment work T1 0.59566231
## 2 2 control work T1 0.47430135
## 3 3 treatment work T1 0.47925610
## 4 4 control work T1 0.73972824
## 5 1 treatment home T1 0.42285826
## 6 2 control home T1 0.14990608
## 7 3 treatment home T1 0.72889865
## 8 4 control home T1 0.70993262
## 9 1 treatment work T2 0.27165902
## 10 2 control work T2 0.52877712
## 11 3 treatment work T2 0.42750768
## 12 4 control work T2 0.27617091
## 13 1 treatment home T2 0.08547135
## 14 2 control home T2 0.91268492
## 15 3 treatment home T2 0.22465316
## 16 4 control home T2 0.74067841
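In tidyr 1.0.0 and later, the gather() and separate() steps can be combined, because pivot_longer() can split column names while reshaping. The sketch below is one possible one-step version, not the approach used above. The data are retyped from the printed output (rounded values), and `tidy_one_step` is an illustrative name.

```r
library(tidyr)

# Phone-use data retyped from the printed output (rounded values)
messy <- data.frame(
  id = 1:4,
  trt = c("treatment", "control", "treatment", "control"),
  work.T1 = c(0.5956623, 0.4743013, 0.4792561, 0.7397282),
  home.T1 = c(0.4228583, 0.1499061, 0.7288986, 0.7099326),
  work.T2 = c(0.2716590, 0.5287771, 0.4275077, 0.2761709),
  home.T2 = c(0.08547135, 0.91268492, 0.22465316, 0.74067841)
)

# Reshape every column except id and trt, splitting each column name
# at the literal dot into location and timepoint
tidy_one_step <- pivot_longer(
  messy,
  cols = -c(id, trt),
  names_to = c("location", "timepoint"),
  names_sep = "\\.",
  values_to = "time"
)

# Same columns as the two-step result; rows are ordered by id
# instead of by original column
tidy_one_step
```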
(File > New Project > New Directory > New Project > tidydata)
data/ in your project
tidy.R
RStudio Data wrangling cheat sheet