Basic Data Hygiene

Plan your Research Data Management

  • Start early. Make an RDM plan before collecting data.
  • Anticipate data products as part of your thesis outputs.
  • Think about which technologies to use.

Take initiative & responsibility. Think long term.



Data entry

An extreme, but in many ways defensible, position:


Excel: read only


Databases: more robust

  • good QC; advisable when multiple contributors enter data

Databases: benefits


Have a look at the Data Carpentry SQL for Ecology lesson.
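
To give a flavour of querying a database from R (a minimal sketch, not part of the lesson above; the database file and table names are hypothetical), the DBI and RSQLite packages can be used together:

# install.packages(c("DBI", "RSQLite"))
library(DBI)

# Connect to a local SQLite database (hypothetical file)
con <- dbConnect(RSQLite::SQLite(), "surveys.sqlite")

# Pull only the rows needed, rather than loading the whole table
surveys_2019 <- dbGetQuery(con, "SELECT * FROM surveys WHERE year = 2019")

dbDisconnect(con)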


Data formats

  • .csv: comma separated values.
  • .tsv: tab separated values.
  • .txt: no formatting specified.

More unusual formats will need instructions for use.
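
As a minimal illustration (the output file names are arbitrary), a data frame can be written to either of the first two formats from base R:

# Comma separated values
write.csv(mtcars, "mtcars.csv", row.names = FALSE)

# Tab separated values
write.table(mtcars, "mtcars.tsv", sep = "\t", row.names = FALSE, quote = FALSE)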


Ensure data is machine readable

(Example spreadsheets shown as slide images: two marked bad, one marked good, and one marked ok.)

ok:

  • could help with data entry
  • a .csv or .tsv copy would still need to be saved

Use good null values

Missing values are a fact of life.

  • Usually, the best solution is to leave the cell blank.
  • NA or NULL are also good options.
  • NEVER use 0. Avoid numbers like -999.
  • Don’t make up your own code for missing values.
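
A minimal sketch of why this matters: a made-up numeric code silently leaks into summary statistics, whereas a blank cell (read in as NA) forces missingness to be handled explicitly.

x_coded <- c(12.1, 15.3, -999, 14.8)  # -999 intended to mean "missing"
x_na    <- c(12.1, 15.3, NA, 14.8)    # blank cell read in as NA

mean(x_coded)             # -239.2: the made-up code corrupts the result
mean(x_na)                # NA: missingness is flagged, not hidden
mean(x_na, na.rm = TRUE)  # 14.07: handled explicitly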

read.csv() utilities

  • na.strings: character vector of values to be interpreted as missing and replaced with NA.
  • strip.white: logical; if TRUE, strips leading and trailing white space from unquoted character fields.
  • blank.lines.skip: logical; if TRUE, blank lines in the input are ignored.
  • fileEncoding: if you’re getting funny characters, you probably need to specify the correct encoding (encoding names vary by platform; see iconvlist()).
read.csv(file, na.strings = c("NA", "-999"), strip.white = TRUE, 
         blank.lines.skip = TRUE, fileEncoding = "latin1")

readr::read_csv() utilities

  • na: character vector of values to be interpreted as missing and replaced with NA.
  • trim_ws: logical; if TRUE, trims leading and trailing white space from unquoted character fields.
  • col_types: allows column data types to be specified explicitly.
  • locale: controls things like the default time zone, encoding, decimal mark, big mark, and day/month names.
  • skip: number of lines to skip before reading data.
  • n_max: maximum number of records to read.
library(readr)
read_csv(file, col_names = TRUE, col_types = NULL, locale = default_locale(), 
         na = c("", "NA", "-999"), trim_ws = TRUE, skip = 0, n_max = Inf)

Basic quality control

Have a look at your data with View(df).


  • Check for empty cells.
  • Check that the range of values (and value types) in each column matches expectations. Use summary(df).
  • Check that units of measurement are what you expect.
  • Check that your software interprets your data correctly, e.g.
    for a data frame df:
    • see the top few rows with head(df)
    • see the structure of the data frame with str(df)
  • Consider writing some simple QA tests (e.g. checks against the number of dimensions, the sum of numeric columns, etc.), as sketched below.
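
A minimal sketch of such QA tests in base R, using stopifnot() (the expected values here are illustrative):

df <- mtcars

stopifnot(
  ncol(df) == 11,                       # expected number of columns
  nrow(df) >= 32,                       # expected minimum number of rows
  all(df$mpg > 0),                      # mpg values should all be positive
  isTRUE(all.equal(sum(df$cyl), 198))   # checksum on a numeric column
)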

pkg skimr

skimr provides a frictionless approach to displaying summary statistics that the user can skim quickly to understand their data.

install.packages("skimr")
library(skimr)
skim(mtcars)
## Skim summary statistics
##  n obs: 32 
##  n variables: 11 
## 
## Variable type: numeric 
##  variable missing complete  n   mean     sd    p0    p25 median    p75
##        am       0       32 32   0.41   0.5   0      0      0      1   
##      carb       0       32 32   2.81   1.62  1      2      2      4   
##       cyl       0       32 32   6.19   1.79  4      4      6      8   
##      disp       0       32 32 230.72 123.94 71.1  120.83 196.3  326   
##      drat       0       32 32   3.6    0.53  2.76   3.08   3.7    3.92
##      gear       0       32 32   3.69   0.74  3      3      4      4   
##        hp       0       32 32 146.69  68.56 52     96.5  123    180   
##       mpg       0       32 32  20.09   6.03 10.4   15.43  19.2   22.8 
##      qsec       0       32 32  17.85   1.79 14.5   16.89  17.71  18.9 
##        vs       0       32 32   0.44   0.5   0      0      0      1   
##        wt       0       32 32   3.22   0.98  1.51   2.58   3.33   3.61
##    p100     hist
##    1    ▇▁▁▁▁▁▁▆
##    8    ▆▇▂▇▁▁▁▁
##    8    ▆▁▁▃▁▁▁▇
##  472    ▇▆▁▂▅▃▁▂
##    4.93 ▃▇▁▅▇▂▁▁
##    5    ▇▁▁▆▁▁▁▂
##  335    ▃▇▃▅▂▃▁▁
##   33.9  ▃▇▇▇▃▂▂▂
##   22.9  ▃▂▇▆▃▃▁▁
##    1    ▇▁▁▁▁▁▁▆
##    5.42 ▃▃▃▇▆▁▁▂

pkg assertr

The assertr package supplies a suite of functions designed to verify assumptions about data early in an analysis pipeline so that data errors are spotted early and can be addressed quickly.

install.packages("assertr")

e.g. confirm that mtcars:

  • has the columns “mpg”, “vs”, “am”, and “wt”
  • contains more than 10 observations
  • has only positive values in the ‘miles per gallon’ (mpg) column

before further analysis:

library(dplyr)
library(assertr)
mtcars %>%
    verify(has_all_names("mpg", "vs", "am", "wt")) %>%
    verify(nrow(.) > 10) %>%
    verify(mpg > 0)
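
If every check passes, the data frame is returned unchanged, so the pipeline can continue; if any check fails, assertr stops with an informative error before bad data propagates into the analysis.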

Raw data are sacrosanct


Give yourself less rope

  • It’s a good idea to revoke your own write permission to the raw data file (see the sketch below).

  • Then you can’t accidentally edit it.

  • It also makes it harder to do manual edits in a moment of weakness, when you know you should just add a line to your data cleaning script.
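
A minimal sketch using base R’s Sys.chmod() (the file path is hypothetical; mode "0444" makes the file read-only for everyone):

# Revoke write permission on the raw data file
Sys.chmod("data/raw/field_survey.csv", mode = "0444")

# Manual edits and accidental overwrites now fail until permission is restored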


Know your masters

  • identify the master copy of files
  • keep it safe and accessible
  • consider version control
  • consider centralising



Avoid catastrophe

Backup: on disk

Backup: in the cloud

  • Dropbox, Google Drive, etc.
  • if installed on your system, they can be accessed programmatically through R (see the sketch below)
  • offer some version control
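
For example, the googledrive package is one way to talk to Google Drive from an R session (a minimal sketch; the file name is hypothetical and the call prompts for authentication):

# install.packages("googledrive")
library(googledrive)

# Upload a local file to your Google Drive
drive_upload(media = "data/clean_data.csv", name = "clean_data.csv")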

Backup: the Open Science Framework osf.io

  • version controlled
  • easily shareable
  • works with other apps (eg googledrive, github)
  • work on an R interface (the osfr package) is in progress; see https://github.com/ropensci/osfr

Backup: GitHub

  • the most robust version control option.

  • keep everything in one project folder.

  • can be problematic with really large files.