Basic Data Hygiene

Plan your Research Data Management

  • Start early. Make an RDM plan before collecting data.
  • Anticipate data products as part of your thesis outputs.
  • Think about which technologies to use.

Take initiative & responsibility. Think long term.



Data entry

An extreme, but in many ways defensible, position:


Excel: read only


Databases: more robust

  • good QC; advisable when multiple contributors enter data

Databases: benefits


Have a look at the Data Carpentry SQL for Ecology lesson.
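
To give a flavour of querying a database from R (a minimal sketch, not part of the lesson above; the database file and table names are hypothetical), the DBI and RSQLite packages can be used together:

# install.packages(c("DBI", "RSQLite"))
library(DBI)

# Connect to a local SQLite database (hypothetical file)
con <- dbConnect(RSQLite::SQLite(), "surveys.sqlite")

# Pull only the rows needed, rather than loading the whole table
surveys_2019 <- dbGetQuery(con, "SELECT * FROM surveys WHERE year = 2019")

dbDisconnect(con)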


Data formats

  • .csv: comma separated values.
  • .tsv: tab separated values.
  • .txt: no formatting specified.

More unusual formats will need instructions for use.
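
As a minimal illustration (the output file names are arbitrary), a data frame can be written to either of the first two formats from base R:

# Comma separated values
write.csv(mtcars, "mtcars.csv", row.names = FALSE)

# Tab separated values
write.table(mtcars, "mtcars.tsv", sep = "\t", row.names = FALSE, quote = FALSE)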


Ensure data is machine readable

(Example spreadsheets shown as slide images: two marked bad, one marked good, and one marked ok.)

ok:

  • could help with data entry
  • a .csv or .tsv copy would still need to be saved

Use good null values

Missing values are a fact of life.

  • Usually, the best solution is to leave the cell blank.
  • NA or NULL are also good options.
  • NEVER use 0. Avoid numbers like -999.
  • Don’t make up your own code for missing values.
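
A minimal sketch of why this matters: a made-up numeric code silently leaks into summary statistics, whereas a blank cell (read in as NA) forces missingness to be handled explicitly.

x_coded <- c(12.1, 15.3, -999, 14.8)  # -999 intended to mean "missing"
x_na    <- c(12.1, 15.3, NA, 14.8)    # blank cell read in as NA

mean(x_coded)             # -239.2: the made-up code corrupts the result
mean(x_na)                # NA: missingness is flagged, not hidden
mean(x_na, na.rm = TRUE)  # 14.07: handled explicitly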

read.csv() utilities

  • na.strings: character vector of values to be interpreted as missing and replaced with NA.
  • strip.white: logical; if TRUE, strips leading and trailing white space from unquoted character fields.
  • blank.lines.skip: logical; if TRUE, blank lines in the input are ignored.
  • fileEncoding: if you’re getting funny characters, you probably need to specify the correct encoding (encoding names vary by platform; see iconvlist()).
read.csv(file, na.strings = c("NA", "-999"), strip.white = TRUE, 
         blank.lines.skip = TRUE, fileEncoding = "latin1")

readr::read_csv() utilities

  • na: character vector of values to be interpreted as missing and replaced with NA.
  • trim_ws: logical; if TRUE, trims leading and trailing white space from unquoted character fields.
  • col_types: allows column data types to be specified explicitly.
  • locale: controls things like the default time zone, encoding, decimal mark, big mark, and day/month names.
  • skip: number of lines to skip before reading data.
  • n_max: maximum number of records to read.
library(readr)
read_csv(file, col_names = TRUE, col_types = NULL, locale = default_locale(), 
         na = c("", "NA", "-999"), trim_ws = TRUE, skip = 0, n_max = Inf)

Basic quality control

Have a look at your data with View(df).


  • Check for empty cells.
  • Check that the range of values (and value types) in each column matches expectations. Use summary(df).
  • Check that units of measurement are what you expect.
  • Check that your software interprets your data correctly, e.g.
    for a data frame df:
    • see the top few rows with head(df)
    • see the structure of the data frame with str(df)
  • Consider writing some simple QA tests (e.g. checks against the number of dimensions, the sum of numeric columns, etc.), as sketched below.
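
A minimal sketch of such QA tests in base R, using stopifnot() (the expected values here are illustrative):

df <- mtcars

stopifnot(
  ncol(df) == 11,                       # expected number of columns
  nrow(df) >= 32,                       # expected minimum number of rows
  all(df$mpg > 0),                      # mpg values should all be positive
  isTRUE(all.equal(sum(df$cyl), 198))   # checksum on a numeric column
)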

pkg skimr

skimr provides a frictionless approach to displaying summary statistics that the user can skim quickly to understand their data.

install.packages("skimr")
library(skimr)
skim(mtcars)
## Skim summary statistics
##  n obs: 32 
##  n variables: 11 
## 
## Variable type: numeric 
##  variable missing complete  n   mean     sd    p0    p25 median    p75
##        am       0       32 32   0.41   0.5   0      0      0      1   
##      carb       0       32 32   2.81   1.62  1      2      2      4   
##       cyl       0       32 32   6.19   1.79  4      4      6      8   
##      disp       0       32 32 230.72 123.94 71.1  120.83 196.3  326   
##      drat       0       32 32   3.6    0.53  2.76   3.08   3.7    3.92
##      gear       0       32 32   3.69   0.74  3      3      4      4   
##        hp       0       32 32 146.69  68.56 52     96.5  123    180   
##       mpg       0       32 32  20.09   6.03 10.4   15.43  19.2   22.8 
##      qsec       0       32 32  17.85   1.79 14.5   16.89  17.71  18.9 
##        vs       0       32 32   0.44   0.5   0      0      0      1   
##        wt       0       32 32   3.22   0.98  1.51   2.58   3.33   3.61
##    p100     hist
##    1    ▇▁▁▁▁▁▁▆
##    8    ▆▇▂▇▁▁▁▁
##    8    ▆▁▁▃▁▁▁▇
##  472    ▇▆▁▂▅▃▁▂
##    4.93 ▃▇▁▅▇▂▁▁
##    5    ▇▁▁▆▁▁▁▂
##  335    ▃▇▃▅▂▃▁▁
##   33.9  ▃▇▇▇▃▂▂▂
##   22.9  ▃▂▇▆▃▃▁▁
##    1    ▇▁▁▁▁▁▁▆
##    5.42 ▃▃▃▇▆▁▁▂

pkg assertr

The assertr package supplies a suite of functions designed to verify assumptions about data early in an analysis pipeline so that data errors are spotted early and can be addressed quickly.

install.packages("assertr")

e.g. confirm that mtcars:

  • has the columns “mpg”, “vs”, “am”, and “wt”
  • contains more than 10 observations
  • has only positive values in the ‘miles per gallon’ (mpg) column

before further analysis:

library(dplyr)
library(assertr)
mtcars %>%
    verify(has_all_names("mpg", "vs", "am", "wt")) %>%
    verify(nrow(.) > 10) %>%
    verify(mpg > 0)
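
If every check passes, the data frame is returned unchanged, so the pipeline can continue; if any check fails, assertr stops with an informative error before bad data propagates into the analysis.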

Raw data are sacrosanct


Give yourself less rope

  • It’s a good idea to revoke your own write permission to the raw data file (see the sketch below).

  • Then you can’t accidentally edit it.

  • It also makes it harder to do manual edits in a moment of weakness, when you know you should just add a line to your data cleaning script.
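
A minimal sketch using base R’s Sys.chmod() (the file path is hypothetical; mode "0444" makes the file read-only for everyone):

# Revoke write permission on the raw data file
Sys.chmod("data/raw/field_survey.csv", mode = "0444")

# Manual edits and accidental overwrites now fail until permission is restored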


Know your masters

  • identify the master copy of files
  • keep it safe and accessible
  • consider version control
  • consider centralising



Avoid catastrophe

Backup: on disk

Backup: in the cloud

  • Dropbox, Google Drive, etc.
  • if installed on your system, they can be accessed programmatically through R (see the sketch below)
  • offer some version control
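
For example, the googledrive package is one way to talk to Google Drive from an R session (a minimal sketch; the file name is hypothetical and the call prompts for authentication):

# install.packages("googledrive")
library(googledrive)

# Upload a local file to your Google Drive
drive_upload(media = "data/clean_data.csv", name = "clean_data.csv")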

Backup: the Open Science Framework osf.io

  • version controlled
  • easily shareable
  • works with other apps (eg googledrive, github)
  • work on an R interface (the osfr package) is in progress; see https://github.com/ropensci/osfr

Backup: GitHub

  • the most robust version control option.

  • keep everything in one project folder.

  • can be problematic with really large files.