Act as though every short term study will become a long term one @tomjwebb. Needs to be reproducible in 3, 20, 100 yrs
— Oceans Initiative (@oceansresearch) January 16, 2015
@tomjwebb stay away from excel at all costs?
— Timothée Poisot (@tpoi) January 16, 2015
read only
@tomjwebb @tpoi excel is fine for data entry. Just save in plain text format like csv. Some additional tips: pic.twitter.com/8fUv9PyVjC
— Jaime Ashander (@jaimedash) January 16, 2015
@jaimedash just don’t let excel anywhere near dates or times. @tomjwebb @tpoi @larysar
— Dave Harris (@davidjayharris) January 16, 2015
@tomjwebb databases? @swcarpentry has a good course on SQLite
— Timothée Poisot (@tpoi) January 16, 2015
@tomjwebb @tpoi if the data are moderately complex, or involve multiple people, best to set up a database with well designed entry form 1/2
— Luca Borger (@lucaborger) January 16, 2015
@tomjwebb Entering via a database management system (e.g., Access, Filemaker) can make entry easier & help prevent data entry errors @tpoi
— Ethan White (@ethanwhite) January 16, 2015
@tomjwebb it also prevents a lot of different bad practices. It is possible to do some of this in Excel. @tpoi
— Ethan White (@ethanwhite) January 16, 2015
@ethanwhite +1 Enforcing data types, options from selection etc, just some useful things a DB gives you, if you turn them on @tomjwebb @tpoi
— Gavin Simpson (@ucfagls) January 16, 2015
Have a look at the Data Carpentry SQL for Ecology lesson
.csv: comma separated values
.tsv: tab separated values
.txt: no formatting specified

.@tomjwebb It has to be interoperability/openness - can I read your data with whatever I use, without having to convert it?
— Paul Swaddle (@paul_swaddle) January 16, 2015
A .csv or .tsv copy would need to be saved. For coding missing values, NA or NULL are also good options. Avoid numbers like -999 or 0.
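A small base R sketch of why sentinel codes like -999 are dangerous if they slip into an analysis unnoticed:

```r
# -999 used as a missing-value code silently poisons summaries
x <- c(1.2, 3.4, -999, 2.1)
mean(x)                  # -248.075: nonsense driven by the sentinel

# recode the sentinel as NA, then summarise ignoring missing values
x[x == -999] <- NA
mean(x, na.rm = TRUE)    # 2.233333
```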
read.csv() utilities:

na.strings: character vector of values to be coded missing and replaced with NA
strip.white: logical; if TRUE, strips leading and trailing white space from unquoted character fields
blank.lines.skip: logical; if TRUE, blank lines in the input are ignored
fileEncoding: if you’re getting funny characters, you probably need to specify the correct encoding

read.csv(file, na.strings = c("NA", "-999"), strip.white = TRUE,
         blank.lines.skip = TRUE, fileEncoding = "mac")
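A self-contained sketch of these options in action, using a throwaway file (the species/count columns are made up for illustration):

```r
# write a messy toy csv: stray whitespace, a -999 sentinel, a blank line
tmp <- tempfile(fileext = ".csv")
writeLines(c("species,count",
             "gull ,12",
             "",
             "tern,-999"), tmp)

df <- read.csv(tmp, na.strings = c("NA", "-999"),
               strip.white = TRUE, blank.lines.skip = TRUE)
df
#   species count
# 1    gull    12
# 2    tern    NA
```

The trailing space after "gull" is stripped, the blank line is skipped, and -999 arrives as NA rather than as a number that could silently distort later calculations.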
readr::read_csv() utilities:

na: character vector of values to be coded missing and replaced with NA
trim_ws: logical; if TRUE, strips leading and trailing white space from unquoted character fields
col_types: allows for column data type specification (see more)
locale: controls things like the default time zone, encoding, decimal mark, big mark, and day/month names
skip: number of lines to skip before reading data
n_max: maximum number of records to read

read_csv(file, col_names = TRUE, col_types = NULL, locale = default_locale(),
         na = c("", "NA", "-999"), trim_ws = TRUE, skip = 0, n_max = Inf)
Some quick ways to inspect data after reading it in:

View(df): open the data in the viewer
summary(df): per-column summary statistics
df: print the object
head(df): print the first few rows
str(df): inspect the structure of the object
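For instance, a first pass over the built-in mtcars data:

```r
df <- mtcars
dim(df)      # 32 rows, 11 columns
str(df)      # all columns are numeric
head(df, 2)  # first two rows
```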
skimr

The skimr package provides a frictionless approach to displaying summary statistics that the user can skim quickly to understand their data.
install.packages("skimr")
library(skimr)
skim(mtcars)
## Skim summary statistics
## n obs: 32
## n variables: 11
##
## Variable type: numeric
## variable missing complete n mean sd p0 p25 median p75
## am 0 32 32 0.41 0.5 0 0 0 1
## carb 0 32 32 2.81 1.62 1 2 2 4
## cyl 0 32 32 6.19 1.79 4 4 6 8
## disp 0 32 32 230.72 123.94 71.1 120.83 196.3 326
## drat 0 32 32 3.6 0.53 2.76 3.08 3.7 3.92
## gear 0 32 32 3.69 0.74 3 3 4 4
## hp 0 32 32 146.69 68.56 52 96.5 123 180
## mpg 0 32 32 20.09 6.03 10.4 15.43 19.2 22.8
## qsec 0 32 32 17.85 1.79 14.5 16.89 17.71 18.9
## vs 0 32 32 0.44 0.5 0 0 0 1
## wt 0 32 32 3.22 0.98 1.51 2.58 3.33 3.61
## p100 hist
## 1 ▇▁▁▁▁▁▁▆
## 8 ▆▇▂▇▁▁▁▁
## 8 ▆▁▁▃▁▁▁▇
## 472 ▇▆▁▂▅▃▁▂
## 4.93 ▃▇▁▅▇▂▁▁
## 5 ▇▁▁▆▁▁▁▂
## 335 ▃▇▃▅▂▃▁▁
## 33.9 ▃▇▇▇▃▂▂▂
## 22.9 ▃▂▇▆▃▃▁▁
## 1 ▇▁▁▁▁▁▁▆
## 5.42 ▃▃▃▇▆▁▁▂
assertr
The assertr
package supplies a suite of functions designed to verify assumptions about data early in an analysis pipeline so that data errors are spotted early and can be addressed quickly.
install.packages("assertr")
e.g., confirm that mtcars contains the expected columns, has more than 10 rows, and has only positive mpg values before further analysis:
library(dplyr)
library(assertr)
mtcars %>%
verify(has_all_names("mpg", "vs", "am", "wt")) %>%
verify(nrow(.) > 10) %>%
verify(mpg > 0)
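If you want similar guarantees without an extra dependency, here is a rough base R equivalent of the checks above (an illustrative sketch, not the assertr API; check_mtcars is a made-up helper name):

```r
# fail fast if the data violate our assumptions
check_mtcars <- function(df) {
  stopifnot(all(c("mpg", "vs", "am", "wt") %in% names(df)))  # required columns
  stopifnot(nrow(df) > 10)                                   # enough observations
  stopifnot(all(df$mpg > 0))                                 # mpg must be positive
  df                                                         # pass the data through
}
checked <- check_mtcars(mtcars)
```

Returning the data unchanged means the check can sit in the middle of a pipeline, like assertr's verify().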
@tomjwebb don't, not even with a barge pole, not for one second, touch or otherwise edit the raw data files. Do any manipulations in script
— Gavin Simpson (@ucfagls) January 16, 2015
@tomjwebb @srsupp Keep one or a few good master data files (per data collection of interest), and code your formatting with good annotation.
— Desiree Narango (@DLNarango) January 16, 2015
It’s a good idea to revoke your own write permission to the raw data file.
Then you can’t accidentally edit it.
It also makes it harder to do manual edits in a moment of weakness, when you know you should just add a line to your data cleaning script.
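One way to revoke write permission from R itself (shown on a temporary file; for real data you would point it at your raw data path):

```r
# demonstrate making a file read-only
tmp <- tempfile(fileext = ".csv")
writeLines("species,count", tmp)

Sys.chmod(tmp, mode = "0444")   # read-only for owner, group, and others
file.info(tmp)$mode             # octal mode is now 444
```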
[Image: keep a master copy of files. Source: http://www.thebugplanetstore.com/store/master-file/]
@tomjwebb Back it up
— Ben Bond-Lamberty (@BenBondLamberty) January 16, 2015
Most solid version control.
Keep everything in one project folder.
Can be problematic with really large files.