Iteration
Let’s say we want to repeat a process multiple times, iterating over a number of inputs. In this case we want to load every file in /data-raw/wood-survey-data-master/individual/
.
We have a few options for how to approach this problem. In R there are two paradigms for iteration:
- imperative iterations: (for and while loops)
- great place to start because they make iteration very explicit.
- quite verbose, and require quite a bit of bookkeeping code that is duplicated for every
for
loop.
- functional programming: using functions to iterate over other functions.
- focus is on the operation being performed rather than the bookkeeping.
- can be more elegant and succinct.
Iterating using loops
Simple loop
Here’s an example of a simple loop. During each iteration, it prints a message to the console, reporting the value of i
.
for(i in 1:10){
print(paste0("i is ", i))
}
## [1] "i is 1"
## [1] "i is 2"
## [1] "i is 3"
## [1] "i is 4"
## [1] "i is 5"
## [1] "i is 6"
## [1] "i is 7"
## [1] "i is 8"
## [1] "i is 9"
## [1] "i is 10"
The loop iterates over the vector of values supplied in 1:10
, sequentially assigning a new value to variable i
each iteration. i
is therefore the varying input and everything else in the code stays the same during each iteration.
Loops in practice
Reading in multiple files
Let’s now apply a loop to read in all 67 files at once.
We have the file paths in our individual_paths
vector. This is the input we want to iterate over. We can use a for
loop to supply each path as the file
argument in readr::read_csv()
.
Storing loop outputs
The previous loop we saw didn’t generate any new objects, it just printed output to the console. We, however, need to store the output of each iteration (the tibble we’ve just read in).
It’s important for efficiency to allocate sufficient space for the output before starting a for
loop. Growing the for
loop at each iteration, using c()
for example, will be very slow.
Let’s create an output vector to store the tibbles containing the read in data. We want it to be a list because we’ll be storing heterogeneous objects (tibbles) in each element.
<- vector("list", length(individual_paths))
indiv_df_list
head(indiv_df_list)
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL
##
## [[5]]
## NULL
##
## [[6]]
## NULL
We’ve used the length()
of the input to specify the size of our output list so each path gets an output element.
Looping over indices
Next, we need a sequence of indices as long as the input vector (individual_paths
). We can use seq_along()
to create our index vector:
seq_along(individual_paths)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## [51] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67
Now we’re ready to write our for
loop.
for(i in seq_along(individual_paths)){
<- readr::read_csv(individual_paths[i])
indiv_df_list[[i]] }
At each step of the iteration, the file specified in the i
th element of individual_paths
is read in and assigned to th i
th element of our output list.
We can extract individual tibbles using [[
sub-setting to inspect:
1]] indiv_df_list[[
## # A tibble: 376 × 12
## uid namedLocation date eventID domainID siteID plotID individualID
## <chr> <chr> <date> <chr> <chr> <chr> <chr> <chr>
## 1 a36a16… BART_037.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 2 68dc7a… BART_037.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 3 a8951a… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 4 eb348e… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 5 2a4478… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 6 e48520… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 7 280c90… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 8 0e5060… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 9 4918ca… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 10 ef16cb… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## # … with 366 more rows, and 4 more variables: growthForm <chr>,
## # stemDiameter <dbl>, measurementHeight <dbl>, height <dbl>
2]] indiv_df_list[[
## # A tibble: 714 × 12
## uid namedLocation date eventID domainID siteID plotID individualID
## <chr> <chr> <date> <chr> <chr> <chr> <chr> <chr>
## 1 fb75a8… BART_036.base… 2015-09-01 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 2 30a7c7… BART_046.base… 2015-09-01 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 3 789d03… BART_072.base… 2015-09-01 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 4 0e4fb3… BART_072.base… 2015-09-01 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 5 cb0e45… BART_036.base… 2015-09-01 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 6 5fc5cf… BART_072.base… 2015-09-01 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 7 5d15fa… BART_046.base… 2015-09-01 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 8 d27a1b… BART_036.base… 2015-09-01 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 9 d5f9ab… BART_036.base… 2015-09-01 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 10 e52c3b… BART_036.base… 2015-09-01 vst_BA… D01 BART BART_… NEON.PLA.D0…
## # … with 704 more rows, and 4 more variables: growthForm <chr>,
## # stemDiameter <dbl>, measurementHeight <dbl>, height <dbl>
We can also inspect the contents of our output list interactively using View()
Looping over objects
We can also loop over objects instead of indices.
<- vector("list", length(individual_paths))
indiv_df_list names(indiv_df_list) <- basename(individual_paths)
for(path in individual_paths){
basename(path)]] <- readr::read_csv(path)
indiv_df_list[[ }
In this case, we supply the paths themselves as the input to our loop and these are passed as-is to read_csv()
.
This time we don’t have our element indices to index the elements of the output list each tibble should be stored in. To get around this we assign names to each element and index the output list by name.
I’ve chosen to use the basename
(actual file name) of each path as a name, which I can get through basename()
.
1]
individual_paths[basename(individual_paths[1])
## /cloud/project/data-raw/wood-survey-data-master/individual/NEON.D01.BART.DP1.10098.001.vst_apparentindividual.2015-08.basic.20190806T172340Z.csv
## [1] "NEON.D01.BART.DP1.10098.001.vst_apparentindividual.2015-08.basic.20190806T172340Z.csv"
Collapsing our output list into a single tibble.
Now we’ve got our list of tibbles, we want to collapse or “reduce” our output list into a single tibble. There are a number of ways to do this in R.
Base R
One first approach we might think of is to use base function rbind()
. This takes any number of tibbles as arguments and binds them all together.
rbind(indiv_df_list) %>% head()
## NEON.D01.BART.DP1.10098.001.vst_apparentindividual.2015-08.basic.20190806T172340Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D01.BART.DP1.10098.001.vst_apparentindividual.2015-09.basic.20190806T144119Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D01.BART.DP1.10098.001.vst_apparentindividual.2016-08.basic.20190806T143255Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D01.BART.DP1.10098.001.vst_apparentindividual.2016-09.basic.20190806T143433Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D01.BART.DP1.10098.001.vst_apparentindividual.2016-10.basic.20190806T144133Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D01.BART.DP1.10098.001.vst_apparentindividual.2017-07.basic.20190806T144111Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D01.BART.DP1.10098.001.vst_apparentindividual.2017-08.basic.20190806T143426Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D01.BART.DP1.10098.001.vst_apparentindividual.2017-09.basic.20190806T143740Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D01.BART.DP1.10098.001.vst_apparentindividual.2018-08.basic.20190806T143026Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D01.BART.DP1.10098.001.vst_apparentindividual.2018-09.basic.20190806T144743Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D01.HARV.DP1.10098.001.vst_apparentindividual.2015-08.basic.20190806T155155Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D01.HARV.DP1.10098.001.vst_apparentindividual.2015-09.basic.20190806T155228Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D01.HARV.DP1.10098.001.vst_apparentindividual.2015-10.basic.20190806T160029Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D01.HARV.DP1.10098.001.vst_apparentindividual.2015-11.basic.20190806T155340Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D01.HARV.DP1.10098.001.vst_apparentindividual.2016-07.basic.20190806T154424Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D01.HARV.DP1.10098.001.vst_apparentindividual.2016-08.basic.20190806T155619Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D01.HARV.DP1.10098.001.vst_apparentindividual.2016-09.basic.20190806T155751Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D01.HARV.DP1.10098.001.vst_apparentindividual.2016-10.basic.20190806T154902Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D01.HARV.DP1.10098.001.vst_apparentindividual.2017-07.basic.20190806T161731Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D01.HARV.DP1.10098.001.vst_apparentindividual.2017-08.basic.20190806T155239Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D01.HARV.DP1.10098.001.vst_apparentindividual.2017-09.basic.20190806T154054Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D01.HARV.DP1.10098.001.vst_apparentindividual.2017-10.basic.20190806T154917Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D01.HARV.DP1.10098.001.vst_apparentindividual.2018-09.basic.20190806T154756Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D01.HARV.DP1.10098.001.vst_apparentindividual.2018-10.basic.20190904T080421Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D02.BLAN.DP1.10098.001.vst_apparentindividual.2015-09.basic.20190806T180623Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D02.BLAN.DP1.10098.001.vst_apparentindividual.2015-10.basic.20190806T180501Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D02.BLAN.DP1.10098.001.vst_apparentindividual.2016-09.basic.20190806T180452Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D02.BLAN.DP1.10098.001.vst_apparentindividual.2016-11.basic.20190806T162810Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D02.BLAN.DP1.10098.001.vst_apparentindividual.2017-09.basic.20190806T180226Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D02.BLAN.DP1.10098.001.vst_apparentindividual.2017-10.basic.20190806T162804Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D02.BLAN.DP1.10098.001.vst_apparentindividual.2018-09.basic.20190806T162758Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D02.BLAN.DP1.10098.001.vst_apparentindividual.2018-11.basic.20190930T153245Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D03.DSNY.DP1.10098.001.vst_apparentindividual.2018-01.basic.20190806T170456Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D03.DSNY.DP1.10098.001.vst_apparentindividual.2018-05.basic.20190806T165614Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D04.GUAN.DP1.10098.001.vst_apparentindividual.2015-08.basic.20190806T155333Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D04.GUAN.DP1.10098.001.vst_apparentindividual.2016-02.basic.20190806T151351Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D04.GUAN.DP1.10098.001.vst_apparentindividual.2016-03.basic.20190806T151416Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D04.GUAN.DP1.10098.001.vst_apparentindividual.2016-04.basic.20190806T151437Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D04.GUAN.DP1.10098.001.vst_apparentindividual.2016-05.basic.20190806T154733Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D04.GUAN.DP1.10098.001.vst_apparentindividual.2016-06.basic.20190806T155301Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D04.GUAN.DP1.10098.001.vst_apparentindividual.2016-07.basic.20190806T155324Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D04.GUAN.DP1.10098.001.vst_apparentindividual.2016-08.basic.20190806T153300Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D04.GUAN.DP1.10098.001.vst_apparentindividual.2016-09.basic.20190806T151857Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D04.GUAN.DP1.10098.001.vst_apparentindividual.2016-10.basic.20190806T154351Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D04.GUAN.DP1.10098.001.vst_apparentindividual.2016-11.basic.20190806T152215Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D04.GUAN.DP1.10098.001.vst_apparentindividual.2017-03.basic.20190806T152514Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D04.GUAN.DP1.10098.001.vst_apparentindividual.2017-04.basic.20190806T154915Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D04.GUAN.DP1.10098.001.vst_apparentindividual.2017-12.basic.20190806T164409Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D04.GUAN.DP1.10098.001.vst_apparentindividual.2018-01.basic.20190806T150606Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D04.GUAN.DP1.10098.001.vst_apparentindividual.2018-02.basic.20190806T150635Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D07.GRSM.DP1.10098.001.vst_apparentindividual.2015-05.basic.20190806T151458Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D07.GRSM.DP1.10098.001.vst_apparentindividual.2015-06.basic.20190806T151516Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D07.GRSM.DP1.10098.001.vst_apparentindividual.2016-10.basic.20190806T154811Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D07.GRSM.DP1.10098.001.vst_apparentindividual.2016-11.basic.20190806T154932Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D07.GRSM.DP1.10098.001.vst_apparentindividual.2017-10.basic.20190806T155315Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D07.GRSM.DP1.10098.001.vst_apparentindividual.2017-11.basic.20190806T164155Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D07.GRSM.DP1.10098.001.vst_apparentindividual.2018-11.basic.20190930T154643Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D08.DELA.DP1.10098.001.vst_apparentindividual.2015-06.basic.20190806T173627Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D08.DELA.DP1.10098.001.vst_apparentindividual.2015-07.basic.20190806T165116Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D08.DELA.DP1.10098.001.vst_apparentindividual.2015-09.basic.20190806T160924Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D08.DELA.DP1.10098.001.vst_apparentindividual.2016-09.basic.20190806T161107Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D08.DELA.DP1.10098.001.vst_apparentindividual.2016-10.basic.20190806T155600Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D08.DELA.DP1.10098.001.vst_apparentindividual.2017-10.basic.20190806T161504Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D08.DELA.DP1.10098.001.vst_apparentindividual.2017-11.basic.20190806T165044Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D08.DELA.DP1.10098.001.vst_apparentindividual.2018-10.basic.20190904T074322Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D08.DELA.DP1.10098.001.vst_apparentindividual.2018-11.basic.20190930T162311Z.csv
## indiv_df_list spec_tbl_df,12
## NEON.D09.DCFS.DP1.10098.001.vst_apparentindividual.2015-09.basic.20190806T161704Z.csv
## indiv_df_list spec_tbl_df,12
Hmm, that doesn’t seem to have done what we want. That’s because rbind
expects multiple tibbles as inputs and we’re giving it a single list. We somehow want to extract the contents of each element of indiv_df_list
and pass them all to rbind
.
For this we can use do.call
.do.call
takes a function or the name of a function we want to execute as it’s first argument, what
. The second argument of do.call
, args
is a list of arguments we want to pass to the function specified in what
. When do.call
is executed, it extracts the elements of args
and passes them as arguments to what
.
do.call(what = "rbind", args = indiv_df_list)
## # A tibble: 14,961 × 12
## uid namedLocation date eventID domainID siteID plotID individualID
## * <chr> <chr> <date> <chr> <chr> <chr> <chr> <chr>
## 1 a36a16… BART_037.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 2 68dc7a… BART_037.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 3 a8951a… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 4 eb348e… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 5 2a4478… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 6 e48520… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 7 280c90… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 8 0e5060… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 9 4918ca… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 10 ef16cb… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## # … with 14,951 more rows, and 4 more variables: growthForm <chr>,
## # stemDiameter <dbl>, measurementHeight <dbl>, height <dbl>
Success!
Tidyverse
There are also ways to do this using the tidyverse.
purrr::reduce
reduce
from package purrr
combines the elements of a vector or list into a single object according to the function supplied to .f
.
::reduce(indiv_df_list, .f = rbind) purrr
## # A tibble: 14,961 × 12
## uid namedLocation date eventID domainID siteID plotID individualID
## <chr> <chr> <date> <chr> <chr> <chr> <chr> <chr>
## 1 a36a16… BART_037.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 2 68dc7a… BART_037.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 3 a8951a… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 4 eb348e… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 5 2a4478… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 6 e48520… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 7 280c90… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 8 0e5060… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 9 4918ca… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 10 ef16cb… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## # … with 14,951 more rows, and 4 more variables: growthForm <chr>,
## # stemDiameter <dbl>, measurementHeight <dbl>, height <dbl>
dplyr::bind_rows
bind_rows
offers a shortcut to reducing a list of tibbles.
::bind_rows(indiv_df_list) dplyr
## # A tibble: 14,961 × 12
## uid namedLocation date eventID domainID siteID plotID individualID
## <chr> <chr> <date> <chr> <chr> <chr> <chr> <chr>
## 1 a36a16… BART_037.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 2 68dc7a… BART_037.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 3 a8951a… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 4 eb348e… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 5 2a4478… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 6 e48520… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 7 280c90… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 8 0e5060… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 9 4918ca… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 10 ef16cb… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## # … with 14,951 more rows, and 4 more variables: growthForm <chr>,
## # stemDiameter <dbl>, measurementHeight <dbl>, height <dbl>
Functional programming
Loops are an important basic concept in programming. However another approach available in R is functional programming which vectorises a function or pipe of functions over given input(s). We’ve actually just been using functional programming with do.call
and reduce
.
This idea of passing a function to another function is one of the behaviours that makes R a functional programming language and is extremely powerful.
It allows us to:
- use functions rather than
for
loops to perform iteration over other functions. - wrap the code we want to iterate over in custom functions.
This iin turn allows us to replace many for
loops with code that is both more succinct and easier to read.
In base R there is a family of apply functions (lapply
, vapply
, sapply
, apply
, mapply
). These are handy to know if want to write workflows or software that are low on dependencies. However, I prefer using the functions in tidyverse package purrr
.
Iterating using purrr
In the tidyverse such functionality is provided by package purrr
, which provides a complete and consistent set of tools for working with functions and vectors of inputs.
The first thing we might try is to replace our for
loop with a function.
map
The basic purrr
function is map()
and it allows us to pass the elements of an input vector or list to a single argument of a function we want to repeat. It also has a handy shortcut for specifying the argument to pass the input object to.
<- purrr::map(individual_paths,
indiv_df_list ~readr::read_csv(file = .x))
The first argument to map
is the input vector of paths we want to iterate over. The next argument is a formula specifying the function we want to repeat as well as which argument the input is passed to.
Here we’re saying that we want to repeatedly run read_csv
and we indicate the argument we want the input passed to (file
) by .x
. Note as well the ~
notation before the function definition which is shorthand for .f =
.
map_df
Just like our loop, map
returns an output list.
class(indiv_df_list)
## [1] "list"
We would therefore need to combine them together in another step.
However, one of the great things about purrr
functions is that you can specify what you expect the output of the mapped function to be. THere are functions that take advantage of that knowledge and bind or format the outputs appropriately.
Because we know the output of read_csv()
is a tibble, we can use map_df()
instead of map()
.
<- purrr::map_df(individual_paths,
individual ~readr::read_csv(.x))
individual
## # A tibble: 14,961 × 12
## uid namedLocation date eventID domainID siteID plotID individualID
## <chr> <chr> <date> <chr> <chr> <chr> <chr> <chr>
## 1 a36a16… BART_037.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 2 68dc7a… BART_037.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 3 a8951a… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 4 eb348e… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 5 2a4478… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 6 e48520… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 7 280c90… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 8 0e5060… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 9 4918ca… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 10 ef16cb… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## # … with 14,951 more rows, and 4 more variables: growthForm <chr>,
## # stemDiameter <dbl>, measurementHeight <dbl>, height <dbl>
Success! We now have all our data in a single tibble is just two concise lines of code!! 🎉 👏
Some tips on efficiency
While the above code is elegant, it might not be the most efficient. read_csv
calls readr
function type_convert()
to determine the data type for each column when it reads a file in, which is relatively expensive.
The elegant code above mean that type_convert()
is for every file that is loaded, ie 67 times.
A more efficent way of implementing this to set all columns as character on-read and then run type_convert
ourselves, only once, and only after our data have been combined into a single tibble.
We can set all columns to character by default by providing column formating function readr::cols(.default = "c"))
as the read_csv
col_types
argument.
<- purrr::map_df(individual_paths,
individual ~readr::read_csv(.x, col_types = readr::cols(.default = "c"))) %>%
::type_convert()
readr
individual
## # A tibble: 14,961 × 12
## uid namedLocation date eventID domainID siteID plotID individualID
## <chr> <chr> <date> <chr> <chr> <chr> <chr> <chr>
## 1 a36a16… BART_037.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 2 68dc7a… BART_037.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 3 a8951a… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 4 eb348e… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 5 2a4478… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 6 e48520… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 7 280c90… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 8 0e5060… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 9 4918ca… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## 10 ef16cb… BART_044.base… 2015-08-26 vst_BA… D01 BART BART_… NEON.PLA.D0…
## # … with 14,951 more rows, and 4 more variables: growthForm <chr>,
## # stemDiameter <dbl>, measurementHeight <dbl>, height <dbl>
This might come in handy if you are dealing with a huge number of data files.
Other packages to be aware of, especially if you are dealing with very large tables, are data.table
and vroom
.
Learn more about perfomance and efficency in general.
simple benchmark
::microbenchmark({
microbenchmark# tidyverse
::map_df(individual_paths,
purrr~readr::read_csv(.x))},
# tidyverse + read in as character
::map_df(individual_paths,
{purrr~readr::read_csv(.x, col_types = readr::cols(.default = "c"))) %>%
::type_convert()},
readr# vroom package
::vroom(individual_paths)},
{vroom# data.table
lapply(individual_paths, data.table::fread, sep=",") %>%
{do.call("rbind", .)},
# purrr + data.table
::map_df(individual_paths, data.table::fread, sep=",")},
{purrrtimes = 20)
min lq mean median uq max neval372.77828 389.98348 416.70189 395.01877 437.90033 512.19128 20
150.52621 164.60759 190.92799 175.99858 192.44910 322.73216 20
265.06628 272.09506 307.39295 285.53955 320.84952 518.55182 20
50.09148 53.22408 72.28593 58.26753 63.00565 182.50278 20
57.48761 58.76176 63.39054 63.31130 66.08371 75.92438 20
Writing out our tibble to disk
Remember the other two files included in our raw data, vst_mappingandtagging.csv
, and vst_perplotperyear.csv
? Well the truth is they also came in multiple files which I put together in pretty much the same way as you just did!
So for posterity, let’s save this file out too. This isn’t our finished analytic data set, we still have some processing to do. So let’s just save it at raw_data_path
, along with the other files.
To write out a csv file we use readr::write_csv()
%>%
individual ::write_csv(file.path(raw_data_path, "vst_individual.csv")) readr
Learn more about iteration and the family of purrr
functions in the iteration chapter in R for data science