# Indexing and subsetting

R has many powerful subset operators. Mastering them will allow you to easily perform complex operations on any kind of dataset.

There are several ways to subset an object, and three subsetting operators ([, [[ and $) that behave differently depending on the data structure.

## Subsetting vectors

Let’s start by examining subsetting in the simplest data structure, the vector.

Subsetting a vector always returns another vector.

x <- 4:7
x
## [1] 4 5 6 7

### Subsetting using [ and element indices

#### Extracting single elements

To extract elements of a vector we can use the square bracket operator ([) and the index of the target element, starting from one (as R is a 1-indexed language):

x[1]
## [1] 4
x[4]
## [1] 7

It may look different, but the square brackets operator is a function and means “get me the nth element”.
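As a small illustration of the point that [ is an ordinary function, the sketch below calls it by name using backticks. The vector x is recreated so the block runs on its own; the names first_a and first_b are only for illustration.

```r
# Recreate the example vector so this block is self-contained
x <- 4:7

# The usual bracket syntax
first_a <- x[1]

# The same operation written as an ordinary function call
first_b <- `[`(x, 1)

identical(first_a, first_b)
## [1] TRUE
```

In everyday code the bracket syntax is what you would normally write; the function form is mainly useful for understanding what R is doing.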

If we ask for an index beyond the length of the vector, R will return a missing value (NA):

x[6]
## [1] NA

If we ask for the 0th element, we get an empty vector:

x[0]
## integer(0)

#### Extracting multiple elements

We can also ask for multiple elements at once:

x[c(1, 3)]
## [1] 4 6

Or slices of the vector:

x[2:4]
## [1] 5 6 7

We can ask for the same element multiple times:

x[c(1,1,3)]
## [1] 4 4 6

#### Excluding and removing elements

If we use a negative number as the index of a vector, R will return every element except for the one specified:

x[-2]
## [1] 4 6 7

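We can skip multiple elements at once. Negative indices beyond the length of the vector are silently ignored, so in this example only the first element is actually dropped, because x has no fifth element: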

x[c(-1, -5)]  # or x[-c(1,5)]
## [1] 5 6 7

In general, be aware that the result of subsetting using indices could change if the vector is reordered.

### Subsetting using element names

If the vector has a names attribute, we can subset the vector more precisely using the elements’ names:

names(x) <- c("a", "b", "c", "d")

x[c("a", "c")]
## a c
## 4 6

Subsetting using names is the most robust way to extract elements. The position of various elements can often change when chaining together subsetting operations, but the names will always remain the same!
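The difference between positions and names can be sketched as follows. The named vector is recreated here so the block is self-contained, and rev() is used only as a convenient way to reorder it.

```r
# Same values and names as the running example
x <- c(a = 4L, b = 5L, c = 6L, d = 7L)

# Reorder the vector; the names travel with their values
x_rev <- rev(x)

# Position 3 now points at a different element...
by_position <- x_rev[3]
by_position
## b
## 5

# ...while the name "c" still points at the same one
by_name <- x_rev["c"]
by_name
## c
## 6
```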

### Subsetting using logical vectors

We can also use any logical vector to subset:

x[c(FALSE, FALSE, TRUE, TRUE)]
## c d
## 6 7

Since comparison operators (e.g. >, <, ==) evaluate to logical vectors, we can also use them to succinctly subset vectors: the following statement gives the same result as the previous one.

x[x > 5]
## c d
## 6 7

Breaking it down, this statement first evaluates x > 5, generating a logical vector c(FALSE, FALSE, TRUE, TRUE), and then selects the elements of x corresponding to the TRUE values.

We can use == to mimic the previous method of indexing by name (remember you have to use == rather than = for comparisons):

x[names(x) == "a"]
## a
## 4

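Avoid using == to compare numbers unless they are integers, because floating-point values that should be equal can differ by tiny rounding errors. Use a tolerance-based comparison such as dplyr::near() instead.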
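A minimal base R sketch of why == is risky for non-integer numbers, and of two tolerance-based alternatives. The tolerance value 1e-8 and the variable names are arbitrary choices for illustration, not a recommendation.

```r
a <- 0.1 + 0.2
b <- 0.3

# Exact comparison fails because of floating-point rounding
a == b
## [1] FALSE

# all.equal() compares within a small tolerance; wrap it in isTRUE()
# because it returns a description (not FALSE) when values differ
isTRUE(all.equal(a, b))
## [1] TRUE

# The same idea applied to subsetting, with a manual tolerance
tol <- 1e-8  # illustrative tolerance
v <- c(0.1 + 0.2, 0.5, 0.3)
exact_match <- v[v == 0.3]
near_match <- v[abs(v - 0.3) < tol]
length(exact_match)
## [1] 1
length(near_match)
## [1] 2
```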

We also might want to subset using a vector of potential values, that might not necessarily have matches in x.

In this case we can use %in%:

x[names(x) %in% c("a", "c", "e")]
## a c
## 4 6

#### Excluding named elements

Excluding or removing named elements is a little harder.

If we try to skip one named element by negating the string, R complains (slightly obscurely) that it doesn’t know how to take the negative of a string:

x[-"a"]
## Error in -"a": invalid argument to unary operator

However, we can use the != (not-equals) operator to construct a logical vector that will do what we want:

x[names(x) != "a"]
## b c d
## 5 6 7

Excluding multiple named indices requires a different tactic, though.

Suppose we want to drop the "a" and "c" elements, so we try this:

x[names(x) != c("a","c")]
## b c d
## 5 6 7

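R did something, and it apparently gave us the wrong answer (the "c" element is still included in the vector)! Note that R gives no warning here, because the length of names(x) (4) is an exact multiple of the length of c("a","c") (2); a recycling warning is only issued when the lengths do not divide evenly.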

This happens because we are comparing two vectors of different lengths (names(x) and c("a","c")), and comparison operators work element by element, recycling the shorter vector as needed. In effect, R compares "a" in names(x) to "a" in c("a","c") and returns FALSE (i.e. "a" != "a" is FALSE), then compares "b" in names(x) to "c" in c("a","c") and returns TRUE. When it reaches "c" in names(x), R recycles the comparison vector c("a","c") and starts again with "a". Since "c" is not equal to "a", "c" != "a" returns TRUE and the element is kept.
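The recycling described above can be made visible by building the recycled vector by hand. This is only a sketch: rep_len() is used to mimic what R does internally, and the names to_drop and recycled are illustrative.

```r
x <- c(a = 4L, b = 5L, c = 6L, d = 7L)
to_drop <- c("a", "c")

# What names(x) is effectively compared against after recycling
recycled <- rep_len(to_drop, length(x))
recycled
## [1] "a" "c" "a" "c"

# Element-by-element comparison: only "a" lines up with itself
mismatch <- names(x) != recycled
mismatch
## [1] FALSE  TRUE  TRUE  TRUE

# Identical to the comparison R performs with the shorter vector
identical(mismatch, names(x) != to_drop)
## [1] TRUE
```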

On the other hand this works, but only by chance:

x[names(x) != c("a","b")]
## c d
## 6 7

To perform such a subset robustly, we need to combine %in% and !.

x[!names(x) %in% c("a","c")]
## b d
## 5 7

This checks, for each name of x, whether it matches any of the values in c("a","c"), producing a logical vector that is TRUE where there is a match. The ! then negates that vector, so the subset keeps only the elements whose names are not contained in c("a","c").
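Breaking the expression into steps may make the logic easier to follow. In the sketch below, is_dropped and to_drop are illustrative names; setdiff() is shown as one possible alternative, with the caveat that it works on the set of names and so assumes the names are unique.

```r
x <- c(a = 4L, b = 5L, c = 6L, d = 7L)
to_drop <- c("a", "c")

# Step 1: which names appear in the set to drop?
is_dropped <- names(x) %in% to_drop
is_dropped
## [1]  TRUE FALSE  TRUE FALSE

# Step 2: negate and subset
kept <- x[!is_dropped]
kept
## b d
## 5 7

# Alternative: compute the names to keep, then subset by name
kept_by_name <- x[setdiff(names(x), to_drop)]
identical(kept, kept_by_name)
## [1] TRUE
```

Writing !names(x) %in% to_drop without parentheses works because %in% is evaluated before the ! is applied.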

## Matrix subsetting

As matrices are just 2D vectors, all the subsetting operations using [ can also be applied to matrices.

### Subsetting using element indices

Let’s create a matrix:

m <- matrix(1:12, ncol=4, nrow=3)
m
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

Indexing matrices with [ takes two arguments: the first expression is applied to the rows, the second to the columns.

Say we want the 2nd and 3rd rows of the last and first columns (in that order) of our matrix. We can use everything we learned about subsetting vectors and apply it to each dimension of our matrix.

m[2:3, c(4,1)]
##      [,1] [,2]
## [1,]   11    2
## [2,]   12    3

#### Subsetting whole rows or columns

We can leave the first or second arguments blank to retrieve all the rows or columns respectively:

m[, c(2,3)]
##      [,1] [,2]
## [1,]    4    7
## [2,]    5    8
## [3,]    6    9
m[c(2,3),]
##      [,1] [,2] [,3] [,4]
## [1,]    2    5    8   11
## [2,]    3    6    9   12

If we only access one row or column, R will automatically convert the result to a vector:

m[3,]
## [1]  3  6  9 12

If we want to keep the output as a matrix, we need to specify a third argument, drop = FALSE:

m[3, , drop=FALSE]
##      [,1] [,2] [,3] [,4]
## [1,]    3    6    9   12
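One way to confirm the effect of drop is to inspect the dimensions of each result. The matrix is recreated so the sketch is self-contained; row_vec and row_mat are illustrative names.

```r
m <- matrix(1:12, ncol = 4, nrow = 3)

# Default behaviour: the single row is simplified to a plain vector
row_vec <- m[3, ]
dim(row_vec)
## NULL

# With drop = FALSE the result stays a 1 x 4 matrix
row_mat <- m[3, , drop = FALSE]
dim(row_mat)
## [1] 1 4
```

Keeping the matrix structure can matter when later code expects two dimensions, for example when calling nrow() or ncol() on the result.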

Tip: Higher dimensional arrays

When dealing with multi-dimensional arrays, each argument to [ corresponds to a dimension. For example, in a 3D array, the first three arguments correspond to the row, column, and depth dimensions.
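A short sketch of the same idea with a 3D array. The array arr and its dimensions (2 rows, 3 columns, 4 slices) are made up for illustration.

```r
# A 2 x 3 x 4 array filled with 1 to 24
arr <- array(1:24, dim = c(2, 3, 4))

# One element: row 2, column 1, slice 3
arr[2, 1, 3]
## [1] 14

# Leaving arguments blank keeps whole dimensions, as with matrices
arr[, , 1]
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

# Dimensions of length one are dropped unless drop = FALSE is given
dim(arr[1, , ])
## [1] 3 4
dim(arr[1, , , drop = FALSE])
## [1] 1 3 4
```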

## Subsetting lists

There are three functions used to subset lists and extract individual elements: [, [[, and $.

### Subsetting list elements

Using [ will always return a list. If you want to subset a list, but not extract an element, then you will likely use [.

xlist <- list(a = "ACCE DTP", b = 1:10, data = head(iris))

#### Subsetting by element indices

As with vectors, we can use element indices and [ to subset lists.

xlist[1]
## $a
## [1] "ACCE DTP"

This returns a list with one element.

We can use multiple indices to subset multiple list elements:

xlist[1:2]
## $a
## [1] "ACCE DTP"
##
## $b
##  [1]  1  2  3  4  5  6  7  8  9 10

#### Subsetting by name

We can also use names:

xlist[c("a", "b")]
## $a
## [1] "ACCE DTP"
##
## $b
##  [1]  1  2  3  4  5  6  7  8  9 10

It is accessing the list as if it were a vector and returning a list.

Comparison operations involving the contents of list elements, however, won’t work, as the contents are not accessible at the level of [ indexing.
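If a subset based on the contents of the elements is needed, one possible approach (not covered in the text above, so treat it as a sketch) is to build a logical vector with vapply() and then use it with [. The name is_num is illustrative.

```r
xlist <- list(a = "ACCE DTP", b = 1:10, data = head(iris))

# Apply a test to each element; vapply() guarantees one logical per element
is_num <- vapply(xlist, is.numeric, logical(1))

# Use the logical vector with [ exactly as for an ordinary vector
num_only <- xlist[is_num]
names(num_only)
## [1] "b"
```

The result is still a list, as always with [; only the choice of which elements to keep was driven by their contents.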

### Extracting individual elements

Extracting individual elements allows us to access the objects contained in a list, which can be any type of object. Hence the result depends on the object each element contains.

To extract individual elements of a list, we use the double-square bracket function: [[.

#### Extracting by element index

Again we can use element indices to extract the object contained in an element.

xlist[[2]]
##  [1]  1  2  3  4  5  6  7  8  9 10

Notice that now the result is a vector, not a list, which is what the second element contained.
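The contrast between [ and [[ can be checked directly by inspecting the class and length of each result. The list is recreated so the sketch runs on its own; sub and el are illustrative names.

```r
xlist <- list(a = "ACCE DTP", b = 1:10, data = head(iris))

sub <- xlist[2]    # a list containing one element
el <- xlist[[2]]   # the element itself

class(sub)
## [1] "list"
class(el)
## [1] "integer"

length(sub)
## [1] 1
length(el)
## [1] 10
```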

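You can’t extract more than one element at once. A vector of indices inside [[ is interpreted recursively, so here R looks for the second element inside the first element, which does not exist: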

xlist[[1:2]]
## Error in xlist[[1:2]]: subscript out of bounds

Nor use it to skip elements:

xlist[[-1]]
## Error in xlist[[-1]]: invalid negative subscript in get1index <real>
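The two errors above have straightforward workarounds, sketched below. A vector index inside [[ is applied recursively, and skipping elements is a job for the single bracket. The list is recreated so the block is self-contained.

```r
xlist <- list(a = "ACCE DTP", b = 1:10, data = head(iris))

# [[c(2, 3)]] means "third element of the second element"
nested <- xlist[[c(2, 3)]]
nested
## [1] 3
identical(nested, xlist[[2]][[3]])
## [1] TRUE

# To skip elements, use [ with a negative index; the result is a list
rest <- xlist[-1]
names(rest)
## [1] "b"    "data"
```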

#### Extracting by element name

We can however use single names to extract elements:

xlist[["a"]]
## [1] "ACCE DTP"

The $ operator provides a shorthand for extracting elements by name:

xlist$data
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

##### List subsetting challenge

Given the following list:

xlist <- list(a = "ACCE DTP", b = 1:10, data = head(iris))

and using your knowledge of both list and vector subsetting, extract the number 2 from xlist. Hint: the number 2 is contained within the “b” item in the list.

## Subsetting data.frames

Data frames are lists under the hood, so similar subsetting rules apply. However, they are also two-dimensional objects.

### Subsetting data.frames as a list

#### Using [ to subset

Using the [ operator with one argument will act the same way as for lists, where each list element corresponds to a column. The resulting object will be a data.frame:

trees[1]
##    Girth
## 1    8.3
## 2    8.6
## 3    8.8
## 4   10.5
## 5   10.7
## 6   10.8
## 7   11.0
## 8   11.0
## 9   11.1
## 10  11.2
## 11  11.3
## 12  11.4
## 13  11.4
## 14  11.7
## 15  12.0
## 16  12.9
## 17  12.9
## 18  13.3
## 19  13.7
## 20  13.8
## 21  14.0
## 22  14.2
## 23  14.5
## 24  16.0
## 25  16.3
## 26  17.3
## 27  17.5
## 28  17.9
## 29  18.0
## 30  18.0
## 31  20.6
trees["Girth"]
##    Girth
## 1    8.3
## 2    8.6
## 3    8.8
## 4   10.5
## 5   10.7
## 6   10.8
## 7   11.0
## 8   11.0
## 9   11.1
## 10  11.2
## 11  11.3
## 12  11.4
## 13  11.4
## 14  11.7
## 15  12.0
## 16  12.9
## 17  12.9
## 18  13.3
## 19  13.7
## 20  13.8
## 21  14.0
## 22  14.2
## 23  14.5
## 24  16.0
## 25  16.3
## 26  17.3
## 27  17.5
## 28  17.9
## 29  18.0
## 30  18.0
## 31  20.6

#### Using [[ to extract

Similarly, [[ will act to extract a single column as a vector:

trees[[1]]
##  [1]  8.3  8.6  8.8 10.5 10.7 10.8 11.0 11.0 11.1 11.2 11.3 11.4 11.4 11.7 12.0
## [16] 12.9 12.9 13.3 13.7 13.8 14.0 14.2 14.5 16.0 16.3 17.3 17.5 17.9 18.0 18.0
## [31] 20.6
trees[["Girth"]]
##  [1]  8.3  8.6  8.8 10.5 10.7 10.8 11.0 11.0 11.1 11.2 11.3 11.4 11.4 11.7 12.0
## [16] 12.9 12.9 13.3 13.7 13.8 14.0 14.2 14.5 16.0 16.3 17.3 17.5 17.9 18.0 18.0
## [31] 20.6

And $ provides a convenient shorthand to extract columns by name:

trees$Girth
##  [1]  8.3  8.6  8.8 10.5 10.7 10.8 11.0 11.0 11.1 11.2 11.3 11.4 11.4 11.7 12.0
## [16] 12.9 12.9 13.3 13.7 13.8 14.0 14.2 14.5 16.0 16.3 17.3 17.5 17.9 18.0 18.0
## [31] 20.6

### Subsetting data.frames as a matrix

With two arguments, [ behaves the same way as for matrices:

trees[1:5, c("Girth", "Volume")]
##   Girth Volume
## 1   8.3   10.3
## 2   8.6   10.3
## 3   8.8   10.2
## 4  10.5   16.4
## 5  10.7   18.8

If we subset a single row, the result will be a data.frame (because the columns of a data frame can hold different types):

trees[3,]
##   Girth Height Volume
## 3   8.8     63   10.2

But for a single column the result will be a vector.

trees[, "Girth"]
##  [1]  8.3  8.6  8.8 10.5 10.7 10.8 11.0 11.0 11.1 11.2 11.3 11.4 11.4 11.7 12.0
## [16] 12.9 12.9 13.3 13.7 13.8 14.0 14.2 14.5 16.0 16.3 17.3 17.5 17.9 18.0 18.0
## [31] 20.6

This can be changed with the third argument, drop = FALSE:

trees[, "Girth", drop=FALSE]
##    Girth
## 1    8.3
## 2    8.6
## 3    8.8
## 4   10.5
## 5   10.7
## 6   10.8
## 7   11.0
## 8   11.0
## 9   11.1
## 10  11.2
## 11  11.3
## 12  11.4
## 13  11.4
## 14  11.7
## 15  12.0
## 16  12.9
## 17  12.9
## 18  13.3
## 19  13.7
## 20  13.8
## 21  14.0
## 22  14.2
## 23  14.5
## 24  16.0
## 25  16.3
## 26  17.3
## 27  17.5
## 28  17.9
## 29  18.0
## 30  18.0
## 31  20.6
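To summarise the matrix-style behaviour described in this section, the sketch below checks the type of each result using the built-in trees dataset. The variable names are illustrative.

```r
# A single row stays a data.frame
row3 <- trees[3, ]
is.data.frame(row3)
## [1] TRUE

# A single column is simplified to a vector by default
col_vec <- trees[, "Girth"]
is.data.frame(col_vec)
## [1] FALSE
is.numeric(col_vec)
## [1] TRUE

# drop = FALSE keeps the one-column data.frame
col_df <- trees[, "Girth", drop = FALSE]
dim(col_df)
## [1] 31  1
```

When writing functions that may receive either one or several columns, passing drop = FALSE is a common way to make sure the result is always a data.frame.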