Indexing and subsetting
R has many powerful subset operators. Mastering them will allow you to easily perform complex operations on any kind of dataset.
There are many different ways we can subset any kind of object, and three different subsetting operators for the different data structures.
Subsetting vectors
Let’s start by examining subsetting in the simplest data structure, the vector.
Subsetting a vector always returns another vector.
x <- 4:7
x
## [1] 4 5 6 7
Subsetting using [ and element indices
Extracting single elements
To extract elements of a vector we can use the square bracket operator ([) and the index of the target element, starting from one (as R is a 1-indexed language):
x[1]
## [1] 4
x[4]
## [1] 7
It may look different, but the square brackets operator is a function and means “get me the nth element”.
If we ask for an index beyond the length of the vector, R will return a missing value (NA):
x[6]
## [1] NA
If we ask for the 0th element, we get an empty vector:
x[0]
## integer(0)
Extracting multiple elements
We can also ask for multiple elements at once:
x[c(1, 3)]
## [1] 4 6
Or slices of the vector:
x[2:4]
## [1] 5 6 7
We can ask for the same element multiple times:
x[c(1,1,3)]
## [1] 4 4 6
Excluding and removing elements
If we use a negative number as the index of a vector, R will return every element except for the one specified:
x[-2]
## [1] 4 6 7
We can skip multiple elements:
x[c(-1, -5)] # or x[-c(1,5)]
## [1] 5 6 7
In general, be aware that the result of subsetting using indices could change if the vector is reordered.
Subsetting using element names
If the vector has a names attribute, we can subset the vector more precisely using the elements' names:
names(x) <- c("a", "b", "c", "d")
x[c("a", "c")]
## a c
## 4 6
Subsetting using names is the most robust way to extract elements. The position of an element can often change when chaining together subsetting operations, but its name stays the same!
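As a minimal sketch of why names are more robust than positions, the example below re-creates the named vector x from above and reorders it with rev() (chosen here purely for illustration). The same index then points at a different element, while the name still finds the original value.

```r
x <- 4:7
names(x) <- c("a", "b", "c", "d")

# Reverse the vector; each value keeps its name
x_rev <- rev(x)

# Position 1 refers to different elements before and after reordering
by_index_before <- x[1]      # the element named "a"
by_index_after  <- x_rev[1]  # the element named "d"

# The name "a" finds the same value in both vectors
by_name_before <- x["a"]
by_name_after  <- x_rev["a"]
```

Here by_index_before and by_index_after differ, whereas by_name_before and by_name_after are identical.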
Subsetting using logical vectors
We can also use any logical vector to subset:
x[c(FALSE, FALSE, TRUE, TRUE)]
## c d
## 6 7
Since comparison operators (e.g. >, <, ==) evaluate to logical vectors, we can also use them to succinctly subset vectors: the following statement gives the same result as the previous one.
x[x > 5]
## c d
## 6 7
Breaking it down, this statement first evaluates x > 5, generating a logical vector c(FALSE, FALSE, TRUE, TRUE), and then selects the elements of x corresponding to the TRUE values.
We can use == to mimic the previous method of indexing by name (remember you have to use == rather than = for comparisons):
x[names(x) == "a"]
## a
## 4
Avoid using == to compare numbers unless they are integers! Floating-point values that look equal can differ by a tiny rounding error; see the function dplyr::near() instead.
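A small base R sketch of the problem and one possible workaround follows. The tolerance tol and the vector y are illustrative choices, not part of the original example; all.equal() is the base R tolerance-based comparison.

```r
# 0.1 + 0.2 is not stored as exactly 0.3 in floating point
exact <- (0.1 + 0.2) == 0.3

# Compare within a tolerance instead (tol is an illustrative value)
tol <- 1e-8
within_tol <- abs((0.1 + 0.2) - 0.3) < tol

# all.equal() performs a similar tolerance-based check;
# wrap it in isTRUE() because it returns a message when values differ
base_check <- isTRUE(all.equal(0.1 + 0.2, 0.3))

# Subsetting a numeric vector by approximate equality
y <- c(0.1 + 0.2, 0.5, 0.3)
y_close <- y[abs(y - 0.3) < tol]
```

With these values, exact is FALSE while within_tol and base_check are TRUE, and y_close keeps the first and third elements of y.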
We also might want to subset using a vector of potential values that might not necessarily have matches in x. In this case we can use %in%:
x[names(x) %in% c("a", "c", "e")]
## a c
## 4 6
Excluding named elements
Excluding or removing named elements is a little harder.
If we try to skip one named element by negating the string, R complains (slightly obscurely) that it doesn’t know how to take the negative of a string:
x[-"a"]
## Error in -"a": invalid argument to unary operator
However, we can use the != (not-equals) operator to construct a logical vector that will do what we want:
x[names(x) != "a"]
## b c d
## 5 6 7
Excluding multiple named indices requires a different tactic though. Suppose we want to drop the "a" and "c" elements, so we try this:
x[names(x) != c("a","c")]
## b c d
## 5 6 7
R did something, but it apparently gave us the wrong answer: the "c" element is still included in the vector! Worse, it did so silently. The length of names(x) (4) is an exact multiple of the length of c("a","c") (2), and R only warns about recycling when the longer length is not a multiple of the shorter one.
This happens because we are comparing two vectors (names(x) and c("a","c")) and comparison operators work element by element. In effect, R compares "a" in names(x) to "a" in c("a","c") and returns FALSE (i.e. "a" != "a" is FALSE), then "b" in names(x) to "c" in c("a","c") and returns TRUE. When it reaches "c" in names(x), R has run out of elements in c("a","c"), so it recycles the comparison vector and starts again with "a". "c" is not equal to "a", so "c" != "a" returns TRUE and the element is kept.
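To make the recycling visible, the sketch below builds the recycled comparison vector explicitly with rep_len() (used here only for illustration; R does this internally) and shows that comparing against it gives the same logical vector.

```r
x <- 4:7
names(x) <- c("a", "b", "c", "d")

# The comparison vector after recycling to the length of names(x)
recycled <- rep_len(c("a", "c"), length(names(x)))
# recycled is "a" "c" "a" "c"

# Element-by-element comparison, same as names(x) != c("a", "c")
keep <- names(x) != recycled
# keep is FALSE TRUE TRUE TRUE, so only "a" is dropped

wrong_subset <- x[keep]
```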
On the other hand this works, but only by chance:
x[names(x) != c("a","b")]
## c d
## 6 7
To perform such a subset robustly, we need to combine %in% and !.
x[!names(x) %in% c("a","c")]
## b d
## 5 7
names(x) %in% c("a","c") checks, for each name of x, whether it matches any of the values in c("a","c"), giving TRUE where it does. The ! then negates that logical vector, so the subset returns only the elements whose names are not contained in c("a","c").
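An alternative sketch, assuming the names of x are unique, is to compute the names to keep with setdiff() and then subset by name. This is offered only as another way to express the same exclusion, not as a replacement for the approach above.

```r
x <- 4:7
names(x) <- c("a", "b", "c", "d")

# Names of x that are not in the exclusion set
keep_names <- setdiff(names(x), c("a", "c"))

# Subset by the remaining names
by_setdiff <- x[keep_names]

# The same result via the logical approach
by_logical <- x[!names(x) %in% c("a", "c")]
```

If names were duplicated, setdiff() would drop the repeats and subsetting by name would return only the first match, so the logical form is the safer default.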
Matrix subsetting
As matrices are just 2d vectors, all the subsetting operations using [ can also be applied to matrices.
Subsetting using element indices
Let’s create a matrix:
m <- matrix(1:12, ncol=4, nrow=3)
m
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
Indexing matrices with [ takes two arguments: the first expression is applied to the rows, the second to the columns.
Say we want the 2nd and 3rd rows of the last and first columns (in that order) of our matrix. We can use all the subsetting we learned for vectors and apply it to each dimension of our matrix.
m[2:3, c(4,1)]
##      [,1] [,2]
## [1,]   11    2
## [2,]   12    3
Subsetting whole rows or columns
We can leave the first or second arguments blank to retrieve all the rows or columns respectively:
m[, c(2,3)]
##      [,1] [,2]
## [1,]    4    7
## [2,]    5    8
## [3,]    6    9
m[c(2,3), ]
##      [,1] [,2] [,3] [,4]
## [1,]    2    5    8   11
## [2,]    3    6    9   12
If we only access one row or column, R will automatically convert the result to a vector:
m[3, ]
## [1] 3 6 9 12
If we want to keep the output as a matrix, we need to specify a third argument, drop = FALSE:
m[3, , drop=FALSE]
##      [,1] [,2] [,3] [,4]
## [1,]    3    6    9   12
Tip: Higher dimensional arrays
When dealing with multi-dimensional arrays, each argument to [ corresponds to a dimension. For example, in a 3D array, the first three arguments correspond to the rows, columns, and depth dimension.
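As a brief sketch of the tip above, the example below builds a small 2 x 3 x 4 array (the name arr and its dimensions are illustrative) and subsets it with one argument per dimension.

```r
# A 3D array with 2 rows, 3 columns and 4 layers, filled column by column
arr <- array(1:24, dim = c(2, 3, 4))

# Row 1, column 2, layer 3: a single element
single <- arr[1, 2, 3]

# Leaving the first two arguments blank keeps all rows and columns
# of layer 1; the result is simplified to a 2 x 3 matrix
layer1 <- arr[, , 1]

# drop = FALSE keeps all three dimensions, as with matrices
kept <- arr[1, 2, 3, drop = FALSE]
```

Here single is 15, layer1 has dimensions 2 x 3, and kept is a 1 x 1 x 1 array.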
Subsetting lists
There are three functions used to subset lists and extract individual elements: [, [[, and $.
Subsetting list elements
Using [ will always return a list. If you want to subset a list, but not extract an element, then you will likely use [.
xlist <- list(a = "ACCE DTP", b = 1:10, data = head(iris))
Subsetting by element indices
As with vectors, we can use element indices and [ to subset lists.
xlist[1]
## $a
## [1] "ACCE DTP"
This returns a list with one element.
We can use multiple indices to subset multiple list elements:
xlist[1:2]
## $a
## [1] "ACCE DTP"
##
## $b
## [1] 1 2 3 4 5 6 7 8 9 10
Subsetting by name
We can also use names:
xlist[c("a", "b")]
## $a
## [1] "ACCE DTP"
##
## $b
## [1] 1 2 3 4 5 6 7 8 9 10
This accesses the list as if it were a vector and returns a list.
Comparison operations on the contents of list elements, however, won’t work, as the contents are not accessible at the level of [ indexing.
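One possible workaround, sketched here under the assumption that we want to keep list elements by a property of their contents, is to compute a logical vector with vapply() first and then pass it to [. The choice of is.numeric as the test is purely illustrative.

```r
xlist <- list(a = "ACCE DTP", b = 1:10, data = head(iris))

# Apply a test to the contents of each element;
# vapply() returns one TRUE/FALSE per list element
is_num <- vapply(xlist, is.numeric, logical(1))

# Use the logical vector with [ as we would for a plain vector;
# the result is still a list
numeric_only <- xlist[is_num]
```

Only the b element passes the test here, so numeric_only is a list of length one.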
Extracting individual elements
Extracting individual elements allows us to access the objects contained in a list, which can be any type of object. Hence the result depends on the object each element contains.
To extract individual elements of a list, we use the double-square bracket function: [[.
Extracting by element index
Again we can use element indices to extract the object contained in an element.
xlist[[2]]
## [1] 1 2 3 4 5 6 7 8 9 10
Notice that now the result is a vector, not a list, which is what the second element contained.
You can’t extract more than one element at once:
xlist[[1:2]]
## Error in xlist[[1:2]]: subscript out of bounds
Nor use it to skip elements:
xlist[[-1]]
## Error in xlist[[-1]]: invalid negative subscript in get1index <real>
Extracting by element name
We can, however, use single names to extract elements:
xlist[["a"]]
## [1] "ACCE DTP"
The $ operator
The $ operator is a shorthand way of extracting single elements by name:
xlist$data
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
List subsetting challenge
Given the following list:
xlist <- list(a = "ACCE DTP", b = 1:10, data = head(iris))
and using your knowledge of both list and vector subsetting, extract the number 2 from xlist.
Hint: the number 2 is contained within the “b” item in the list.
Subsetting data.frames
Data frames are lists under the hood, so similar subsetting rules apply. However, they are also two-dimensional objects.
Subsetting data.frames as a list
Using [ to subset
Using the [ operator with one argument will act the same way as for lists, where each list element corresponds to a column. The resulting object will be a data.frame:
trees[1]
## Girth
## 1 8.3
## 2 8.6
## 3 8.8
## 4 10.5
## 5 10.7
## 6 10.8
## 7 11.0
## 8 11.0
## 9 11.1
## 10 11.2
## 11 11.3
## 12 11.4
## 13 11.4
## 14 11.7
## 15 12.0
## 16 12.9
## 17 12.9
## 18 13.3
## 19 13.7
## 20 13.8
## 21 14.0
## 22 14.2
## 23 14.5
## 24 16.0
## 25 16.3
## 26 17.3
## 27 17.5
## 28 17.9
## 29 18.0
## 30 18.0
## 31 20.6
trees["Girth"]
## Girth
## 1 8.3
## 2 8.6
## 3 8.8
## 4 10.5
## 5 10.7
## 6 10.8
## 7 11.0
## 8 11.0
## 9 11.1
## 10 11.2
## 11 11.3
## 12 11.4
## 13 11.4
## 14 11.7
## 15 12.0
## 16 12.9
## 17 12.9
## 18 13.3
## 19 13.7
## 20 13.8
## 21 14.0
## 22 14.2
## 23 14.5
## 24 16.0
## 25 16.3
## 26 17.3
## 27 17.5
## 28 17.9
## 29 18.0
## 30 18.0
## 31 20.6
Using [[ to extract
Similarly, [[ will act to extract a single column as a vector:
trees[[1]]
## [1] 8.3 8.6 8.8 10.5 10.7 10.8 11.0 11.0 11.1 11.2 11.3 11.4 11.4 11.7 12.0
## [16] 12.9 12.9 13.3 13.7 13.8 14.0 14.2 14.5 16.0 16.3 17.3 17.5 17.9 18.0 18.0
## [31] 20.6
trees[["Girth"]]
## [1] 8.3 8.6 8.8 10.5 10.7 10.8 11.0 11.0 11.1 11.2 11.3 11.4 11.4 11.7 12.0
## [16] 12.9 12.9 13.3 13.7 13.8 14.0 14.2 14.5 16.0 16.3 17.3 17.5 17.9 18.0 18.0
## [31] 20.6
And $ provides a convenient shorthand to extract columns by name:
trees$Girth
## [1] 8.3 8.6 8.8 10.5 10.7 10.8 11.0 11.0 11.1 11.2 11.3 11.4 11.4 11.7 12.0
## [16] 12.9 12.9 13.3 13.7 13.8 14.0 14.2 14.5 16.0 16.3 17.3 17.5 17.9 18.0 18.0
## [31] 20.6
Subsetting data.frames as a matrix
With two arguments, [ behaves the same way as for matrices:
trees[1:5, c("Girth", "Volume")]
## Girth Volume
## 1 8.3 10.3
## 2 8.6 10.3
## 3 8.8 10.2
## 4 10.5 16.4
## 5 10.7 18.8
If we subset a single row, the result will be a data.frame (because the elements of a row can be of mixed types):
trees[3, ]
## Girth Height Volume
## 3 8.8 63 10.2
But for a single column the result will be a vector:
trees[, "Girth"]
## [1] 8.3 8.6 8.8 10.5 10.7 10.8 11.0 11.0 11.1 11.2 11.3 11.4 11.4 11.7 12.0
## [16] 12.9 12.9 13.3 13.7 13.8 14.0 14.2 14.5 16.0 16.3 17.3 17.5 17.9 18.0 18.0
## [31] 20.6
This can be changed with the third argument, drop = FALSE:
trees[, "Girth", drop=FALSE]
## Girth
## 1 8.3
## 2 8.6
## 3 8.8
## 4 10.5
## 5 10.7
## 6 10.8
## 7 11.0
## 8 11.0
## 9 11.1
## 10 11.2
## 11 11.3
## 12 11.4
## 13 11.4
## 14 11.7
## 15 12.0
## 16 12.9
## 17 12.9
## 18 13.3
## 19 13.7
## 20 13.8
## 21 14.0
## 22 14.2
## 23 14.5
## 24 16.0
## 25 16.3
## 26 17.3
## 27 17.5
## 28 17.9
## 29 18.0
## 30 18.0
## 31 20.6
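The logical subsetting shown earlier for vectors also works in the row position of a data.frame. As a minimal sketch using the built-in trees dataset (the threshold of 17 is an arbitrary illustrative value), we can keep only the rows that satisfy a condition and select columns at the same time:

```r
# Logical vector with one value per row of trees
is_big <- trees$Girth > 17

# Rows where the condition is TRUE, and two columns by name
big_trees <- trees[is_big, c("Girth", "Volume")]

# With a single column, drop = FALSE keeps the result as a data.frame
big_girth <- trees[is_big, "Girth", drop = FALSE]
```

Both results are data.frames containing only the rows whose Girth exceeds the threshold.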