# install.packages("dplyr")
NB: Dplyr
Programming for Data Science
What is Dplyr?
The Dplyr package is a cetnral component of the Tidyverse.
Technically, Dplyr is an optimized and distilled version of the previous Plyr package.
Plyr organizes much of the functionality of the family of apply functions in base R — functions like apply()
, lapply()
, sapply()
, and tapply()
.
These functions perform the same action on multiple chunks of data in a repetitive way.
For example, apply()
applies a function to every element of a row or column in a dataset.
Why Dplyr?
Dplyr was developed to provide a clean and intelligble way to process data tables.
It does this by providing a simple grammar for data manipulation and for operating on data frames.
With this grammar, you can clearly express and communicate what it is that you are doing to a data frame that other people can understand.
This is useful because it provides an abstraction for data manipulation that previously did not exist within R and other programming languages.
Another useful contribution is that the Dplyr functions are very fast, as many key operations are coded in C++.
Note that Dplyr does not provide any new functionality to R per se.
Everything Dplyr does may already be done with base R.
But it greatly simplifies existing functionality in R.
In the table below, you can see that that Dplyr provides a consistent way of phrasing the someone chaotic idioms of base R.
dplyr | base R |
---|---|
arrange(df, x) |
df[order(x), , drop = FALSE] |
distinct(df, x) |
df[!duplicated(x), , drop = FALSE], unique() |
filter(df, x) |
df[which(x), , drop = FALSE], subset() |
mutate(df, z = x + y) |
df$z <- df$x + df$y, transform() |
pull(df, 1) |
df[[1]] |
pull(df, x) |
df$x |
rename(df, y = x) |
names(df)[names(df) == "x"] <- "y" |
relocate(df, y) |
df[union("y", names(df))] |
select(df, x, y) |
df[c("x", "y")], subset() |
select(df, starts_with("x")) |
df[grepl(" |
summarise(df, mean(x)) |
mean(df$x), tapply(), aggregate(), by() |
slice(df, c(1, 2, 5)) |
df[c(1, 2, 5), , drop = FALSE] |
A Grammar of Data
Some of the key “verbs” provided by the Dplyr package are:
select
: return a subset of the columns of a data frame, using a flexible notation.filter
: extract a subset of rows from a data frame based on logical conditions.arrange
: reorder rows of a. data frame.rename
: rename variables in a data frame.mutate
: add new variables/columns or transform existing variables.summarise
/summarize
: generate summary statistics of different variables in the data frame, possibly within strata.
%>%
: the pipe operator is used to connect multiple verb actions together into a pipeline.
Common Function Properties
All Dplyr functions have some common characteristics:
- The first argument is a data frame (or tibble).
- The succeeding arguments describe what to do with the data frame specified in the first argument.
- The return result of a function is a new data frame.
In addition, Dplyr expects data frames to be be properly structured as tidy data.
Installing Dplyr
You can install Dplyr directly or as part of Tidyverse, which we installed in the previous notebook.
Once the package is installed, you may load it into your R session by calling the library()
function.
# library(dplyr)
Of course, you can also just install tidyverse
if you plan to visualize your results.
Since we did this in our previous notebook, we can import it now.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
You may get some warnings when the package is loaded because there are functions in the dplyr
package that have the same name as functions in other packages. For now you can ignore the warnings.
Getting a dataset
Let’s use a dataset containing air pollution and temperature data for the city of Chicago in the U.S.
You will find the data in the directory for this module.
We load the data into R using the readRDS()
function and then convert it to a tibble.
<- readRDS("chicago.rds") %>% as_tibble() chicago
We can see some basic characteristics of the dataset with the print()
function.
print(chicago)
# A tibble: 6,940 × 8
city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
<chr> <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl>
1 chic 31.5 31.5 1987-01-01 NA 34 4.25 20.0
2 chic 33 29.9 1987-01-02 NA NA 3.30 23.2
3 chic 33 27.4 1987-01-03 NA 34.2 3.33 23.8
4 chic 29 28.6 1987-01-04 NA 47 4.38 30.4
5 chic 32 28.9 1987-01-05 NA NA 4.75 30.3
6 chic 40 35.1 1987-01-06 NA 48 5.83 25.8
7 chic 34.5 26.8 1987-01-07 NA 41 9.29 20.6
8 chic 29 22 1987-01-08 NA 36 11.3 17.0
9 chic 26.5 29 1987-01-09 NA 33.3 4.5 23.4
10 chic 32.5 27.8 1987-01-10 NA NA 4.96 19.5
# ℹ 6,930 more rows
Dplyr Functions in Depth
Let’s use this tibble to explore each of the Dplyr verbs in a little more depth.
select()
We use the select()
function to select columns you want to focus on.
For example, we can inspect the tmpd
column like so:
%>% select(tmpd) %>% head() chicago
tmpd |
---|
<dbl> |
31.5 |
33.0 |
33.0 |
29.0 |
32.0 |
40.0 |
In base R, we would do this:
head(chicago["tmpd"])
tmpd |
---|
<dbl> |
31.5 |
33.0 |
33.0 |
29.0 |
32.0 |
40.0 |
Now suppose we wanted to take the first \(3\) columns only.
We could use numerical indices in base R.
head(chicago[1:3])
city | tmpd | dewpoint |
---|---|---|
<chr> | <dbl> | <dbl> |
chic | 31.5 | 31.500 |
chic | 33.0 | 29.875 |
chic | 33.0 | 27.375 |
chic | 29.0 | 28.625 |
chic | 32.0 | 28.875 |
chic | 40.0 | 35.125 |
In Dplr, we can use the names directly to create a column range slice:
%>%
chicago select(city:dptp) %>%
head()
city | tmpd | dptp |
---|---|---|
<chr> | <dbl> | <dbl> |
chic | 31.5 | 31.500 |
chic | 33.0 | 29.875 |
chic | 33.0 | 27.375 |
chic | 29.0 | 28.625 |
chic | 32.0 | 28.875 |
chic | 40.0 | 35.125 |
Note the use of :
inside the select()
— you can use it to specify a range of variable names.
You can also omit variables using the select()
function by using the negative sign.
With select()
you can do:
%>%
chicago select(-(city:dptp)) %>%
head()
date | pm25tmean2 | pm10tmean2 | o3tmean2 | no2tmean2 |
---|---|---|---|---|
<date> | <dbl> | <dbl> | <dbl> | <dbl> |
1987-01-01 | NA | 34.00000 | 4.250000 | 19.98810 |
1987-01-02 | NA | NA | 3.304348 | 23.19099 |
1987-01-03 | NA | 34.16667 | 3.333333 | 23.81548 |
1987-01-04 | NA | 47.00000 | 4.375000 | 30.43452 |
1987-01-05 | NA | NA | 4.750000 | 30.33333 |
1987-01-06 | NA | 48.00000 | 5.833333 | 25.77233 |
This indicates that we should include every variable except the variables city
through dptp
.
The select()
function also allows a special syntax that allows you to specify variable names based on patterns.
If you wanted to keep every variable that ends with a “2”, we could do this:
%>%
chicago select(ends_with("2")) %>%
head()
pm25tmean2 | pm10tmean2 | o3tmean2 | no2tmean2 |
---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> |
NA | 34.00000 | 4.250000 | 19.98810 |
NA | NA | 3.304348 | 23.19099 |
NA | 34.16667 | 3.333333 | 23.81548 |
NA | 47.00000 | 4.375000 | 30.43452 |
NA | NA | 4.750000 | 30.33333 |
NA | 48.00000 | 5.833333 | 25.77233 |
Or if we wanted to keep every variable that starts with a “d”, we could do this:
%>%
chicago select(starts_with("d")) %>%
head()
dptp | date |
---|---|
<dbl> | <date> |
31.500 | 1987-01-01 |
29.875 | 1987-01-02 |
27.375 | 1987-01-03 |
28.625 | 1987-01-04 |
28.875 | 1987-01-05 |
35.125 | 1987-01-06 |
This is super useful if you have a table with prefixes, such as doc_
.
You can also use more general regular expressions if necessary.
Aside: Helper Functions
Note the use of the function ends_with()
inside of the argument space for the function select()
above.
ends_with()
is an example of a “helper function” — a function that helps make Dplyr functions more effective.
There are many of these introduced by Dplyr.
Here is a set of helper functions that can be especially useful with select()
:
Helper Function | Description | Example |
---|---|---|
starts_with() |
Matches variable names that start with a given prefix | select(data, starts_with("X")) |
ends_with() |
Matches variable names that end with a given suffix | select(data, ends_with("_2019")) |
contains() |
Matches variable names that contain a given string | select(data, contains("income")) |
matches() |
Matches variable names that match a given regular expression | select(data, matches("^X[1-3]$")) |
num_range() |
Matches variable names that match a pattern of prefix + number | select(data, num_range("X", 1:3)) |
one_of() |
Matches variable names that are in a given set of names | select(data, one_of(c("var1", "var2"))) |
everything() |
Matches all variable names (used to reorder variables) | select(data, starts_with("X"), everything()) |
filter()
The filter()
function extracts subsets of rows from a data frame.
Whereas select()
is a column-wise operation, filter()
is row-wise.
This function is similar to the existing subset()
function in R but is quite a bit faster.
Suppose we wanted to extract the rows of the chicago
data frame where the levels of PM2.5 are greater than 30 (which is a reasonably high level), we could do this:
%>%
chicago filter(pm25tmean2 > 30) %>%
head()
city | tmpd | dptp | date | pm25tmean2 | pm10tmean2 | o3tmean2 | no2tmean2 |
---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <date> | <dbl> | <dbl> | <dbl> | <dbl> |
chic | 23 | 21.9 | 1998-01-17 | 38.10 | 32.46154 | 3.180556 | 25.30000 |
chic | 28 | 25.8 | 1998-01-23 | 33.95 | 38.69231 | 1.750000 | 29.37630 |
chic | 55 | 51.3 | 1998-04-30 | 39.40 | 34.00000 | 10.786232 | 25.31310 |
chic | 59 | 53.7 | 1998-05-01 | 35.40 | 28.50000 | 14.295125 | 31.42905 |
chic | 57 | 52.0 | 1998-05-02 | 33.30 | 35.00000 | 20.662879 | 26.79861 |
chic | 57 | 56.0 | 1998-05-07 | 32.10 | 34.50000 | 24.270422 | 33.99167 |
Compare this to base R:
head(na.omit(chicago[(chicago$pm25tmean2 > 30),]))
city | tmpd | dptp | date | pm25tmean2 | pm10tmean2 | o3tmean2 | no2tmean2 |
---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <date> | <dbl> | <dbl> | <dbl> | <dbl> |
chic | 23 | 21.9 | 1998-01-17 | 38.10 | 32.46154 | 3.180556 | 25.30000 |
chic | 28 | 25.8 | 1998-01-23 | 33.95 | 38.69231 | 1.750000 | 29.37630 |
chic | 55 | 51.3 | 1998-04-30 | 39.40 | 34.00000 | 10.786232 | 25.31310 |
chic | 59 | 53.7 | 1998-05-01 | 35.40 | 28.50000 | 14.295125 | 31.42905 |
chic | 57 | 52.0 | 1998-05-02 | 33.30 | 35.00000 | 20.662879 | 26.79861 |
chic | 57 | 56.0 | 1998-05-07 | 32.10 | 34.50000 | 24.270422 | 33.99167 |
Note that we have to filter out the NA
values, which Dplyr does for you.
Note also that column names in Dply do not need to be prefixed by the data frame name; they are called directly.
For example, compare how the column pm25tmean2
is accessed below:
%>% filter(pm25tmean2 > 30) chicago
vs
$pm25tmean2 > 30),] chicago[(chicago
We can place an arbitrarily complex logical sequence inside of filter()
.
For example, we may extract the rows where PM2.5 is greater than \(30\) and temperature is greater than \(80\) degrees Fahrenheit.
%>%
chicago filter(pm25tmean2 > 30 & tmpd > 80) %>%
select(date, tmpd, pm25tmean2) %>%
head()
date | tmpd | pm25tmean2 |
---|---|---|
<date> | <dbl> | <dbl> |
1998-08-23 | 81 | 39.6000 |
1998-09-06 | 81 | 31.5000 |
2001-07-20 | 82 | 32.3000 |
2001-08-01 | 84 | 43.7000 |
2001-08-08 | 85 | 38.8375 |
2001-08-09 | 84 | 38.2000 |
Here’s a table of helper functions specifically associated with filter()
:
Helper Function | Description | Example |
---|---|---|
between() |
Check if a value lies between two other values. | filter(data, between(age, 18, 30)) |
near() |
Check for near equality (useful for floating-point numbers). | filter(data, near(price, 19.99, tolerance = 0.01)) |
isin() (or %in% ) |
Check if a value is in a set of values. | filter(data, region %in% c("East", "West")) |
all_of() |
Check for all of the variables, even if some don’t exist. Useful with character vectors. | filter(data, all_of(variables)) |
any_of() |
Check for any of the variables. Useful with character vectors. | filter(data, any_of(variables)) |
cur_group_id() |
Identify the current group inside group_by() . Useful for more complex filtering with groups. |
filter(data, cur_group_id() == 1) |
arrange()
We use arrange()
to reorder rows of a data frame according to one or more of the columns.
This normally called sorting.
Reordering rows of a data frame (while preserving corresponding order of other columns) is normally a pain to do in R.
The arrange()
function simplifies the process quite a bit.
Here we order the rows of the data frame by date in ascending order:
<- chicago %>%
chicago_by_date arrange(date)
%>% head() chicago_by_date
city | tmpd | dptp | date | pm25tmean2 | pm10tmean2 | o3tmean2 | no2tmean2 |
---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <date> | <dbl> | <dbl> | <dbl> | <dbl> |
chic | 31.5 | 31.500 | 1987-01-01 | NA | 34.00000 | 4.250000 | 19.98810 |
chic | 33.0 | 29.875 | 1987-01-02 | NA | NA | 3.304348 | 23.19099 |
chic | 33.0 | 27.375 | 1987-01-03 | NA | 34.16667 | 3.333333 | 23.81548 |
chic | 29.0 | 28.625 | 1987-01-04 | NA | 47.00000 | 4.375000 | 30.43452 |
chic | 32.0 | 28.875 | 1987-01-05 | NA | NA | 4.750000 | 30.33333 |
chic | 40.0 | 35.125 | 1987-01-06 | NA | 48.00000 | 5.833333 | 25.77233 |
We can now check the first few rows
%>%
chicago_by_date select(date, pm25tmean2) %>%
head(3)
date | pm25tmean2 |
---|---|
<date> | <dbl> |
1987-01-01 | NA |
1987-01-02 | NA |
1987-01-03 | NA |
and the last few rows.
%>%
chicago_by_date select(date, pm25tmean2) %>%
tail(3)
date | pm25tmean2 |
---|---|
<date> | <dbl> |
2005-12-29 | 7.45000 |
2005-12-30 | 15.05714 |
2005-12-31 | 15.00000 |
Columns can be arranged in descending order too by using the desc()
helper function.
%>%
chicago arrange(desc(date)) %>%
head()
city | tmpd | dptp | date | pm25tmean2 | pm10tmean2 | o3tmean2 | no2tmean2 |
---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <date> | <dbl> | <dbl> | <dbl> | <dbl> |
chic | 35 | 30.1 | 2005-12-31 | 15.00000 | 23.5 | 2.531250 | 13.25000 |
chic | 36 | 31.0 | 2005-12-30 | 15.05714 | 19.2 | 3.034420 | 22.80556 |
chic | 35 | 29.4 | 2005-12-29 | 7.45000 | 23.5 | 6.794837 | 19.97222 |
chic | 37 | 34.5 | 2005-12-28 | 17.75000 | 27.5 | 3.260417 | 19.28563 |
chic | 40 | 33.6 | 2005-12-27 | 23.56000 | 27.0 | 4.468750 | 23.50000 |
chic | 35 | 29.6 | 2005-12-26 | 8.40000 | 8.5 | 14.041667 | 16.81944 |
Here’s a table of helper functions specifically associated with arrange()
:
Helper Function | Description | Example |
---|---|---|
desc() |
Order by descending instead of default ascending. | arrange(data, desc(salary)) |
across() |
Apply to multiple columns, especially with c_across() in dplyr 1.0.0 and later. |
arrange(data, across(starts_with("X"))) |
row_number() |
Create an index based on the order of rows. | mutate(data, idx = row_number()) |
Note that row_number()
is often seen in mutate()
for creating an index column based on ordering.
The across()
function provides a more concise syntax for working with multiple columns.
It is useful in arrange()
when you’re trying to order rows based on a selection of columns that match a certain criterion.
Most of the magic with arrange()
happens through the way you structure its arguments and not necessarily through a large set of helper functions.
rename()
Renaming a variable in a data frame in R is surprisingly hard to do!
The rename()
function is designed to make this process easier.
Here we select the names of the first five variables in the chicago
data frame.
%>% select(1:5) %>% head(3) chicago
city | tmpd | dptp | date | pm25tmean2 |
---|---|---|---|---|
<chr> | <dbl> | <dbl> | <date> | <dbl> |
chic | 31.5 | 31.500 | 1987-01-01 | NA |
chic | 33.0 | 29.875 | 1987-01-02 | NA |
chic | 33.0 | 27.375 | 1987-01-03 | NA |
The dptp
column represents the dew point temperature and the pm25tmean2
column the PM2.5 data.
These names are cryptic, so let’s rename them.
<- chicago %>%
chicago rename(dewpoint = dptp, pm25 = pm25tmean2)
%>%
chicago head(3)
city | tmpd | dewpoint | date | pm25 | pm10tmean2 | o3tmean2 | no2tmean2 |
---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <date> | <dbl> | <dbl> | <dbl> | <dbl> |
chic | 31.5 | 31.500 | 1987-01-01 | NA | 34.00000 | 4.250000 | 19.98810 |
chic | 33.0 | 29.875 | 1987-01-02 | NA | NA | 3.304348 | 23.19099 |
chic | 33.0 | 27.375 | 1987-01-03 | NA | 34.16667 | 3.333333 | 23.81548 |
The syntax inside the rename()
function is to have the new name on the left-hand side of the =
sign and the old name on the right-hand side.
Here’s a table of helper functions that can be used with rename()
:
Helper Function | Description | Example |
---|---|---|
starts_with() |
Match columns that start with a given prefix. | rename(data, start_X = starts_with("X")) (Note: This example would rename the first column that starts with “X” to “start_X”) |
ends_with() |
Match columns that end with a given suffix. | rename(data, end_year = ends_with("_2021")) |
contains() |
Match columns that contain a given string. | rename(data, has_rate = contains("rate")) |
matches() |
Match columns that match a given regular expression. | rename(data, match_num = matches("^X[1-3]$")) |
everything() |
Match all columns. | Rarely used with rename() , but possible. |
It’s worth noting that these helper functions will typically target the first column that meets the specified criterion.
If there are multiple columns that match the criterion (e.g., multiple columns that start with “X”), you’ll need to use these helpers with more precision or utilize other techniques.
Also, remember to always refer to the official Dplyr documentation as the package continues to evolve and new functionalities might be added over time.
mutate()
The mutate()
function creates new columns based on computations usually involving the other columns.
For example, with air pollution data, we often want to de-trend the data by subtracting the mean from the data.
This allows us to see whether a given day’s air pollution level is higher than or less than average.
Here we create a pm25detrend
variable that subtracts the mean from the pm25
variable.
We also compute the Z-score.
<- chicago %>%
chicago mutate(pm25detrend = pm25 - mean(pm25, na.rm = TRUE),
pm25z = pm25detrend / sd(pm25, na.rm = TRUE))
%>% tail() chicago
city | tmpd | dewpoint | date | pm25 | pm10tmean2 | o3tmean2 | no2tmean2 | pm25detrend | pm25z |
---|---|---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <date> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
chic | 35 | 29.6 | 2005-12-26 | 8.40000 | 8.5 | 14.041667 | 16.81944 | -7.830958 | -0.9003429 |
chic | 40 | 33.6 | 2005-12-27 | 23.56000 | 27.0 | 4.468750 | 23.50000 | 7.329042 | 0.8426366 |
chic | 37 | 34.5 | 2005-12-28 | 17.75000 | 27.5 | 3.260417 | 19.28563 | 1.519042 | 0.1746477 |
chic | 35 | 29.4 | 2005-12-29 | 7.45000 | 23.5 | 6.794837 | 19.97222 | -8.780958 | -1.0095666 |
chic | 36 | 31.0 | 2005-12-30 | 15.05714 | 19.2 | 3.034420 | 22.80556 | -1.173815 | -0.1349561 |
chic | 35 | 30.1 | 2005-12-31 | 15.00000 | 23.5 | 2.531250 | 13.25000 | -1.230958 | -0.1415260 |
Note how the new column name pm25detrend
can be used immediately within the argument space of mutate()
.
Dplyr also has a related transmute()
function.
It does the same thing as mutate()
but then drops all non-transformed variables.
Here we de-trend the PM10 and ozone (O3) variables.
<- chicago %>%
chicago_detrended transmute(pm10detrend = pm10tmean2 - mean(pm10tmean2, na.rm = TRUE),
o3detrend = o3tmean2 - mean(o3tmean2, na.rm = TRUE),
o3z = o3detrend / sd(o3tmean2, na.rm = TRUE))
%>% head() chicago_detrended
pm10detrend | o3detrend | o3z |
---|---|---|
<dbl> | <dbl> | <dbl> |
0.1047939 | -15.18551 | -1.333702 |
NA | -16.13117 | -1.416756 |
0.2714605 | -16.10218 | -1.414211 |
13.1047939 | -15.06051 | -1.322724 |
NA | -14.68551 | -1.289789 |
14.1047939 | -13.60218 | -1.194642 |
Note that there are only three columns in the transmuted data frame.
Here’s a table of helper functions specifically associated with mutate()
:
Helper Function | Description | Example |
---|---|---|
if_else() |
A vectorized conditional operation (if-else). | mutate(data, status = if_else(age < 18, "minor", "adult")) |
case_when() |
Generalized multiple conditions for transformations. | mutate(data, category = case_when(score > 90 ~ "A", score > 80 ~ "B", TRUE ~ "C")) |
lead() |
Access the subsequent value in the column. | mutate(data, next_value = lead(column_name)) |
lag() |
Access the preceding value in the column. | mutate(data, previous_value = lag(column_name)) |
row_number() |
Generate row numbers, often used with arrange() . |
mutate(data, rank = row_number()) |
rank() , min_rank() , dense_rank() |
Different types of ranking functions. | mutate(data, rank = rank(score)) |
cumsum() , cumprod() , cummin() , cummax() |
Cumulative functions. | mutate(data, cumulative_total = cumsum(total)) |
coalesce() |
Return the first non-missing value among its arguments. | mutate(data, combined = coalesce(col1, col2)) |
na_if() |
Replace a specified value with NA. | mutate(data, updated_col = na_if(column_name, "replace_me")) |
replace_na() |
Replace NA values with a specified value. | mutate(data, updated_col = replace_na(column_name, "new_value")) |
recode() |
Recode or replace values in a column. | mutate(data, recoded_col = recode(column_name, old="new", ...)) |
across() |
Apply a function across multiple columns (from dplyr 1.0.0). |
mutate(data, across(c(col1, col2), function_name)) |
group_by()
The group_by()
function is used to generate summary or aggregate statistics from a data frame.
In conjunction with the group_by()
function we often use the summarize()
function.
The general operation here is a combination of:
- splitting a table into separate pieces defined by a variable or group of variables (
group_by()
), - then applying a summary function across those subsets (
summarize()
), - then combining back into another table.
This pattern is called split, apply, and combine.
It is a fundamental pattern in data processing.
For example, in this air pollution dataset, you might want to know the average annual level of PM2.5.
The stratum is the year, which we can derive from the date
variable.
First, we can create a year
variable using as.POSIXlt()
.
Then we group on our new column and compute summary statistics for each year with the summarize()
function.
<- chicago %>%
years mutate(year = as.POSIXlt(date)$year + 1900) %>%
group_by(year) %>%
summarize(pm25 = mean(pm25, na.rm = TRUE),
o3 = max(o3tmean2, na.rm = TRUE),
no2 = median(no2tmean2, na.rm = TRUE), .groups = "drop")
%>% head() years
year | pm25 | o3 | no2 |
---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> |
1987 | NaN | 62.96966 | 23.49369 |
1988 | NaN | 61.67708 | 24.52296 |
1989 | NaN | 59.72727 | 26.14062 |
1990 | NaN | 52.22917 | 22.59583 |
1991 | NaN | 63.10417 | 21.38194 |
1992 | NaN | 50.82870 | 24.78921 |
To see why we add \(1900\) in the preceding mutation, note that the date function returns just the last two digits of the year.
as.POSIXlt(chicago$date)$year %>% head()
- 87
- 87
- 87
- 87
- 87
- 87
Note in the process, we implicitly created a separate data frame that splits the original data frame by year, which is then passed to summarize()
.
summarize()
then returns a data frame with columns for year
and the annual averages of pm25
, o3
, and no2
.
In a slightly more complicated example, we might want to know what are the average levels of ozone o3
and nitrogen dioxide no2
within quantiles of pm25
.
We could do this with a regression model, but we can also do this quickly with group_by()
and summarize()
.
First, we can create a categorical variable of pm25
divided into quantiles.
<- quantile(chicago$pm25, seq(0, 1, 0.2), na.rm = TRUE)
qq print(qq)
0% 20% 40% 60% 80% 100%
1.700 8.700 12.375 16.700 22.610 61.500
Then we group the data frame by the pm25.quint
variable and compute the mean of o3
and no2
within quantiles of pm25
.
%>%
chicago mutate(pm25.quint = cut(pm25, qq)) %>%
group_by(pm25.quint) %>%
summarize(o3 = mean(o3tmean2, na.rm = TRUE),
no2 = mean(no2tmean2, na.rm = TRUE),
.groups = "drop")
pm25.quint | o3 | no2 |
---|---|---|
<fct> | <dbl> | <dbl> |
(1.7,8.7] | 21.66401 | 17.99129 |
(8.7,12.4] | 20.38248 | 22.13004 |
(12.4,16.7] | 20.66160 | 24.35708 |
(16.7,22.6] | 19.88122 | 27.27132 |
(22.6,61.5] | 20.31775 | 29.64427 |
NA | 18.79044 | 25.77585 |
From the table, it seems there isn’t a strong relationship between pm25
and o3
, but there appears to be a positive correlation between pm25
and no2
.
More sophisticated statistical modeling can help to provide precise answers to these questions, but a simple application of Dplyr functions can often get you most of the way there.
Here is another example:
%>%
chicago mutate(month = as.POSIXlt(date)$mon + 1) %>%
group_by(month) %>% # SPLIT
summarize(pm25 = mean(pm25, na.rm = TRUE), # APPLY
o3 = max(o3tmean2, na.rm = TRUE),
no2 = median(no2tmean2, na.rm = TRUE),
.groups = "drop")
# COMBINE (in the result)
month | pm25 | o3 | no2 |
---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> |
1 | 17.76996 | 28.22222 | 25.35417 |
2 | 20.37513 | 37.37500 | 26.78034 |
3 | 17.40818 | 39.05000 | 26.76984 |
4 | 13.85879 | 47.94907 | 25.03125 |
5 | 14.07420 | 52.75000 | 24.22222 |
6 | 15.86461 | 66.58750 | 25.01140 |
7 | 16.57087 | 59.54167 | 22.38442 |
8 | 16.93380 | 53.96701 | 22.98333 |
9 | 15.91279 | 57.48864 | 24.47917 |
10 | 14.23557 | 47.09275 | 24.15217 |
11 | 15.15794 | 29.45833 | 23.56537 |
12 | 17.52221 | 27.70833 | 24.45773 |
Here’s a table of helper functions specifically associated with group_by()
:
Helper Function | Description | Example |
---|---|---|
add_tally() |
Adds a column with the count of each group (equivalent to adding a count with mutate() after a group_by() ). |
group_by(data, group_var) %>% add_tally() |
group_size() |
Returns the size of each group. | group_by(data, group_var) %>% summarise(size = group_size()) |
ntile() |
Creates rank percentiles within groups. | group_by(data, group_var) %>% mutate(perc = ntile(score, 4)) |
cur_group_id() |
Gives an integer identifier for the current group. | group_by(data, group_var) %>% filter(cur_group_id() == 1) |
cur_group_rows() |
Provides row indices for the current group. | Used internally, but can be useful for advanced operations. |
group_map() , group_modify() , group_walk() |
Functions to iterate over groups, modify them or walk them with side effects. | group_map(data, ~ head(.x, 2)) shows the first two rows of each group. |
It’s worth noting that group_by()
often works in tandem with other verbs, so the real power comes from the combined operations.
For example, after grouping data, one often uses summarise()
to calculate summary statistics for each group.
The combination of these functions and their associated helpers allow for a wide range of group-wise data manipulation tasks.
Based on Peng 2022