NB: Dplyr

Programming for Data Science

What is Dplyr?

The Dplyr package is a cetnral component of the Tidyverse.

Technically, Dplyr is an optimized and distilled version of the previous Plyr package.

Plyr organizes much of the functionality of the family of apply functions in base R — functions like apply(), lapply(), sapply(), and tapply().

These functions perform the same action on multiple chunks of data in a repetitive way.

For example, apply() applies a function to every element of a row or column in a dataset.

Why Dplyr?

Dplyr was developed to provide a clean and intelligble way to process data tables.

It does this by providing a simple grammar for data manipulation and for operating on data frames.

With this grammar, you can clearly express and communicate what it is that you are doing to a data frame that other people can understand.

This is useful because it provides an abstraction for data manipulation that previously did not exist within R and other programming languages.

Another useful contribution is that the Dplyr functions are very fast, as many key operations are coded in C++.

Note that Dplyr does not provide any new functionality to R per se.

Everything Dplyr does may already be done with base R.

But it greatly simplifies existing functionality in R.

In the table below, you can see that that Dplyr provides a consistent way of phrasing the someone chaotic idioms of base R.

dplyr	base R
`arrange(df, x)`	`df[order(x), , drop = FALSE]`
`distinct(df, x)`	`df[!duplicated(x), , drop = FALSE], unique()`
`filter(df, x)`	`df[which(x), , drop = FALSE], subset()`
`mutate(df, z = x + y)`	`df$z <- df$x + df$y, transform()`
`pull(df, 1)`	`df[[1]]`
`pull(df, x)`	`df$x`
`rename(df, y = x)`	`names(df)[names(df) == "x"] <- "y"`
`relocate(df, y)`	`df[union("y", names(df))]`
`select(df, x, y)`	`df[c("x", "y")], subset()`
`select(df, starts_with("x"))`	`df[grepl("`
`summarise(df, mean(x))`	`mean(df$x), tapply(), aggregate(), by()`
`slice(df, c(1, 2, 5))`	`df[c(1, 2, 5), , drop = FALSE]`

A Grammar of Data

Some of the key “verbs” provided by the Dplyr package are:

select: return a subset of the columns of a data frame, using a flexible notation.
filter: extract a subset of rows from a data frame based on logical conditions.
arrange: reorder rows of a. data frame.
rename: rename variables in a data frame.
mutate: add new variables/columns or transform existing variables.
summarise / summarize: generate summary statistics of different variables in the data frame, possibly within strata.

%>%: the pipe operator is used to connect multiple verb actions together into a pipeline.

Common Function Properties

All Dplyr functions have some common characteristics:

The first argument is a data frame (or tibble).
The succeeding arguments describe what to do with the data frame specified in the first argument.
The return result of a function is a new data frame.

In addition, Dplyr expects data frames to be be properly structured as tidy data.

Installing Dplyr

You can install Dplyr directly or as part of Tidyverse, which we installed in the previous notebook.

# install.packages("dplyr")

Once the package is installed, you may load it into your R session by calling the library() function.

# library(dplyr)

Of course, you can also just install tidyverse if you plan to visualize your results.

Since we did this in our previous notebook, we can import it now.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

You may get some warnings when the package is loaded because there are functions in the dplyr package that have the same name as functions in other packages. For now you can ignore the warnings.

Getting a dataset

Let’s use a dataset containing air pollution and temperature data for the city of Chicago in the U.S.

You will find the data in the directory for this module.

We load the data into R using the readRDS() function and then convert it to a tibble.

chicago <- readRDS("chicago.rds") %>% as_tibble()

We can see some basic characteristics of the dataset with the print() function.

print(chicago)

# A tibble: 6,940 × 8
   city   tmpd  dptp date       pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
   <chr> <dbl> <dbl> <date>          <dbl>      <dbl>    <dbl>     <dbl>
 1 chic   31.5  31.5 1987-01-01         NA       34       4.25      20.0
 2 chic   33    29.9 1987-01-02         NA       NA       3.30      23.2
 3 chic   33    27.4 1987-01-03         NA       34.2     3.33      23.8
 4 chic   29    28.6 1987-01-04         NA       47       4.38      30.4
 5 chic   32    28.9 1987-01-05         NA       NA       4.75      30.3
 6 chic   40    35.1 1987-01-06         NA       48       5.83      25.8
 7 chic   34.5  26.8 1987-01-07         NA       41       9.29      20.6
 8 chic   29    22   1987-01-08         NA       36      11.3       17.0
 9 chic   26.5  29   1987-01-09         NA       33.3     4.5       23.4
10 chic   32.5  27.8 1987-01-10         NA       NA       4.96      19.5
# ℹ 6,930 more rows

Dplyr Functions in Depth

Let’s use this tibble to explore each of the Dplyr verbs in a little more depth.

`select()`

We use the select() function to select columns you want to focus on.

For example, we can inspect the tmpd column like so:

chicago %>% select(tmpd) %>% head()

A tibble: 6 × 1
tmpd
<dbl>
31.5
33.0
33.0
29.0
32.0
40.0

In base R, we would do this:

head(chicago["tmpd"])

A tibble: 6 × 1
tmpd
<dbl>
31.5
33.0
33.0
29.0
32.0
40.0

Now suppose we wanted to take the first $3$ columns only.

We could use numerical indices in base R.

head(chicago[1:3])

A tibble: 6 × 3
city	tmpd	dewpoint
<chr>	<dbl>	<dbl>
chic	31.5	31.500
chic	33.0	29.875
chic	33.0	27.375
chic	29.0	28.625
chic	32.0	28.875
chic	40.0	35.125

In Dplr, we can use the names directly to create a column range slice:

chicago %>%
  select(city:dptp) %>%
  head()

A tibble: 6 × 3
city	tmpd	dptp
<chr>	<dbl>	<dbl>
chic	31.5	31.500
chic	33.0	29.875
chic	33.0	27.375
chic	29.0	28.625
chic	32.0	28.875
chic	40.0	35.125

Note the use of : inside the select() — you can use it to specify a range of variable names.

You can also omit variables using the select() function by using the negative sign.

With select() you can do:

chicago %>%
    select(-(city:dptp)) %>% 
    head()

A tibble: 6 × 5
date	pm25tmean2	pm10tmean2	o3tmean2	no2tmean2
<date>	<dbl>	<dbl>	<dbl>	<dbl>
1987-01-01	NA	34.00000	4.250000	19.98810
1987-01-02	NA	NA	3.304348	23.19099
1987-01-03	NA	34.16667	3.333333	23.81548
1987-01-04	NA	47.00000	4.375000	30.43452
1987-01-05	NA	NA	4.750000	30.33333
1987-01-06	NA	48.00000	5.833333	25.77233

This indicates that we should include every variable except the variables city through dptp.

The select() function also allows a special syntax that allows you to specify variable names based on patterns.

If you wanted to keep every variable that ends with a “2”, we could do this:

chicago %>%
  select(ends_with("2")) %>%
  head()

A tibble: 6 × 4
pm25tmean2	pm10tmean2	o3tmean2	no2tmean2
<dbl>	<dbl>	<dbl>	<dbl>
NA	34.00000	4.250000	19.98810
NA	NA	3.304348	23.19099
NA	34.16667	3.333333	23.81548
NA	47.00000	4.375000	30.43452
NA	NA	4.750000	30.33333
NA	48.00000	5.833333	25.77233

Or if we wanted to keep every variable that starts with a “d”, we could do this:

chicago %>%
  select(starts_with("d")) %>%
  head()

A tibble: 6 × 2
dptp	date
<dbl>	<date>
31.500	1987-01-01
29.875	1987-01-02
27.375	1987-01-03
28.625	1987-01-04
28.875	1987-01-05
35.125	1987-01-06

This is super useful if you have a table with prefixes, such as doc_.

You can also use more general regular expressions if necessary.

Aside: Helper Functions

Note the use of the function ends_with() inside of the argument space for the function select() above.

ends_with() is an example of a “helper function” — a function that helps make Dplyr functions more effective.

There are many of these introduced by Dplyr.

Here is a set of helper functions that can be especially useful with select():

Helper Function	Description	Example
`starts_with()`	Matches variable names that start with a given prefix	`select(data, starts_with("X"))`
`ends_with()`	Matches variable names that end with a given suffix	`select(data, ends_with("_2019"))`
`contains()`	Matches variable names that contain a given string	`select(data, contains("income"))`
`matches()`	Matches variable names that match a given regular expression	`select(data, matches("^X[1-3]$"))`
`num_range()`	Matches variable names that match a pattern of prefix + number	`select(data, num_range("X", 1:3))`
`one_of()`	Matches variable names that are in a given set of names	`select(data, one_of(c("var1", "var2")))`
`everything()`	Matches all variable names (used to reorder variables)	`select(data, starts_with("X"), everything())`

`filter()`

The filter() function extracts subsets of rows from a data frame.

Whereas select() is a column-wise operation, filter() is row-wise.

This function is similar to the existing subset() function in R but is quite a bit faster.

Suppose we wanted to extract the rows of the chicago data frame where the levels of PM2.5 are greater than 30 (which is a reasonably high level), we could do this:

chicago %>%
  filter(pm25tmean2 > 30) %>%
  head()

A tibble: 6 × 8
city	tmpd	dptp	date	pm25tmean2	pm10tmean2	o3tmean2	no2tmean2
<chr>	<dbl>	<dbl>	<date>	<dbl>	<dbl>	<dbl>	<dbl>
chic	23	21.9	1998-01-17	38.10	32.46154	3.180556	25.30000
chic	28	25.8	1998-01-23	33.95	38.69231	1.750000	29.37630
chic	55	51.3	1998-04-30	39.40	34.00000	10.786232	25.31310
chic	59	53.7	1998-05-01	35.40	28.50000	14.295125	31.42905
chic	57	52.0	1998-05-02	33.30	35.00000	20.662879	26.79861
chic	57	56.0	1998-05-07	32.10	34.50000	24.270422	33.99167

Compare this to base R:

head(na.omit(chicago[(chicago$pm25tmean2 > 30),]))

A tibble: 6 × 8
city	tmpd	dptp	date	pm25tmean2	pm10tmean2	o3tmean2	no2tmean2
<chr>	<dbl>	<dbl>	<date>	<dbl>	<dbl>	<dbl>	<dbl>
chic	23	21.9	1998-01-17	38.10	32.46154	3.180556	25.30000
chic	28	25.8	1998-01-23	33.95	38.69231	1.750000	29.37630
chic	55	51.3	1998-04-30	39.40	34.00000	10.786232	25.31310
chic	59	53.7	1998-05-01	35.40	28.50000	14.295125	31.42905
chic	57	52.0	1998-05-02	33.30	35.00000	20.662879	26.79861
chic	57	56.0	1998-05-07	32.10	34.50000	24.270422	33.99167

Note that we have to filter out the NA values, which Dplyr does for you.

Note also that column names in Dply do not need to be prefixed by the data frame name; they are called directly.

For example, compare how the column pm25tmean2 is accessed below:

chicago %>% filter(pm25tmean2 > 30)

chicago[(chicago$pm25tmean2 > 30),]

We can place an arbitrarily complex logical sequence inside of filter().

For example, we may extract the rows where PM2.5 is greater than $30$ and temperature is greater than $80$ degrees Fahrenheit.

chicago %>%
    filter(pm25tmean2 > 30 & tmpd > 80) %>%
    select(date, tmpd, pm25tmean2) %>%
    head()

A tibble: 6 × 3
date	tmpd	pm25tmean2
<date>	<dbl>	<dbl>
1998-08-23	81	39.6000
1998-09-06	81	31.5000
2001-07-20	82	32.3000
2001-08-01	84	43.7000
2001-08-08	85	38.8375
2001-08-09	84	38.2000

Here’s a table of helper functions specifically associated with filter():

Helper Function	Description	Example
`between()`	Check if a value lies between two other values.	`filter(data, between(age, 18, 30))`
`near()`	Check for near equality (useful for floating-point numbers).	`filter(data, near(price, 19.99, tolerance = 0.01))`
`isin()` (or `%in%`)	Check if a value is in a set of values.	`filter(data, region %in% c("East", "West"))`
`all_of()`	Check for all of the variables, even if some don’t exist. Useful with character vectors.	`filter(data, all_of(variables))`
`any_of()`	Check for any of the variables. Useful with character vectors.	`filter(data, any_of(variables))`
`cur_group_id()`	Identify the current group inside `group_by()`. Useful for more complex filtering with groups.	`filter(data, cur_group_id() == 1)`

`arrange()`

We use arrange() to reorder rows of a data frame according to one or more of the columns.

This normally called sorting.

Reordering rows of a data frame (while preserving corresponding order of other columns) is normally a pain to do in R.

The arrange() function simplifies the process quite a bit.

Here we order the rows of the data frame by date in ascending order:

chicago_by_date <- chicago %>%
    arrange(date)

chicago_by_date %>% head()

A tibble: 6 × 8
city	tmpd	dptp	date	pm25tmean2	pm10tmean2	o3tmean2	no2tmean2
<chr>	<dbl>	<dbl>	<date>	<dbl>	<dbl>	<dbl>	<dbl>
chic	31.5	31.500	1987-01-01	NA	34.00000	4.250000	19.98810
chic	33.0	29.875	1987-01-02	NA	NA	3.304348	23.19099
chic	33.0	27.375	1987-01-03	NA	34.16667	3.333333	23.81548
chic	29.0	28.625	1987-01-04	NA	47.00000	4.375000	30.43452
chic	32.0	28.875	1987-01-05	NA	NA	4.750000	30.33333
chic	40.0	35.125	1987-01-06	NA	48.00000	5.833333	25.77233

We can now check the first few rows

chicago_by_date %>%
  select(date, pm25tmean2) %>%
  head(3)

A tibble: 3 × 2
date	pm25tmean2
<date>	<dbl>
1987-01-01	NA
1987-01-02	NA
1987-01-03	NA

and the last few rows.

chicago_by_date %>%
  select(date, pm25tmean2) %>%
  tail(3)

A tibble: 3 × 2
date	pm25tmean2
<date>	<dbl>
2005-12-29	7.45000
2005-12-30	15.05714
2005-12-31	15.00000

Columns can be arranged in descending order too by using the desc() helper function.

chicago %>%
  arrange(desc(date)) %>%
  head()

A tibble: 6 × 8
city	tmpd	dptp	date	pm25tmean2	pm10tmean2	o3tmean2	no2tmean2
<chr>	<dbl>	<dbl>	<date>	<dbl>	<dbl>	<dbl>	<dbl>
chic	35	30.1	2005-12-31	15.00000	23.5	2.531250	13.25000
chic	36	31.0	2005-12-30	15.05714	19.2	3.034420	22.80556
chic	35	29.4	2005-12-29	7.45000	23.5	6.794837	19.97222
chic	37	34.5	2005-12-28	17.75000	27.5	3.260417	19.28563
chic	40	33.6	2005-12-27	23.56000	27.0	4.468750	23.50000
chic	35	29.6	2005-12-26	8.40000	8.5	14.041667	16.81944

Here’s a table of helper functions specifically associated with arrange():

Helper Function	Description	Example
`desc()`	Order by descending instead of default ascending.	`arrange(data, desc(salary))`
`across()`	Apply to multiple columns, especially with `c_across()` in `dplyr` 1.0.0 and later.	`arrange(data, across(starts_with("X")))`
`row_number()`	Create an index based on the order of rows.	`mutate(data, idx = row_number())`

Note that row_number() is often seen in mutate() for creating an index column based on ordering.

The across() function provides a more concise syntax for working with multiple columns.

It is useful in arrange() when you’re trying to order rows based on a selection of columns that match a certain criterion.

Most of the magic with arrange() happens through the way you structure its arguments and not necessarily through a large set of helper functions.

`rename()`

Renaming a variable in a data frame in R is surprisingly hard to do!

The rename() function is designed to make this process easier.

Here we select the names of the first five variables in the chicago data frame.

chicago %>% select(1:5) %>% head(3)

A tibble: 3 × 5
city	tmpd	dptp	date	pm25tmean2
<chr>	<dbl>	<dbl>	<date>	<dbl>
chic	31.5	31.500	1987-01-01	NA
chic	33.0	29.875	1987-01-02	NA
chic	33.0	27.375	1987-01-03	NA

The dptp column represents the dew point temperature and the pm25tmean2 column the PM2.5 data.

These names are cryptic, so let’s rename them.

chicago <- chicago %>%
  rename(dewpoint = dptp, pm25 = pm25tmean2) 

chicago %>%  
  head(3)

A tibble: 3 × 8
city	tmpd	dewpoint	date	pm25	pm10tmean2	o3tmean2	no2tmean2
<chr>	<dbl>	<dbl>	<date>	<dbl>	<dbl>	<dbl>	<dbl>
chic	31.5	31.500	1987-01-01	NA	34.00000	4.250000	19.98810
chic	33.0	29.875	1987-01-02	NA	NA	3.304348	23.19099
chic	33.0	27.375	1987-01-03	NA	34.16667	3.333333	23.81548

The syntax inside the rename() function is to have the new name on the left-hand side of the = sign and the old name on the right-hand side.

Here’s a table of helper functions that can be used with rename():

Helper Function	Description	Example
`starts_with()`	Match columns that start with a given prefix.	`rename(data, start_X = starts_with("X"))` (Note: This example would rename the first column that starts with “X” to “start_X”)
`ends_with()`	Match columns that end with a given suffix.	`rename(data, end_year = ends_with("_2021"))`
`contains()`	Match columns that contain a given string.	`rename(data, has_rate = contains("rate"))`
`matches()`	Match columns that match a given regular expression.	`rename(data, match_num = matches("^X[1-3]$"))`
`everything()`	Match all columns.	Rarely used with `rename()`, but possible.

It’s worth noting that these helper functions will typically target the first column that meets the specified criterion.

If there are multiple columns that match the criterion (e.g., multiple columns that start with “X”), you’ll need to use these helpers with more precision or utilize other techniques.

Also, remember to always refer to the official Dplyr documentation as the package continues to evolve and new functionalities might be added over time.

`mutate()`

The mutate() function creates new columns based on computations usually involving the other columns.

For example, with air pollution data, we often want to de-trend the data by subtracting the mean from the data.

This allows us to see whether a given day’s air pollution level is higher than or less than average.

Here we create a pm25detrend variable that subtracts the mean from the pm25 variable.

We also compute the Z-score.

chicago <- chicago %>%
    mutate(pm25detrend = pm25 - mean(pm25, na.rm = TRUE),
       pm25z = pm25detrend / sd(pm25, na.rm = TRUE))

chicago %>% tail()

A tibble: 6 × 10
city	tmpd	dewpoint	date	pm25	pm10tmean2	o3tmean2	no2tmean2	pm25detrend	pm25z
<chr>	<dbl>	<dbl>	<date>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
chic	35	29.6	2005-12-26	8.40000	8.5	14.041667	16.81944	-7.830958	-0.9003429
chic	40	33.6	2005-12-27	23.56000	27.0	4.468750	23.50000	7.329042	0.8426366
chic	37	34.5	2005-12-28	17.75000	27.5	3.260417	19.28563	1.519042	0.1746477
chic	35	29.4	2005-12-29	7.45000	23.5	6.794837	19.97222	-8.780958	-1.0095666
chic	36	31.0	2005-12-30	15.05714	19.2	3.034420	22.80556	-1.173815	-0.1349561
chic	35	30.1	2005-12-31	15.00000	23.5	2.531250	13.25000	-1.230958	-0.1415260

Note how the new column name pm25detrend can be used immediately within the argument space of mutate().

Dplyr also has a related transmute() function.

It does the same thing as mutate() but then drops all non-transformed variables.

Here we de-trend the PM10 and ozone (O3) variables.

chicago_detrended <- chicago %>%
  transmute(pm10detrend = pm10tmean2 - mean(pm10tmean2, na.rm = TRUE), 
            o3detrend = o3tmean2 - mean(o3tmean2, na.rm = TRUE),
            o3z = o3detrend / sd(o3tmean2, na.rm = TRUE))

chicago_detrended %>% head()

A tibble: 6 × 3
pm10detrend	o3detrend	o3z
<dbl>	<dbl>	<dbl>
0.1047939	-15.18551	-1.333702
NA	-16.13117	-1.416756
0.2714605	-16.10218	-1.414211
13.1047939	-15.06051	-1.322724
NA	-14.68551	-1.289789
14.1047939	-13.60218	-1.194642

Note that there are only three columns in the transmuted data frame.

Here’s a table of helper functions specifically associated with mutate():

Helper Function	Description	Example
`if_else()`	A vectorized conditional operation (if-else).	`mutate(data, status = if_else(age < 18, "minor", "adult"))`
`case_when()`	Generalized multiple conditions for transformations.	`mutate(data, category = case_when(score > 90 ~ "A", score > 80 ~ "B", TRUE ~ "C"))`
`lead()`	Access the subsequent value in the column.	`mutate(data, next_value = lead(column_name))`
`lag()`	Access the preceding value in the column.	`mutate(data, previous_value = lag(column_name))`
`row_number()`	Generate row numbers, often used with `arrange()`.	`mutate(data, rank = row_number())`
`rank()`, `min_rank()`, `dense_rank()`	Different types of ranking functions.	`mutate(data, rank = rank(score))`
`cumsum()`, `cumprod()`, `cummin()`, `cummax()`	Cumulative functions.	`mutate(data, cumulative_total = cumsum(total))`
`coalesce()`	Return the first non-missing value among its arguments.	`mutate(data, combined = coalesce(col1, col2))`
`na_if()`	Replace a specified value with NA.	`mutate(data, updated_col = na_if(column_name, "replace_me"))`
`replace_na()`	Replace NA values with a specified value.	`mutate(data, updated_col = replace_na(column_name, "new_value"))`
`recode()`	Recode or replace values in a column.	`mutate(data, recoded_col = recode(column_name, old="new", ...))`
`across()`	Apply a function across multiple columns (from `dplyr` 1.0.0).	`mutate(data, across(c(col1, col2), function_name))`

`group_by()`

The group_by() function is used to generate summary or aggregate statistics from a data frame.

In conjunction with the group_by() function we often use the summarize() function.

The general operation here is a combination of:

splitting a table into separate pieces defined by a variable or group of variables (group_by()),
then applying a summary function across those subsets (summarize()),
then combining back into another table.

This pattern is called split, apply, and combine.

It is a fundamental pattern in data processing.

For example, in this air pollution dataset, you might want to know the average annual level of PM2.5.

The stratum is the year, which we can derive from the date variable.

First, we can create a year variable using as.POSIXlt().

Then we group on our new column and compute summary statistics for each year with the summarize() function.

years <- chicago %>%
  mutate(year = as.POSIXlt(date)$year + 1900) %>%
  group_by(year) %>%
  summarize(pm25 = mean(pm25, na.rm = TRUE), 
          o3 = max(o3tmean2, na.rm = TRUE), 
          no2 = median(no2tmean2, na.rm = TRUE), .groups = "drop")

years %>% head()

A tibble: 6 × 4
year	pm25	o3	no2
<dbl>	<dbl>	<dbl>	<dbl>
1987	NaN	62.96966	23.49369
1988	NaN	61.67708	24.52296
1989	NaN	59.72727	26.14062
1990	NaN	52.22917	22.59583
1991	NaN	63.10417	21.38194
1992	NaN	50.82870	24.78921

To see why we add $1900$ in the preceding mutation, note that the date function returns just the last two digits of the year.

as.POSIXlt(chicago$date)$year %>% head()

Note in the process, we implicitly created a separate data frame that splits the original data frame by year, which is then passed to summarize().

summarize() then returns a data frame with columns for year and the annual averages of pm25, o3, and no2.

In a slightly more complicated example, we might want to know what are the average levels of ozone o3 and nitrogen dioxide no2 within quantiles of pm25.

We could do this with a regression model, but we can also do this quickly with group_by() and summarize().

First, we can create a categorical variable of pm25 divided into quantiles.

qq <- quantile(chicago$pm25, seq(0, 1, 0.2), na.rm = TRUE)
print(qq)

    0%    20%    40%    60%    80%   100% 
 1.700  8.700 12.375 16.700 22.610 61.500

Then we group the data frame by the pm25.quint variable and compute the mean of o3 and no2 within quantiles of pm25.

chicago %>%
  mutate(pm25.quint = cut(pm25, qq)) %>%
  group_by(pm25.quint) %>%
  summarize(o3 = mean(o3tmean2, na.rm = TRUE), 
          no2 = mean(no2tmean2, na.rm = TRUE),
          .groups = "drop")

A tibble: 6 × 3
pm25.quint	o3	no2
<fct>	<dbl>	<dbl>
(1.7,8.7]	21.66401	17.99129
(8.7,12.4]	20.38248	22.13004
(12.4,16.7]	20.66160	24.35708
(16.7,22.6]	19.88122	27.27132
(22.6,61.5]	20.31775	29.64427
NA	18.79044	25.77585

From the table, it seems there isn’t a strong relationship between pm25 and o3, but there appears to be a positive correlation between pm25 and no2.

More sophisticated statistical modeling can help to provide precise answers to these questions, but a simple application of Dplyr functions can often get you most of the way there.

Here is another example:

chicago %>%  
  mutate(month = as.POSIXlt(date)$mon + 1) %>% 
  group_by(month) %>%                              # SPLIT
  summarize(pm25 = mean(pm25, na.rm = TRUE),       # APPLY 
    o3 = max(o3tmean2, na.rm = TRUE), 
    no2 = median(no2tmean2, na.rm = TRUE),
    .groups = "drop") 
                                                    # COMBINE (in the result)

A tibble: 12 × 4
month	pm25	o3	no2
<dbl>	<dbl>	<dbl>	<dbl>
1	17.76996	28.22222	25.35417
2	20.37513	37.37500	26.78034
3	17.40818	39.05000	26.76984
4	13.85879	47.94907	25.03125
5	14.07420	52.75000	24.22222
6	15.86461	66.58750	25.01140
7	16.57087	59.54167	22.38442
8	16.93380	53.96701	22.98333
9	15.91279	57.48864	24.47917
10	14.23557	47.09275	24.15217
11	15.15794	29.45833	23.56537
12	17.52221	27.70833	24.45773

Here’s a table of helper functions specifically associated with group_by():

Helper Function	Description	Example
`add_tally()`	Adds a column with the count of each group (equivalent to adding a count with `mutate()` after a `group_by()`).	`group_by(data, group_var) %>% add_tally()`
`group_size()`	Returns the size of each group.	`group_by(data, group_var) %>% summarise(size = group_size())`
`ntile()`	Creates rank percentiles within groups.	`group_by(data, group_var) %>% mutate(perc = ntile(score, 4))`
`cur_group_id()`	Gives an integer identifier for the current group.	`group_by(data, group_var) %>% filter(cur_group_id() == 1)`
`cur_group_rows()`	Provides row indices for the current group.	Used internally, but can be useful for advanced operations.
`group_map()`, `group_modify()`, `group_walk()`	Functions to iterate over groups, modify them or walk them with side effects.	`group_map(data, ~ head(.x, 2))` shows the first two rows of each group.

It’s worth noting that group_by() often works in tandem with other verbs, so the real power comes from the combined operations.

For example, after grouping data, one often uses summarise() to calculate summary statistics for each group.

The combination of these functions and their associated helpers allow for a wide range of group-wise data manipulation tasks.

Based on Peng 2022