NB: Dplyr

Programming for Data Science

What is Dplyr?

The Dplyr package is a cetnral component of the Tidyverse.

Technically, Dplyr is an optimized and distilled version of the previous Plyr package.

Plyr organizes much of the functionality of the family of apply functions in base R — functions like apply(), lapply(), sapply(), and tapply().

These functions perform the same action on multiple chunks of data in a repetitive way.

For example, apply() applies a function to every element of a row or column in a dataset.

Why Dplyr?

Dplyr was developed to provide a clean and intelligble way to process data tables.

It does this by providing a simple grammar for data manipulation and for operating on data frames.

With this grammar, you can clearly express and communicate what it is that you are doing to a data frame that other people can understand.

This is useful because it provides an abstraction for data manipulation that previously did not exist within R and other programming languages.

Another useful contribution is that the Dplyr functions are very fast, as many key operations are coded in C++.

Note that Dplyr does not provide any new functionality to R per se.

Everything Dplyr does may already be done with base R.

But it greatly simplifies existing functionality in R.

In the table below, you can see that that Dplyr provides a consistent way of phrasing the someone chaotic idioms of base R.

dplyr base R
arrange(df, x) df[order(x), , drop = FALSE]
distinct(df, x) df[!duplicated(x), , drop = FALSE], unique()
filter(df, x) df[which(x), , drop = FALSE], subset()
mutate(df, z = x + y) df$z <- df$x + df$y, transform()
pull(df, 1) df[[1]]
pull(df, x) df$x
rename(df, y = x) names(df)[names(df) == "x"] <- "y"
relocate(df, y) df[union("y", names(df))]
select(df, x, y) df[c("x", "y")], subset()
select(df, starts_with("x")) df[grepl("
summarise(df, mean(x)) mean(df$x), tapply(), aggregate(), by()
slice(df, c(1, 2, 5)) df[c(1, 2, 5), , drop = FALSE]

A Grammar of Data

Some of the key “verbs” provided by the Dplyr package are:

  • select: return a subset of the columns of a data frame, using a flexible notation.
  • filter: extract a subset of rows from a data frame based on logical conditions.
  • arrange: reorder rows of a. data frame.
  • rename: rename variables in a data frame.
  • mutate: add new variables/columns or transform existing variables.
  • summarise / summarize: generate summary statistics of different variables in the data frame, possibly within strata.

%>%: the pipe operator is used to connect multiple verb actions together into a pipeline.

Common Function Properties

All Dplyr functions have some common characteristics:

  1. The first argument is a data frame (or tibble).
  2. The succeeding arguments describe what to do with the data frame specified in the first argument.
  3. The return result of a function is a new data frame.

In addition, Dplyr expects data frames to be be properly structured as tidy data.

Installing Dplyr

You can install Dplyr directly or as part of Tidyverse, which we installed in the previous notebook.

# install.packages("dplyr")

Once the package is installed, you may load it into your R session by calling the library() function.

# library(dplyr)

Of course, you can also just install tidyverse if you plan to visualize your results.

Since we did this in our previous notebook, we can import it now.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

You may get some warnings when the package is loaded because there are functions in the dplyr package that have the same name as functions in other packages. For now you can ignore the warnings.

Getting a dataset

Let’s use a dataset containing air pollution and temperature data for the city of Chicago in the U.S.

You will find the data in the directory for this module.

We load the data into R using the readRDS() function and then convert it to a tibble.

chicago <- readRDS("chicago.rds") %>% as_tibble()

We can see some basic characteristics of the dataset with the print() function.

print(chicago)
# A tibble: 6,940 × 8
   city   tmpd  dptp date       pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
   <chr> <dbl> <dbl> <date>          <dbl>      <dbl>    <dbl>     <dbl>
 1 chic   31.5  31.5 1987-01-01         NA       34       4.25      20.0
 2 chic   33    29.9 1987-01-02         NA       NA       3.30      23.2
 3 chic   33    27.4 1987-01-03         NA       34.2     3.33      23.8
 4 chic   29    28.6 1987-01-04         NA       47       4.38      30.4
 5 chic   32    28.9 1987-01-05         NA       NA       4.75      30.3
 6 chic   40    35.1 1987-01-06         NA       48       5.83      25.8
 7 chic   34.5  26.8 1987-01-07         NA       41       9.29      20.6
 8 chic   29    22   1987-01-08         NA       36      11.3       17.0
 9 chic   26.5  29   1987-01-09         NA       33.3     4.5       23.4
10 chic   32.5  27.8 1987-01-10         NA       NA       4.96      19.5
# ℹ 6,930 more rows

Dplyr Functions in Depth

Let’s use this tibble to explore each of the Dplyr verbs in a little more depth.

select()

We use the select() function to select columns you want to focus on.

For example, we can inspect the tmpd column like so:

chicago %>% select(tmpd) %>% head()
A tibble: 6 × 1
tmpd
<dbl>
31.5
33.0
33.0
29.0
32.0
40.0

In base R, we would do this:

head(chicago["tmpd"])
A tibble: 6 × 1
tmpd
<dbl>
31.5
33.0
33.0
29.0
32.0
40.0

Now suppose we wanted to take the first \(3\) columns only.

We could use numerical indices in base R.

head(chicago[1:3])
A tibble: 6 × 3
city tmpd dewpoint
<chr> <dbl> <dbl>
chic 31.5 31.500
chic 33.0 29.875
chic 33.0 27.375
chic 29.0 28.625
chic 32.0 28.875
chic 40.0 35.125

In Dplr, we can use the names directly to create a column range slice:

chicago %>%
  select(city:dptp) %>%
  head()
A tibble: 6 × 3
city tmpd dptp
<chr> <dbl> <dbl>
chic 31.5 31.500
chic 33.0 29.875
chic 33.0 27.375
chic 29.0 28.625
chic 32.0 28.875
chic 40.0 35.125

Note the use of : inside the select() — you can use it to specify a range of variable names.

You can also omit variables using the select() function by using the negative sign.

With select() you can do:

chicago %>%
    select(-(city:dptp)) %>% 
    head()
A tibble: 6 × 5
date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
<date> <dbl> <dbl> <dbl> <dbl>
1987-01-01 NA 34.00000 4.250000 19.98810
1987-01-02 NA NA 3.304348 23.19099
1987-01-03 NA 34.16667 3.333333 23.81548
1987-01-04 NA 47.00000 4.375000 30.43452
1987-01-05 NA NA 4.750000 30.33333
1987-01-06 NA 48.00000 5.833333 25.77233

This indicates that we should include every variable except the variables city through dptp.

The select() function also allows a special syntax that allows you to specify variable names based on patterns.

If you wanted to keep every variable that ends with a “2”, we could do this:

chicago %>%
  select(ends_with("2")) %>%
  head()
A tibble: 6 × 4
pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
<dbl> <dbl> <dbl> <dbl>
NA 34.00000 4.250000 19.98810
NA NA 3.304348 23.19099
NA 34.16667 3.333333 23.81548
NA 47.00000 4.375000 30.43452
NA NA 4.750000 30.33333
NA 48.00000 5.833333 25.77233

Or if we wanted to keep every variable that starts with a “d”, we could do this:

chicago %>%
  select(starts_with("d")) %>%
  head()
A tibble: 6 × 2
dptp date
<dbl> <date>
31.500 1987-01-01
29.875 1987-01-02
27.375 1987-01-03
28.625 1987-01-04
28.875 1987-01-05
35.125 1987-01-06

This is super useful if you have a table with prefixes, such as doc_.

You can also use more general regular expressions if necessary.

Aside: Helper Functions

Note the use of the function ends_with() inside of the argument space for the function select() above.

ends_with() is an example of a “helper function” — a function that helps make Dplyr functions more effective.

There are many of these introduced by Dplyr.

Here is a set of helper functions that can be especially useful with select():

Helper Function Description Example
starts_with() Matches variable names that start with a given prefix select(data, starts_with("X"))
ends_with() Matches variable names that end with a given suffix select(data, ends_with("_2019"))
contains() Matches variable names that contain a given string select(data, contains("income"))
matches() Matches variable names that match a given regular expression select(data, matches("^X[1-3]$"))
num_range() Matches variable names that match a pattern of prefix + number select(data, num_range("X", 1:3))
one_of() Matches variable names that are in a given set of names select(data, one_of(c("var1", "var2")))
everything() Matches all variable names (used to reorder variables) select(data, starts_with("X"), everything())

filter()

The filter() function extracts subsets of rows from a data frame.

Whereas select() is a column-wise operation, filter() is row-wise.

This function is similar to the existing subset() function in R but is quite a bit faster.

Suppose we wanted to extract the rows of the chicago data frame where the levels of PM2.5 are greater than 30 (which is a reasonably high level), we could do this:

chicago %>%
  filter(pm25tmean2 > 30) %>%
  head()
A tibble: 6 × 8
city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
<chr> <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl>
chic 23 21.9 1998-01-17 38.10 32.46154 3.180556 25.30000
chic 28 25.8 1998-01-23 33.95 38.69231 1.750000 29.37630
chic 55 51.3 1998-04-30 39.40 34.00000 10.786232 25.31310
chic 59 53.7 1998-05-01 35.40 28.50000 14.295125 31.42905
chic 57 52.0 1998-05-02 33.30 35.00000 20.662879 26.79861
chic 57 56.0 1998-05-07 32.10 34.50000 24.270422 33.99167

Compare this to base R:

head(na.omit(chicago[(chicago$pm25tmean2 > 30),]))
A tibble: 6 × 8
city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
<chr> <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl>
chic 23 21.9 1998-01-17 38.10 32.46154 3.180556 25.30000
chic 28 25.8 1998-01-23 33.95 38.69231 1.750000 29.37630
chic 55 51.3 1998-04-30 39.40 34.00000 10.786232 25.31310
chic 59 53.7 1998-05-01 35.40 28.50000 14.295125 31.42905
chic 57 52.0 1998-05-02 33.30 35.00000 20.662879 26.79861
chic 57 56.0 1998-05-07 32.10 34.50000 24.270422 33.99167

Note that we have to filter out the NA values, which Dplyr does for you.

Note also that column names in Dply do not need to be prefixed by the data frame name; they are called directly.

For example, compare how the column pm25tmean2 is accessed below:

chicago %>% filter(pm25tmean2 > 30)

vs

chicago[(chicago$pm25tmean2 > 30),]

We can place an arbitrarily complex logical sequence inside of filter().

For example, we may extract the rows where PM2.5 is greater than \(30\) and temperature is greater than \(80\) degrees Fahrenheit.

chicago %>%
    filter(pm25tmean2 > 30 & tmpd > 80) %>%
    select(date, tmpd, pm25tmean2) %>%
    head()
A tibble: 6 × 3
date tmpd pm25tmean2
<date> <dbl> <dbl>
1998-08-23 81 39.6000
1998-09-06 81 31.5000
2001-07-20 82 32.3000
2001-08-01 84 43.7000
2001-08-08 85 38.8375
2001-08-09 84 38.2000

Here’s a table of helper functions specifically associated with filter():

Helper Function Description Example
between() Check if a value lies between two other values. filter(data, between(age, 18, 30))
near() Check for near equality (useful for floating-point numbers). filter(data, near(price, 19.99, tolerance = 0.01))
isin() (or %in%) Check if a value is in a set of values. filter(data, region %in% c("East", "West"))
all_of() Check for all of the variables, even if some don’t exist. Useful with character vectors. filter(data, all_of(variables))
any_of() Check for any of the variables. Useful with character vectors. filter(data, any_of(variables))
cur_group_id() Identify the current group inside group_by(). Useful for more complex filtering with groups. filter(data, cur_group_id() == 1)

arrange()

We use arrange() to reorder rows of a data frame according to one or more of the columns.

This normally called sorting.

Reordering rows of a data frame (while preserving corresponding order of other columns) is normally a pain to do in R.

The arrange() function simplifies the process quite a bit.

Here we order the rows of the data frame by date in ascending order:

chicago_by_date <- chicago %>%
    arrange(date)

chicago_by_date %>% head()
A tibble: 6 × 8
city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
<chr> <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl>
chic 31.5 31.500 1987-01-01 NA 34.00000 4.250000 19.98810
chic 33.0 29.875 1987-01-02 NA NA 3.304348 23.19099
chic 33.0 27.375 1987-01-03 NA 34.16667 3.333333 23.81548
chic 29.0 28.625 1987-01-04 NA 47.00000 4.375000 30.43452
chic 32.0 28.875 1987-01-05 NA NA 4.750000 30.33333
chic 40.0 35.125 1987-01-06 NA 48.00000 5.833333 25.77233

We can now check the first few rows

chicago_by_date %>%
  select(date, pm25tmean2) %>%
  head(3)
A tibble: 3 × 2
date pm25tmean2
<date> <dbl>
1987-01-01 NA
1987-01-02 NA
1987-01-03 NA

and the last few rows.

chicago_by_date %>%
  select(date, pm25tmean2) %>%
  tail(3)
A tibble: 3 × 2
date pm25tmean2
<date> <dbl>
2005-12-29 7.45000
2005-12-30 15.05714
2005-12-31 15.00000

Columns can be arranged in descending order too by using the desc() helper function.

chicago %>%
  arrange(desc(date)) %>%
  head()
A tibble: 6 × 8
city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
<chr> <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl>
chic 35 30.1 2005-12-31 15.00000 23.5 2.531250 13.25000
chic 36 31.0 2005-12-30 15.05714 19.2 3.034420 22.80556
chic 35 29.4 2005-12-29 7.45000 23.5 6.794837 19.97222
chic 37 34.5 2005-12-28 17.75000 27.5 3.260417 19.28563
chic 40 33.6 2005-12-27 23.56000 27.0 4.468750 23.50000
chic 35 29.6 2005-12-26 8.40000 8.5 14.041667 16.81944

Here’s a table of helper functions specifically associated with arrange():

Helper Function Description Example
desc() Order by descending instead of default ascending. arrange(data, desc(salary))
across() Apply to multiple columns, especially with c_across() in dplyr 1.0.0 and later. arrange(data, across(starts_with("X")))
row_number() Create an index based on the order of rows. mutate(data, idx = row_number())

Note that row_number() is often seen in mutate() for creating an index column based on ordering.

The across() function provides a more concise syntax for working with multiple columns.

It is useful in arrange() when you’re trying to order rows based on a selection of columns that match a certain criterion.

Most of the magic with arrange() happens through the way you structure its arguments and not necessarily through a large set of helper functions.

rename()

Renaming a variable in a data frame in R is surprisingly hard to do!

The rename() function is designed to make this process easier.

Here we select the names of the first five variables in the chicago data frame.

chicago %>% select(1:5) %>% head(3)
A tibble: 3 × 5
city tmpd dptp date pm25tmean2
<chr> <dbl> <dbl> <date> <dbl>
chic 31.5 31.500 1987-01-01 NA
chic 33.0 29.875 1987-01-02 NA
chic 33.0 27.375 1987-01-03 NA

The dptp column represents the dew point temperature and the pm25tmean2 column the PM2.5 data.

These names are cryptic, so let’s rename them.

chicago <- chicago %>%
  rename(dewpoint = dptp, pm25 = pm25tmean2) 

chicago %>%  
  head(3)
A tibble: 3 × 8
city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2
<chr> <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl>
chic 31.5 31.500 1987-01-01 NA 34.00000 4.250000 19.98810
chic 33.0 29.875 1987-01-02 NA NA 3.304348 23.19099
chic 33.0 27.375 1987-01-03 NA 34.16667 3.333333 23.81548

The syntax inside the rename() function is to have the new name on the left-hand side of the = sign and the old name on the right-hand side.

Here’s a table of helper functions that can be used with rename():

Helper Function Description Example
starts_with() Match columns that start with a given prefix. rename(data, start_X = starts_with("X")) (Note: This example would rename the first column that starts with “X” to “start_X”)
ends_with() Match columns that end with a given suffix. rename(data, end_year = ends_with("_2021"))
contains() Match columns that contain a given string. rename(data, has_rate = contains("rate"))
matches() Match columns that match a given regular expression. rename(data, match_num = matches("^X[1-3]$"))
everything() Match all columns. Rarely used with rename(), but possible.

It’s worth noting that these helper functions will typically target the first column that meets the specified criterion.

If there are multiple columns that match the criterion (e.g., multiple columns that start with “X”), you’ll need to use these helpers with more precision or utilize other techniques.

Also, remember to always refer to the official Dplyr documentation as the package continues to evolve and new functionalities might be added over time.

mutate()

The mutate() function creates new columns based on computations usually involving the other columns.

For example, with air pollution data, we often want to de-trend the data by subtracting the mean from the data.

This allows us to see whether a given day’s air pollution level is higher than or less than average.

Here we create a pm25detrend variable that subtracts the mean from the pm25 variable.

We also compute the Z-score.

chicago <- chicago %>%
    mutate(pm25detrend = pm25 - mean(pm25, na.rm = TRUE),
       pm25z = pm25detrend / sd(pm25, na.rm = TRUE))

chicago %>% tail()
A tibble: 6 × 10
city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2 pm25detrend pm25z
<chr> <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
chic 35 29.6 2005-12-26 8.40000 8.5 14.041667 16.81944 -7.830958 -0.9003429
chic 40 33.6 2005-12-27 23.56000 27.0 4.468750 23.50000 7.329042 0.8426366
chic 37 34.5 2005-12-28 17.75000 27.5 3.260417 19.28563 1.519042 0.1746477
chic 35 29.4 2005-12-29 7.45000 23.5 6.794837 19.97222 -8.780958 -1.0095666
chic 36 31.0 2005-12-30 15.05714 19.2 3.034420 22.80556 -1.173815 -0.1349561
chic 35 30.1 2005-12-31 15.00000 23.5 2.531250 13.25000 -1.230958 -0.1415260

Note how the new column name pm25detrend can be used immediately within the argument space of mutate().

Dplyr also has a related transmute() function.

It does the same thing as mutate() but then drops all non-transformed variables.

Here we de-trend the PM10 and ozone (O3) variables.

chicago_detrended <- chicago %>%
  transmute(pm10detrend = pm10tmean2 - mean(pm10tmean2, na.rm = TRUE), 
            o3detrend = o3tmean2 - mean(o3tmean2, na.rm = TRUE),
            o3z = o3detrend / sd(o3tmean2, na.rm = TRUE))

chicago_detrended %>% head()
A tibble: 6 × 3
pm10detrend o3detrend o3z
<dbl> <dbl> <dbl>
0.1047939 -15.18551 -1.333702
NA -16.13117 -1.416756
0.2714605 -16.10218 -1.414211
13.1047939 -15.06051 -1.322724
NA -14.68551 -1.289789
14.1047939 -13.60218 -1.194642

Note that there are only three columns in the transmuted data frame.

Here’s a table of helper functions specifically associated with mutate():

Helper Function Description Example
if_else() A vectorized conditional operation (if-else). mutate(data, status = if_else(age < 18, "minor", "adult"))
case_when() Generalized multiple conditions for transformations. mutate(data, category = case_when(score > 90 ~ "A", score > 80 ~ "B", TRUE ~ "C"))
lead() Access the subsequent value in the column. mutate(data, next_value = lead(column_name))
lag() Access the preceding value in the column. mutate(data, previous_value = lag(column_name))
row_number() Generate row numbers, often used with arrange(). mutate(data, rank = row_number())
rank(), min_rank(), dense_rank() Different types of ranking functions. mutate(data, rank = rank(score))
cumsum(), cumprod(), cummin(), cummax() Cumulative functions. mutate(data, cumulative_total = cumsum(total))
coalesce() Return the first non-missing value among its arguments. mutate(data, combined = coalesce(col1, col2))
na_if() Replace a specified value with NA. mutate(data, updated_col = na_if(column_name, "replace_me"))
replace_na() Replace NA values with a specified value. mutate(data, updated_col = replace_na(column_name, "new_value"))
recode() Recode or replace values in a column. mutate(data, recoded_col = recode(column_name, old="new", ...))
across() Apply a function across multiple columns (from dplyr 1.0.0). mutate(data, across(c(col1, col2), function_name))

group_by()

The group_by() function is used to generate summary or aggregate statistics from a data frame.

In conjunction with the group_by() function we often use the summarize() function.

The general operation here is a combination of:

  • splitting a table into separate pieces defined by a variable or group of variables (group_by()),
  • then applying a summary function across those subsets (summarize()),
  • then combining back into another table.

This pattern is called split, apply, and combine.

It is a fundamental pattern in data processing.

For example, in this air pollution dataset, you might want to know the average annual level of PM2.5.

The stratum is the year, which we can derive from the date variable.

First, we can create a year variable using as.POSIXlt().

Then we group on our new column and compute summary statistics for each year with the summarize() function.

years <- chicago %>%
  mutate(year = as.POSIXlt(date)$year + 1900) %>%
  group_by(year) %>%
  summarize(pm25 = mean(pm25, na.rm = TRUE), 
          o3 = max(o3tmean2, na.rm = TRUE), 
          no2 = median(no2tmean2, na.rm = TRUE), .groups = "drop")

years %>% head()
A tibble: 6 × 4
year pm25 o3 no2
<dbl> <dbl> <dbl> <dbl>
1987 NaN 62.96966 23.49369
1988 NaN 61.67708 24.52296
1989 NaN 59.72727 26.14062
1990 NaN 52.22917 22.59583
1991 NaN 63.10417 21.38194
1992 NaN 50.82870 24.78921

To see why we add \(1900\) in the preceding mutation, note that the date function returns just the last two digits of the year.

as.POSIXlt(chicago$date)$year %>% head()
  1. 87
  2. 87
  3. 87
  4. 87
  5. 87
  6. 87

Note in the process, we implicitly created a separate data frame that splits the original data frame by year, which is then passed to summarize().

summarize() then returns a data frame with columns for year and the annual averages of pm25, o3, and no2.

In a slightly more complicated example, we might want to know what are the average levels of ozone o3 and nitrogen dioxide no2 within quantiles of pm25.

We could do this with a regression model, but we can also do this quickly with group_by() and summarize().

First, we can create a categorical variable of pm25 divided into quantiles.

qq <- quantile(chicago$pm25, seq(0, 1, 0.2), na.rm = TRUE)
print(qq)
    0%    20%    40%    60%    80%   100% 
 1.700  8.700 12.375 16.700 22.610 61.500 

Then we group the data frame by the pm25.quint variable and compute the mean of o3 and no2 within quantiles of pm25.

chicago %>%
  mutate(pm25.quint = cut(pm25, qq)) %>%
  group_by(pm25.quint) %>%
  summarize(o3 = mean(o3tmean2, na.rm = TRUE), 
          no2 = mean(no2tmean2, na.rm = TRUE),
          .groups = "drop")
A tibble: 6 × 3
pm25.quint o3 no2
<fct> <dbl> <dbl>
(1.7,8.7] 21.66401 17.99129
(8.7,12.4] 20.38248 22.13004
(12.4,16.7] 20.66160 24.35708
(16.7,22.6] 19.88122 27.27132
(22.6,61.5] 20.31775 29.64427
NA 18.79044 25.77585

From the table, it seems there isn’t a strong relationship between pm25 and o3, but there appears to be a positive correlation between pm25 and no2.

More sophisticated statistical modeling can help to provide precise answers to these questions, but a simple application of Dplyr functions can often get you most of the way there.

Here is another example:

chicago %>%  
  mutate(month = as.POSIXlt(date)$mon + 1) %>% 
  group_by(month) %>%                              # SPLIT
  summarize(pm25 = mean(pm25, na.rm = TRUE),       # APPLY 
    o3 = max(o3tmean2, na.rm = TRUE), 
    no2 = median(no2tmean2, na.rm = TRUE),
    .groups = "drop") 
                                                    # COMBINE (in the result)
A tibble: 12 × 4
month pm25 o3 no2
<dbl> <dbl> <dbl> <dbl>
1 17.76996 28.22222 25.35417
2 20.37513 37.37500 26.78034
3 17.40818 39.05000 26.76984
4 13.85879 47.94907 25.03125
5 14.07420 52.75000 24.22222
6 15.86461 66.58750 25.01140
7 16.57087 59.54167 22.38442
8 16.93380 53.96701 22.98333
9 15.91279 57.48864 24.47917
10 14.23557 47.09275 24.15217
11 15.15794 29.45833 23.56537
12 17.52221 27.70833 24.45773

Here’s a table of helper functions specifically associated with group_by():

Helper Function Description Example
add_tally() Adds a column with the count of each group (equivalent to adding a count with mutate() after a group_by()). group_by(data, group_var) %>% add_tally()
group_size() Returns the size of each group. group_by(data, group_var) %>% summarise(size = group_size())
ntile() Creates rank percentiles within groups. group_by(data, group_var) %>% mutate(perc = ntile(score, 4))
cur_group_id() Gives an integer identifier for the current group. group_by(data, group_var) %>% filter(cur_group_id() == 1)
cur_group_rows() Provides row indices for the current group. Used internally, but can be useful for advanced operations.
group_map(), group_modify(), group_walk() Functions to iterate over groups, modify them or walk them with side effects. group_map(data, ~ head(.x, 2)) shows the first two rows of each group.

It’s worth noting that group_by() often works in tandem with other verbs, so the real power comes from the combined operations.

For example, after grouping data, one often uses summarise() to calculate summary statistics for each group.

The combination of these functions and their associated helpers allow for a wide range of group-wise data manipulation tasks.

Based on Peng 2022