R Dplyr Exercises¶
Programming for Data Science Bootcamp
Exercise 1¶
Install Tidyverse
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ── ✔ dplyr 1.1.3 ✔ readr 2.1.4 ✔ forcats 1.0.0 ✔ stringr 1.5.0 ✔ ggplot2 3.4.4 ✔ tibble 3.2.1 ✔ lubridate 1.9.3 ✔ tidyr 1.3.0 ✔ purrr 1.0.2 ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag() ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Exercise 2¶
We create a super simple tibble to use in the exercises.
You can find a text file called exercise-data.txt
in the directory of today's module (M11
).
Just cut and paste it into you code.
scores <-
tibble(
name = c("mike", "carol", "greg", "marcia", "peter", "jan", "bobby", "cindy", "alice"),
school = c("south", "south", "south", "south", "north", "north", "north", "south", "south"),
teacher = c("johnson", "johnson", "johnson", "johnson", "smith", "smith", "smith", "perry", "perry"),
sex = c("male", "female", "male", "female", "male", "female", "male", "female", "female"),
math_score = c(4, 3, 2, 4, 3, 4, 5, 4, 5),
reading_score = c(1, 5, 2, 4, 5, 4, 1, 5, 4)
)
Exercise 3¶
View the tibble.
scores
name | school | teacher | sex | math_score | reading_score |
---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> |
mike | south | johnson | male | 4 | 1 |
carol | south | johnson | female | 3 | 5 |
greg | south | johnson | male | 2 | 2 |
marcia | south | johnson | female | 4 | 4 |
peter | north | smith | male | 3 | 5 |
jan | north | smith | female | 4 | 4 |
bobby | north | smith | male | 5 | 1 |
cindy | south | perry | female | 4 | 5 |
alice | south | perry | female | 5 | 4 |
print(scores)
# A tibble: 9 × 6 name school teacher sex math_score reading_score <chr> <chr> <chr> <chr> <dbl> <dbl> 1 mike south johnson male 4 1 2 carol south johnson female 3 5 3 greg south johnson male 2 2 4 marcia south johnson female 4 4 5 peter north smith male 3 5 6 jan north smith female 4 4 7 bobby north smith male 5 1 8 cindy south perry female 4 5 9 alice south perry female 5 4
Exercise 4¶
Make sure you understand the difference between
- doing something and assigning it to a variable
- and just doing it without assigning it
when using %>%
.
First, just get the first three rows (using slice()
).
Then, assign the first three rows to a variable.
scores %>%
slice(1:3)
name | school | teacher | sex | math_score | reading_score |
---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> |
mike | south | johnson | male | 4 | 1 |
carol | south | johnson | female | 3 | 5 |
greg | south | johnson | male | 2 | 2 |
Get the first 3 rows, and assign it to a new name scores_small
.
scores_small <- scores %>%
slice(1:3)
See what's in scores_small
.
scores_small
name | school | teacher | sex | math_score | reading_score |
---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> |
mike | south | johnson | male | 4 | 1 |
carol | south | johnson | female | 3 | 5 |
greg | south | johnson | male | 2 | 2 |
Another option:
scores %>% head(3)
name | school | teacher | sex | math_score | reading_score |
---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> |
mike | south | johnson | male | 4 | 1 |
carol | south | johnson | female | 3 | 5 |
greg | south | johnson | male | 2 | 2 |
And the old way:
scores[1:3,]
name | school | teacher | sex | math_score | reading_score |
---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> |
mike | south | johnson | male | 4 | 1 |
carol | south | johnson | female | 3 | 5 |
greg | south | johnson | male | 2 | 2 |
Part 1: Arrange¶
scores %>%
arrange(desc(math_score))
name | school | teacher | sex | math_score | reading_score |
---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> |
bobby | north | smith | male | 5 | 1 |
alice | south | perry | female | 5 | 4 |
mike | south | johnson | male | 4 | 1 |
marcia | south | johnson | female | 4 | 4 |
jan | north | smith | female | 4 | 4 |
cindy | south | perry | female | 4 | 5 |
carol | south | johnson | female | 3 | 5 |
peter | north | smith | male | 3 | 5 |
greg | south | johnson | male | 2 | 2 |
Answer: Bobby and Alice both tied for the highest math score
Exercise 6¶
Sort the data by name
in alphabetical order.
scores %>%
arrange(name)
name | school | teacher | sex | math_score | reading_score |
---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> |
alice | south | perry | female | 5 | 4 |
bobby | north | smith | male | 5 | 1 |
carol | south | johnson | female | 3 | 5 |
cindy | south | perry | female | 4 | 5 |
greg | south | johnson | male | 2 | 2 |
jan | north | smith | female | 4 | 4 |
marcia | south | johnson | female | 4 | 4 |
mike | south | johnson | male | 4 | 1 |
peter | north | smith | male | 3 | 5 |
Exercise 7¶
Sort the data by sex
so females show up first.
scores %>%
arrange(sex)
name | school | teacher | sex | math_score | reading_score |
---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> |
carol | south | johnson | female | 3 | 5 |
marcia | south | johnson | female | 4 | 4 |
jan | north | smith | female | 4 | 4 |
cindy | south | perry | female | 4 | 5 |
alice | south | perry | female | 5 | 4 |
mike | south | johnson | male | 4 | 1 |
greg | south | johnson | male | 2 | 2 |
peter | north | smith | male | 3 | 5 |
bobby | north | smith | male | 5 | 1 |
Exercise 8¶
Sort the data by school
, teacher
, sex
, math_score
, and reading_score
.
scores %>%
arrange(school, teacher, sex, math_score, reading_score)
name | school | teacher | sex | math_score | reading_score |
---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> |
jan | north | smith | female | 4 | 4 |
peter | north | smith | male | 3 | 5 |
bobby | north | smith | male | 5 | 1 |
carol | south | johnson | female | 3 | 5 |
marcia | south | johnson | female | 4 | 4 |
greg | south | johnson | male | 2 | 2 |
mike | south | johnson | male | 4 | 1 |
cindy | south | perry | female | 4 | 5 |
alice | south | perry | female | 5 | 4 |
Part 2: Select¶
Exercise 9¶
Select only name
, math_score
, and reading_score
.
scores %>%
select(name, math_score, reading_score)
name | math_score | reading_score |
---|---|---|
<chr> | <dbl> | <dbl> |
mike | 4 | 1 |
carol | 3 | 5 |
greg | 2 | 2 |
marcia | 4 | 4 |
peter | 3 | 5 |
jan | 4 | 4 |
bobby | 5 | 1 |
cindy | 4 | 5 |
alice | 5 | 4 |
Exercise 10¶
Select all of the columns except the sex
column.
scores %>%
select(-sex)
name | school | teacher | math_score | reading_score |
---|---|---|---|---|
<chr> | <chr> | <chr> | <dbl> | <dbl> |
mike | south | johnson | 4 | 1 |
carol | south | johnson | 3 | 5 |
greg | south | johnson | 2 | 2 |
marcia | south | johnson | 4 | 4 |
peter | north | smith | 3 | 5 |
jan | north | smith | 4 | 4 |
bobby | north | smith | 5 | 1 |
cindy | south | perry | 4 | 5 |
alice | south | perry | 5 | 4 |
Exercise 11¶
Select all of the columns except the math_score
and
reading_score columns
.
scores %>%
select(-sex, -reading_score)
name | school | teacher | math_score |
---|---|---|---|
<chr> | <chr> | <chr> | <dbl> |
mike | south | johnson | 4 |
carol | south | johnson | 3 |
greg | south | johnson | 2 |
marcia | south | johnson | 4 |
peter | north | smith | 3 |
jan | north | smith | 4 |
bobby | north | smith | 5 |
cindy | south | perry | 4 |
alice | south | perry | 5 |
Exercise 12¶
Keep all of the columns but rearrange them so that sex
is the
first column.
scores %>%
select(sex, everything())
sex | name | school | teacher | math_score | reading_score |
---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> |
male | mike | south | johnson | 4 | 1 |
female | carol | south | johnson | 3 | 5 |
male | greg | south | johnson | 2 | 2 |
female | marcia | south | johnson | 4 | 4 |
male | peter | north | smith | 3 | 5 |
female | jan | north | smith | 4 | 4 |
male | bobby | north | smith | 5 | 1 |
female | cindy | south | perry | 4 | 5 |
female | alice | south | perry | 5 | 4 |
Part 3: Filter¶
Exercise 13¶
Filter by students who are male
and went to south
.
Option 1
scores %>%
filter(sex == "male" & school == "south")
name | school | teacher | sex | math_score | reading_score |
---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> |
mike | south | johnson | male | 4 | 1 |
greg | south | johnson | male | 2 | 2 |
Option 2
Using a comma ,
instead of &
.
scores %>%
filter(sex == "male", school == "south")
name | school | teacher | sex | math_score | reading_score |
---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> |
mike | south | johnson | male | 4 | 1 |
greg | south | johnson | male | 2 | 2 |
Exercise 14¶
Filter by students who did above average in math.
scores %>%
filter(math_score > mean(math_score))
name | school | teacher | sex | math_score | reading_score |
---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> |
mike | south | johnson | male | 4 | 1 |
marcia | south | johnson | female | 4 | 4 |
jan | north | smith | female | 4 | 4 |
bobby | north | smith | male | 5 | 1 |
cindy | south | perry | female | 4 | 5 |
alice | south | perry | female | 5 | 4 |
Exercise 15¶
Use filter()
to figure out how many students had a math
score of $4$ or more and a reading score of $3$ or more.
scores %>%
filter(math_score >= 4 & reading_score >= 3)
name | school | teacher | sex | math_score | reading_score |
---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> |
marcia | south | johnson | female | 4 | 4 |
jan | north | smith | female | 4 | 4 |
cindy | south | perry | female | 4 | 5 |
alice | south | perry | female | 5 | 4 |
Exercise 16¶
Filter by students who got a $3$ or worse in either math or reading.
scores %>%
filter(math_score <= 3 | reading_score <= 3)
name | school | teacher | sex | math_score | reading_score |
---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> |
mike | south | johnson | male | 4 | 1 |
carol | south | johnson | female | 3 | 5 |
greg | south | johnson | male | 2 | 2 |
peter | north | smith | male | 3 | 5 |
bobby | north | smith | male | 5 | 1 |
Exercise 17¶
Filter by students who got a reading score of $2$, $3$, or $4$.
scores %>%
filter(reading_score == 2 | reading_score == 3 | reading_score == 4)
name | school | teacher | sex | math_score | reading_score |
---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> |
greg | south | johnson | male | 2 | 2 |
marcia | south | johnson | female | 4 | 4 |
jan | north | smith | female | 4 | 4 |
alice | south | perry | female | 5 | 4 |
scores %>%
filter(reading_score >= 2 & reading_score <= 4)
name | school | teacher | sex | math_score | reading_score |
---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> |
greg | south | johnson | male | 2 | 2 |
marcia | south | johnson | female | 4 | 4 |
jan | north | smith | female | 4 | 4 |
alice | south | perry | female | 5 | 4 |
scores %>%
filter(reading_score %in% 2:4)
name | school | teacher | sex | math_score | reading_score |
---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> |
greg | south | johnson | male | 2 | 2 |
marcia | south | johnson | female | 4 | 4 |
jan | north | smith | female | 4 | 4 |
alice | south | perry | female | 5 | 4 |
Exercise 18¶
Filter by students who have a name that starts with 'm'
.
Hint: type ?substr
in the console and then scroll to the bottom
of the help file to see useful examples.
# ?substr
scores %>%
filter(substr(name, 1, 1) == 'm')
name | school | teacher | sex | math_score | reading_score |
---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> |
mike | south | johnson | male | 4 | 1 |
marcia | south | johnson | female | 4 | 4 |
Part 4: Filter with groups¶
Exercise 19¶
Filter to teachers whose best math student got a score of $5$.
scores %>%
group_by(teacher) %>%
filter(max(math_score) == 5)
name | school | teacher | sex | math_score | reading_score |
---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> |
peter | north | smith | male | 3 | 5 |
jan | north | smith | female | 4 | 4 |
bobby | north | smith | male | 5 | 1 |
cindy | south | perry | female | 4 | 5 |
alice | south | perry | female | 5 | 4 |
Note that all students from both teachers are shown.
Exercise 20¶
Filter to the sex with a mean math score of $4$.
scores %>%
group_by(sex) %>%
filter(mean(math_score) == 4)
name | school | teacher | sex | math_score | reading_score |
---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> |
carol | south | johnson | female | 3 | 5 |
marcia | south | johnson | female | 4 | 4 |
jan | north | smith | female | 4 | 4 |
cindy | south | perry | female | 4 | 5 |
alice | south | perry | female | 5 | 4 |
Part 5: Mutate¶
Exercise 21¶
Set the math and reading scores to $10$ times their original values.
scores %>%
mutate(
math_score = math_score * 10,
reading_score = reading_score * 10
)
name | school | teacher | sex | math_score | reading_score |
---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> |
mike | south | johnson | male | 40 | 10 |
carol | south | johnson | female | 30 | 50 |
greg | south | johnson | male | 20 | 20 |
marcia | south | johnson | female | 40 | 40 |
peter | north | smith | male | 30 | 50 |
jan | north | smith | female | 40 | 40 |
bobby | north | smith | male | 50 | 10 |
cindy | south | perry | female | 40 | 50 |
alice | south | perry | female | 50 | 40 |
Exercise 22¶
Create a new column called math_reading_avg
which is the
average of a student's math and reading scores.
That is, combine the two scores and take their average.
scores %>%
mutate(math_reading_avg = (math_score + reading_score) / 2)
name | school | teacher | sex | math_score | reading_score | math_reading_avg |
---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> |
mike | south | johnson | male | 4 | 1 | 2.5 |
carol | south | johnson | female | 3 | 5 | 4.0 |
greg | south | johnson | male | 2 | 2 | 2.0 |
marcia | south | johnson | female | 4 | 4 | 4.0 |
peter | north | smith | male | 3 | 5 | 4.0 |
jan | north | smith | female | 4 | 4 | 4.0 |
bobby | north | smith | male | 5 | 1 | 3.0 |
cindy | south | perry | female | 4 | 5 | 4.5 |
alice | south | perry | female | 5 | 4 | 4.5 |
Exercise 23¶
Create a new column high_math_achiever
that is an
indicator of if a student got a 4 or better on their math score.
scores %>%
mutate(high_math_achiever = math_score >= 4)
name | school | teacher | sex | math_score | reading_score | high_math_achiever |
---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <lgl> |
mike | south | johnson | male | 4 | 1 | TRUE |
carol | south | johnson | female | 3 | 5 | FALSE |
greg | south | johnson | male | 2 | 2 | FALSE |
marcia | south | johnson | female | 4 | 4 | TRUE |
peter | north | smith | male | 3 | 5 | FALSE |
jan | north | smith | female | 4 | 4 | TRUE |
bobby | north | smith | male | 5 | 1 | TRUE |
cindy | south | perry | female | 4 | 5 | TRUE |
alice | south | perry | female | 5 | 4 | TRUE |
Exercise 24¶
Create a new column reading_score_centered
that is a
student's reading score minus the mean of all students' reading scores.
scores %>%
mutate(reading_score_centered = reading_score - mean(reading_score))
name | school | teacher | sex | math_score | reading_score | reading_score_centered |
---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> |
mike | south | johnson | male | 4 | 1 | -2.4444444 |
carol | south | johnson | female | 3 | 5 | 1.5555556 |
greg | south | johnson | male | 2 | 2 | -1.4444444 |
marcia | south | johnson | female | 4 | 4 | 0.5555556 |
peter | north | smith | male | 3 | 5 | 1.5555556 |
jan | north | smith | female | 4 | 4 | 0.5555556 |
bobby | north | smith | male | 5 | 1 | -2.4444444 |
cindy | south | perry | female | 4 | 5 | 1.5555556 |
alice | south | perry | female | 5 | 4 | 0.5555556 |
Note how this may be confusing: reading_score
is used twice in the mutate()
arugment.
In the first case, it refers to value for the current observation.
In the second case, it refers to the all the values.
Exercise 25¶
Create a new column called science_score
that adds the square of math_score
to reading_score
.
Display the result by sorting on the new column in descending order.
scores %>%
mutate(science_score = math_score**2 + reading_score) %>%
arrange(desc(science_score))
name | school | teacher | sex | math_score | reading_score | science_score |
---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> |
alice | south | perry | female | 5 | 4 | 29 |
bobby | north | smith | male | 5 | 1 | 26 |
cindy | south | perry | female | 4 | 5 | 21 |
marcia | south | johnson | female | 4 | 4 | 20 |
jan | north | smith | female | 4 | 4 | 20 |
mike | south | johnson | male | 4 | 1 | 17 |
carol | south | johnson | female | 3 | 5 | 14 |
peter | north | smith | male | 3 | 5 | 14 |
greg | south | johnson | male | 2 | 2 | 6 |
Part 6: Mutate with groups¶
Exercise 26¶
Create a math_score_centered_by_sex
column that subtracts a student's math_score
from the mean math_score
by sex.
Display the results in descending order on the new column.
scores %>%
group_by(sex) %>%
mutate(math_score_centered_by_sex = math_score - mean(math_score)) %>%
arrange(desc(math_score_centered_by_sex))
name | school | teacher | sex | math_score | reading_score | math_score_centered_by_sex |
---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> |
bobby | north | smith | male | 5 | 1 | 1.5 |
alice | south | perry | female | 5 | 4 | 1.0 |
mike | south | johnson | male | 4 | 1 | 0.5 |
marcia | south | johnson | female | 4 | 4 | 0.0 |
jan | north | smith | female | 4 | 4 | 0.0 |
cindy | south | perry | female | 4 | 5 | 0.0 |
peter | north | smith | male | 3 | 5 | -0.5 |
carol | south | johnson | female | 3 | 5 | -1.0 |
greg | south | johnson | male | 2 | 2 | -1.5 |
Question: If Mike and Cindy both got a $4$ for their math score, then
why does Mike have a higher math_score_centered_by_sex
score?
Answer: He is higher relative to the other males, which are lower overall.
Exercise 27¶
Create a reading_score_centered_by_teacher
column.
scores %>%
group_by(teacher) %>%
mutate(math_score_centered_by_teacher = math_score - mean(math_score))
name | school | teacher | sex | math_score | reading_score | math_score_centered_by_teacher |
---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> |
mike | south | johnson | male | 4 | 1 | 0.75 |
carol | south | johnson | female | 3 | 5 | -0.25 |
greg | south | johnson | male | 2 | 2 | -1.25 |
marcia | south | johnson | female | 4 | 4 | 0.75 |
peter | north | smith | male | 3 | 5 | -1.00 |
jan | north | smith | female | 4 | 4 | 0.00 |
bobby | north | smith | male | 5 | 1 | 1.00 |
cindy | south | perry | female | 4 | 5 | -0.50 |
alice | south | perry | female | 5 | 4 | 0.50 |
Exercise 28¶
Make a number_of_students_in_class
column that is number
of students in a teacher’s class.
For example, it should be $4$ for mike
and $3$ for peter
.
scores %>%
group_by(teacher) %>%
mutate(number_of_students_in_class = n())
name | school | teacher | sex | math_score | reading_score | number_of_students_in_class |
---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <int> |
mike | south | johnson | male | 4 | 1 | 4 |
carol | south | johnson | female | 3 | 5 | 4 |
greg | south | johnson | male | 2 | 2 | 4 |
marcia | south | johnson | female | 4 | 4 | 4 |
peter | north | smith | male | 3 | 5 | 3 |
jan | north | smith | female | 4 | 4 | 3 |
bobby | north | smith | male | 5 | 1 | 3 |
cindy | south | perry | female | 4 | 5 | 2 |
alice | south | perry | female | 5 | 4 | 2 |
Part 7: Summarize¶
Exercise 29¶
Use summarize()
command to find the mean math score for
all students.
scores %>%
summarize(math_score_mean = mean(math_score))
math_score_mean |
---|
<dbl> |
3.777778 |
Note how summarize()
creates a new column, like mutate()
.
It does this because it creates a new table.
Exercise 30¶
Find the mean reading score for all students.
scores %>%
summarize(math_score_mean = mean(reading_score))
math_score_mean |
---|
<dbl> |
3.444444 |
Exercise 31¶
Find the mean for math scores and reading scores added together.
Call the column both_score_mean
.
scores %>%
summarize(both_score_mean = mean((math_score + reading_score)/2))
both_score_mean |
---|
<dbl> |
3.611111 |
Part 8: Summarize with groups¶
Exercise 32¶
Find the minimum math score for each school.
scores %>%
group_by(school) %>%
summarize(min_math_score = min(math_score))
school | min_math_score |
---|---|
<chr> | <dbl> |
north | 3 |
south | 2 |
Exercise 33¶
Find the maximum math score for each teacher.
scores %>%
group_by(teacher) %>%
summarize(max_math_score = max(math_score))
teacher | max_math_score |
---|---|
<chr> | <dbl> |
johnson | 4 |
perry | 5 |
smith | 5 |
Excercise 11.19 Redone
scores %>%
group_by(teacher) %>%
summarize(max_math = max(math_score)) %>%
filter(max_math == 5) %>%
select(teacher)
teacher |
---|
<chr> |
perry |
smith |
Exercise 34¶
If we grouped by sex, and then summarized with the minimum reading score, how many rows would the resulting data frame have?
$2$, because there are $2$ values for sex in the table.
Group by with summarization operations yield dataframes with the same number of rows as distinct values in the grouping.
This is an example of the Split-Apply-Combine pattern.
- To group is to split by distinct value of some feature of set of features
- To summarize to apply some aggregate function to each group
- To combine means to bind the resulting single row tables into one table, with one row for each group
See this article by Brian S. Yandell, a professor of statistics at Wisconsin, for a deeper dive into this topic.
scores %>%
group_by(sex) %>%
summarize(min_reading_score = min(reading_score))
sex | min_reading_score |
---|---|
<chr> | <dbl> |
female | 4 |
male | 1 |
Exercise 35¶
Create a data frame with the mean and median reading score by sex, as well as the number of students of that sex.
scores %>%
group_by(sex) %>%
summarize(mean_reading_score = mean(reading_score),
median_reading_score = median(reading_score),
n = n())
sex | mean_reading_score | median_reading_score | n |
---|---|---|---|
<chr> | <dbl> | <dbl> | <int> |
female | 4.40 | 4.0 | 5 |
male | 2.25 | 1.5 | 4 |
Part 9: Combining verbs¶
Exercise 36¶
Select just the name
and math_score
columns.
Then create a new column math_score_ec
that is a students math score plus 5 extra
credit points.
Finally, arrange the data frame by math_score_ec
from low to high.
scores %>%
select(name, math_score) %>%
mutate(math_score_ec = math_score + 5) %>%
arrange(math_score_ec)
name | math_score | math_score_ec |
---|---|---|
<chr> | <dbl> | <dbl> |
greg | 2 | 7 |
carol | 3 | 8 |
peter | 3 | 8 |
mike | 4 | 9 |
marcia | 4 | 9 |
jan | 4 | 9 |
cindy | 4 | 9 |
bobby | 5 | 10 |
alice | 5 | 10 |
Exercise 37¶
Select every column except the teacher column.
Create a new variabled called mean_score
that is the mean of a student’s math and reading score.
Finally, arrange the data frame by mean_score
from low to high.
scores %>%
select(-teacher) %>%
mutate(mean_score = (math_score + reading_score)/2) %>%
arrange(mean_score)
name | school | sex | math_score | reading_score | mean_score |
---|---|---|---|---|---|
<chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> |
greg | south | male | 2 | 2 | 2.0 |
mike | south | male | 4 | 1 | 2.5 |
bobby | north | male | 5 | 1 | 3.0 |
carol | south | female | 3 | 5 | 4.0 |
marcia | south | female | 4 | 4 | 4.0 |
peter | north | male | 3 | 5 | 4.0 |
jan | north | female | 4 | 4 | 4.0 |
cindy | south | female | 4 | 5 | 4.5 |
alice | south | female | 5 | 4 | 4.5 |
Exercise 38¶
Remove any students with 'smith'
as a teacher, then find the
mean math_score
by sex
.
scores %>%
filter(teacher != "smith") %>%
group_by(sex) %>%
summarize(mean_math_score = mean(math_score))
sex | mean_math_score |
---|---|
<chr> | <dbl> |
female | 4 |
male | 3 |
Exercise 39¶
Find the min, max, and median reading_score
for female
students at south school.
scores %>%
filter(sex == "female", school == "south") %>%
summarize(min_reading_score = min(reading_score),
max_reading_score = max(reading_score),
median_reading_score = median(reading_score))
min_reading_score | max_reading_score | median_reading_score |
---|---|---|
<dbl> | <dbl> | <dbl> |
4 | 5 | 4.5 |
Exercise 40¶
Inspect each of the following code blocks. They both do about the same thing. Which one do you think is preffered from a computer efficiency standpoint?
Code block 1
scores %>%
group_by(school, teacher) %>%
summarize(max_math_score = max(math_score)) %>%
filter(school == "south")
`summarise()` has grouped output by 'school'. You can override using the `.groups` argument.
school | teacher | max_math_score |
---|---|---|
<chr> | <chr> | <dbl> |
south | johnson | 4 |
south | perry | 5 |
Code block 2
scores %>%
filter(school == "south") %>%
group_by(teacher) %>%
summarize(max_math_score = max(math_score))
teacher | max_math_score |
---|---|
<chr> | <dbl> |
johnson | 4 |
perry | 5 |
Answer: They both get the max math score by teacher for teachers at south school. The first block calculates the max_math_score
for both north and south and then filters out north after that calculation. The second block filters out north right away. This is preferred because it prevents the computer from making unnecessary calculations.
Regarding the error, we could have done this in the first code block:
scores %>%
group_by(school, teacher) %>%
summarize(max_math_score = max(math_score), .groups = 'keep') %>%
filter(school == "south")
school | teacher | max_math_score |
---|---|---|
<chr> | <chr> | <dbl> |
south | johnson | 4 |
south | perry | 5 |
Credits¶
This notebook is adapted largely from Stenhaug's "The 5 verbs of dplyr"