R Dplyr Exercises¶

Programming for Data Science Bootcamp

Exercise 1¶

Install Tidyverse

In [1]:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Exercise 2¶

We create a super simple tibble to use in the exercises.

You can find a text file called exercise-data.txt in the directory of today's module (M11).

Just cut and paste it into you code.

In [2]:
scores <- 
  tibble(
    name = c("mike", "carol", "greg", "marcia", "peter", "jan", "bobby", "cindy", "alice"),
    school = c("south", "south", "south", "south", "north", "north", "north", "south", "south"),
    teacher = c("johnson", "johnson", "johnson", "johnson",  "smith", "smith", "smith", "perry", "perry"),
    sex = c("male", "female", "male", "female", "male", "female", "male", "female", "female"),
    math_score = c(4, 3, 2, 4, 3, 4, 5, 4, 5),
    reading_score = c(1, 5, 2, 4, 5, 4, 1, 5, 4)
)

Exercise 3¶

View the tibble.

In [3]:
scores
A tibble: 9 × 6
nameschoolteachersexmath_scorereading_score
<chr><chr><chr><chr><dbl><dbl>
mike southjohnsonmale 41
carol southjohnsonfemale35
greg southjohnsonmale 22
marciasouthjohnsonfemale44
peter northsmith male 35
jan northsmith female44
bobby northsmith male 51
cindy southperry female45
alice southperry female54
In [52]:
print(scores)
# A tibble: 9 × 6
  name   school teacher sex    math_score reading_score
  <chr>  <chr>  <chr>   <chr>       <dbl>         <dbl>
1 mike   south  johnson male            4             1
2 carol  south  johnson female          3             5
3 greg   south  johnson male            2             2
4 marcia south  johnson female          4             4
5 peter  north  smith   male            3             5
6 jan    north  smith   female          4             4
7 bobby  north  smith   male            5             1
8 cindy  south  perry   female          4             5
9 alice  south  perry   female          5             4

Exercise 4¶

Make sure you understand the difference between

  • doing something and assigning it to a variable
  • and just doing it without assigning it

when using %>%.

First, just get the first three rows (using slice()).

Then, assign the first three rows to a variable.

In [53]:
scores %>% 
  slice(1:3)
A tibble: 3 × 6
nameschoolteachersexmath_scorereading_score
<chr><chr><chr><chr><dbl><dbl>
mike southjohnsonmale 41
carolsouthjohnsonfemale35
greg southjohnsonmale 22

Get the first 3 rows, and assign it to a new name scores_small.

In [54]:
scores_small <- scores %>% 
  slice(1:3)

See what's in scores_small.

In [55]:
scores_small
A tibble: 3 × 6
nameschoolteachersexmath_scorereading_score
<chr><chr><chr><chr><dbl><dbl>
mike southjohnsonmale 41
carolsouthjohnsonfemale35
greg southjohnsonmale 22

Another option:

In [56]:
scores %>% head(3)
A tibble: 3 × 6
nameschoolteachersexmath_scorereading_score
<chr><chr><chr><chr><dbl><dbl>
mike southjohnsonmale 41
carolsouthjohnsonfemale35
greg southjohnsonmale 22

And the old way:

In [59]:
scores[1:3,]
A tibble: 3 × 6
nameschoolteachersexmath_scorereading_score
<chr><chr><chr><chr><dbl><dbl>
mike southjohnsonmale 41
carolsouthjohnsonfemale35
greg southjohnsonmale 22

Part 1: Arrange¶

Exercise 5¶

Sort the data by math_score from high to low.

Who had the best math score?

In [60]:
scores %>% 
  arrange(desc(math_score))
A tibble: 9 × 6
nameschoolteachersexmath_scorereading_score
<chr><chr><chr><chr><dbl><dbl>
bobby northsmith male 51
alice southperry female54
mike southjohnsonmale 41
marciasouthjohnsonfemale44
jan northsmith female44
cindy southperry female45
carol southjohnsonfemale35
peter northsmith male 35
greg southjohnsonmale 22

Answer: Bobby and Alice both tied for the highest math score

Exercise 6¶

Sort the data by name in alphabetical order.

In [63]:
scores %>% 
  arrange(name)
A tibble: 9 × 6
nameschoolteachersexmath_scorereading_score
<chr><chr><chr><chr><dbl><dbl>
alice southperry female54
bobby northsmith male 51
carol southjohnsonfemale35
cindy southperry female45
greg southjohnsonmale 22
jan northsmith female44
marciasouthjohnsonfemale44
mike southjohnsonmale 41
peter northsmith male 35

Exercise 7¶

Sort the data by sex so females show up first.

In [64]:
scores %>% 
  arrange(sex)
A tibble: 9 × 6
nameschoolteachersexmath_scorereading_score
<chr><chr><chr><chr><dbl><dbl>
carol southjohnsonfemale35
marciasouthjohnsonfemale44
jan northsmith female44
cindy southperry female45
alice southperry female54
mike southjohnsonmale 41
greg southjohnsonmale 22
peter northsmith male 35
bobby northsmith male 51

Exercise 8¶

Sort the data by school, teacher, sex, math_score, and reading_score.

In [67]:
scores %>%
    arrange(school, teacher, sex, math_score, reading_score)
A tibble: 9 × 6
nameschoolteachersexmath_scorereading_score
<chr><chr><chr><chr><dbl><dbl>
jan northsmith female44
peter northsmith male 35
bobby northsmith male 51
carol southjohnsonfemale35
marciasouthjohnsonfemale44
greg southjohnsonmale 22
mike southjohnsonmale 41
cindy southperry female45
alice southperry female54

Part 2: Select¶

Exercise 9¶

Select only name, math_score, and reading_score.

In [68]:
scores %>% 
  select(name, math_score, reading_score)
A tibble: 9 × 3
namemath_scorereading_score
<chr><dbl><dbl>
mike 41
carol 35
greg 22
marcia44
peter 35
jan 44
bobby 51
cindy 45
alice 54

Exercise 10¶

Select all of the columns except the sex column.

In [69]:
scores %>% 
  select(-sex)
A tibble: 9 × 5
nameschoolteachermath_scorereading_score
<chr><chr><chr><dbl><dbl>
mike southjohnson41
carol southjohnson35
greg southjohnson22
marciasouthjohnson44
peter northsmith 35
jan northsmith 44
bobby northsmith 51
cindy southperry 45
alice southperry 54

Exercise 11¶

Select all of the columns except the math_score and reading_score columns.

In [70]:
scores %>% 
  select(-sex, -reading_score)
A tibble: 9 × 4
nameschoolteachermath_score
<chr><chr><chr><dbl>
mike southjohnson4
carol southjohnson3
greg southjohnson2
marciasouthjohnson4
peter northsmith 3
jan northsmith 4
bobby northsmith 5
cindy southperry 4
alice southperry 5

Exercise 12¶

Keep all of the columns but rearrange them so that sex is the first column.

In [71]:
scores %>% 
  select(sex, everything())
A tibble: 9 × 6
sexnameschoolteachermath_scorereading_score
<chr><chr><chr><chr><dbl><dbl>
male mike southjohnson41
femalecarol southjohnson35
male greg southjohnson22
femalemarciasouthjohnson44
male peter northsmith 35
femalejan northsmith 44
male bobby northsmith 51
femalecindy southperry 45
femalealice southperry 54

Part 3: Filter¶

Exercise 13¶

Filter by students who are male and went to south.

Option 1

In [17]:
scores %>% 
  filter(sex == "male" & school == "south")
A tibble: 2 × 6
nameschoolteachersexmath_scorereading_score
<chr><chr><chr><chr><dbl><dbl>
mikesouthjohnsonmale41
gregsouthjohnsonmale22

Option 2

Using a comma , instead of &.

In [18]:
scores %>% 
  filter(sex == "male", school == "south")
A tibble: 2 × 6
nameschoolteachersexmath_scorereading_score
<chr><chr><chr><chr><dbl><dbl>
mikesouthjohnsonmale41
gregsouthjohnsonmale22

Exercise 14¶

Filter by students who did above average in math.

In [72]:
scores %>%
    filter(math_score > mean(math_score))
A tibble: 6 × 6
nameschoolteachersexmath_scorereading_score
<chr><chr><chr><chr><dbl><dbl>
mike southjohnsonmale 41
marciasouthjohnsonfemale44
jan northsmith female44
bobby northsmith male 51
cindy southperry female45
alice southperry female54

Exercise 15¶

Use filter() to figure out how many students had a math score of $4$ or more and a reading score of $3$ or more.

In [73]:
scores %>%
    filter(math_score >= 4 & reading_score >= 3)
A tibble: 4 × 6
nameschoolteachersexmath_scorereading_score
<chr><chr><chr><chr><dbl><dbl>
marciasouthjohnsonfemale44
jan northsmith female44
cindy southperry female45
alice southperry female54

Exercise 16¶

Filter by students who got a $3$ or worse in either math or reading.

In [74]:
scores %>%
    filter(math_score <= 3 | reading_score <= 3)
A tibble: 5 × 6
nameschoolteachersexmath_scorereading_score
<chr><chr><chr><chr><dbl><dbl>
mike southjohnsonmale 41
carolsouthjohnsonfemale35
greg southjohnsonmale 22
peternorthsmith male 35
bobbynorthsmith male 51

Exercise 17¶

Filter by students who got a reading score of $2$, $3$, or $4$.

In [76]:
scores %>%
    filter(reading_score == 2 | reading_score == 3 | reading_score == 4)
A tibble: 4 × 6
nameschoolteachersexmath_scorereading_score
<chr><chr><chr><chr><dbl><dbl>
greg southjohnsonmale 22
marciasouthjohnsonfemale44
jan northsmith female44
alice southperry female54
In [77]:
scores %>%
    filter(reading_score >= 2 & reading_score <= 4)
A tibble: 4 × 6
nameschoolteachersexmath_scorereading_score
<chr><chr><chr><chr><dbl><dbl>
greg southjohnsonmale 22
marciasouthjohnsonfemale44
jan northsmith female44
alice southperry female54
In [78]:
scores %>%
    filter(reading_score %in% 2:4)
A tibble: 4 × 6
nameschoolteachersexmath_scorereading_score
<chr><chr><chr><chr><dbl><dbl>
greg southjohnsonmale 22
marciasouthjohnsonfemale44
jan northsmith female44
alice southperry female54

Exercise 18¶

Filter by students who have a name that starts with 'm'.

Hint: type ?substr in the console and then scroll to the bottom of the help file to see useful examples.

In [79]:
# ?substr
In [80]:
scores %>%
    filter(substr(name, 1, 1) == 'm')
A tibble: 2 × 6
nameschoolteachersexmath_scorereading_score
<chr><chr><chr><chr><dbl><dbl>
mike southjohnsonmale 41
marciasouthjohnsonfemale44

Part 4: Filter with groups¶

Exercise 19¶

Filter to teachers whose best math student got a score of $5$.

In [81]:
scores %>% 
  group_by(teacher) %>% 
  filter(max(math_score) == 5)
A grouped_df: 5 × 6
nameschoolteachersexmath_scorereading_score
<chr><chr><chr><chr><dbl><dbl>
peternorthsmithmale 35
jan northsmithfemale44
bobbynorthsmithmale 51
cindysouthperryfemale45
alicesouthperryfemale54

Note that all students from both teachers are shown.

Exercise 20¶

Filter to the sex with a mean math score of $4$.

In [91]:
scores %>%
    group_by(sex) %>%
    filter(mean(math_score) == 4)
A grouped_df: 5 × 6
nameschoolteachersexmath_scorereading_score
<chr><chr><chr><chr><dbl><dbl>
carol southjohnsonfemale35
marciasouthjohnsonfemale44
jan northsmith female44
cindy southperry female45
alice southperry female54

Part 5: Mutate¶

Exercise 21¶

Set the math and reading scores to $10$ times their original values.

In [94]:
scores %>% 
  mutate(
      math_score =  math_score * 10, 
      reading_score = reading_score * 10
  )
A tibble: 9 × 6
nameschoolteachersexmath_scorereading_score
<chr><chr><chr><chr><dbl><dbl>
mike southjohnsonmale 4010
carol southjohnsonfemale3050
greg southjohnsonmale 2020
marciasouthjohnsonfemale4040
peter northsmith male 3050
jan northsmith female4040
bobby northsmith male 5010
cindy southperry female4050
alice southperry female5040

Exercise 22¶

Create a new column called math_reading_avg which is the average of a student's math and reading scores.

That is, combine the two scores and take their average.

In [97]:
scores %>% 
    mutate(math_reading_avg = (math_score + reading_score) / 2)
A tibble: 9 × 7
nameschoolteachersexmath_scorereading_scoremath_reading_avg
<chr><chr><chr><chr><dbl><dbl><dbl>
mike southjohnsonmale 412.5
carol southjohnsonfemale354.0
greg southjohnsonmale 222.0
marciasouthjohnsonfemale444.0
peter northsmith male 354.0
jan northsmith female444.0
bobby northsmith male 513.0
cindy southperry female454.5
alice southperry female544.5

Exercise 23¶

Create a new column high_math_achiever that is an indicator of if a student got a 4 or better on their math score.

In [100]:
scores %>% 
  mutate(high_math_achiever = math_score >= 4)
A tibble: 9 × 7
nameschoolteachersexmath_scorereading_scorehigh_math_achiever
<chr><chr><chr><chr><dbl><dbl><lgl>
mike southjohnsonmale 41 TRUE
carol southjohnsonfemale35FALSE
greg southjohnsonmale 22FALSE
marciasouthjohnsonfemale44 TRUE
peter northsmith male 35FALSE
jan northsmith female44 TRUE
bobby northsmith male 51 TRUE
cindy southperry female45 TRUE
alice southperry female54 TRUE

Exercise 24¶

Create a new column reading_score_centered that is a student's reading score minus the mean of all students' reading scores.

In [101]:
scores %>% 
  mutate(reading_score_centered = reading_score - mean(reading_score))
A tibble: 9 × 7
nameschoolteachersexmath_scorereading_scorereading_score_centered
<chr><chr><chr><chr><dbl><dbl><dbl>
mike southjohnsonmale 41-2.4444444
carol southjohnsonfemale35 1.5555556
greg southjohnsonmale 22-1.4444444
marciasouthjohnsonfemale44 0.5555556
peter northsmith male 35 1.5555556
jan northsmith female44 0.5555556
bobby northsmith male 51-2.4444444
cindy southperry female45 1.5555556
alice southperry female54 0.5555556

Note how this may be confusing: reading_score is used twice in the mutate() arugment.

In the first case, it refers to value for the current observation.

In the second case, it refers to the all the values.

Exercise 25¶

Create a new column called science_score that adds the square of math_score to reading_score.

Display the result by sorting on the new column in descending order.

In [114]:
scores %>% 
    mutate(science_score = math_score**2 + reading_score) %>%
    arrange(desc(science_score))
A tibble: 9 × 7
nameschoolteachersexmath_scorereading_scorescience_score
<chr><chr><chr><chr><dbl><dbl><dbl>
alice southperry female5429
bobby northsmith male 5126
cindy southperry female4521
marciasouthjohnsonfemale4420
jan northsmith female4420
mike southjohnsonmale 4117
carol southjohnsonfemale3514
peter northsmith male 3514
greg southjohnsonmale 22 6

Part 6: Mutate with groups¶

Exercise 26¶

Create a math_score_centered_by_sex column that subtracts a student's math_score from the mean math_score by sex.

Display the results in descending order on the new column.

In [115]:
scores %>% 
  group_by(sex) %>% 
  mutate(math_score_centered_by_sex = math_score - mean(math_score)) %>%
  arrange(desc(math_score_centered_by_sex))
A grouped_df: 9 × 7
nameschoolteachersexmath_scorereading_scoremath_score_centered_by_sex
<chr><chr><chr><chr><dbl><dbl><dbl>
bobby northsmith male 51 1.5
alice southperry female54 1.0
mike southjohnsonmale 41 0.5
marciasouthjohnsonfemale44 0.0
jan northsmith female44 0.0
cindy southperry female45 0.0
peter northsmith male 35-0.5
carol southjohnsonfemale35-1.0
greg southjohnsonmale 22-1.5

Question: If Mike and Cindy both got a $4$ for their math score, then
why does Mike have a higher math_score_centered_by_sex score?

Answer: He is higher relative to the other males, which are lower overall.

Exercise 27¶

Create a reading_score_centered_by_teacher column.

In [117]:
scores %>% 
  group_by(teacher) %>% 
  mutate(math_score_centered_by_teacher = math_score - mean(math_score))
A grouped_df: 9 × 7
nameschoolteachersexmath_scorereading_scoremath_score_centered_by_teacher
<chr><chr><chr><chr><dbl><dbl><dbl>
mike southjohnsonmale 41 0.75
carol southjohnsonfemale35-0.25
greg southjohnsonmale 22-1.25
marciasouthjohnsonfemale44 0.75
peter northsmith male 35-1.00
jan northsmith female44 0.00
bobby northsmith male 51 1.00
cindy southperry female45-0.50
alice southperry female54 0.50

Exercise 28¶

Make a number_of_students_in_class column that is number of students in a teacher’s class.

For example, it should be $4$ for mike and $3$ for peter.

In [119]:
scores %>% 
  group_by(teacher) %>% 
  mutate(number_of_students_in_class = n())
A grouped_df: 9 × 7
nameschoolteachersexmath_scorereading_scorenumber_of_students_in_class
<chr><chr><chr><chr><dbl><dbl><int>
mike southjohnsonmale 414
carol southjohnsonfemale354
greg southjohnsonmale 224
marciasouthjohnsonfemale444
peter northsmith male 353
jan northsmith female443
bobby northsmith male 513
cindy southperry female452
alice southperry female542

Part 7: Summarize¶

Exercise 29¶

Use summarize() command to find the mean math score for all students.

In [121]:
scores %>% 
  summarize(math_score_mean = mean(math_score))
A tibble: 1 × 1
math_score_mean
<dbl>
3.777778

Note how summarize() creates a new column, like mutate().

It does this because it creates a new table.

Exercise 30¶

Find the mean reading score for all students.

In [127]:
scores %>% 
  summarize(math_score_mean = mean(reading_score))
A tibble: 1 × 1
math_score_mean
<dbl>
3.444444

Exercise 31¶

Find the mean for math scores and reading scores added together.

Call the column both_score_mean.

In [128]:
scores %>% 
  summarize(both_score_mean = mean((math_score + reading_score)/2))
A tibble: 1 × 1
both_score_mean
<dbl>
3.611111

Part 8: Summarize with groups¶

Exercise 32¶

Find the minimum math score for each school.

In [129]:
scores %>% 
  group_by(school) %>% 
  summarize(min_math_score = min(math_score))
A tibble: 2 × 2
schoolmin_math_score
<chr><dbl>
north3
south2

Exercise 33¶

Find the maximum math score for each teacher.

In [130]:
scores %>% 
  group_by(teacher) %>% 
  summarize(max_math_score = max(math_score))
A tibble: 3 × 2
teachermax_math_score
<chr><dbl>
johnson4
perry 5
smith 5

Excercise 11.19 Redone

In [132]:
scores %>% 
    group_by(teacher) %>% 
    summarize(max_math = max(math_score)) %>%
    filter(max_math == 5) %>%
    select(teacher)
A tibble: 2 × 1
teacher
<chr>
perry
smith

Exercise 34¶

If we grouped by sex, and then summarized with the minimum reading score, how many rows would the resulting data frame have?

$2$, because there are $2$ values for sex in the table.

Group by with summarization operations yield dataframes with the same number of rows as distinct values in the grouping.

This is an example of the Split-Apply-Combine pattern.

  • To group is to split by distinct value of some feature of set of features
  • To summarize to apply some aggregate function to each group
  • To combine means to bind the resulting single row tables into one table, with one row for each group

See this article by Brian S. Yandell, a professor of statistics at Wisconsin, for a deeper dive into this topic.

In [135]:
scores %>% 
  group_by(sex) %>%
  summarize(min_reading_score = min(reading_score))
A tibble: 2 × 2
sexmin_reading_score
<chr><dbl>
female4
male 1

Exercise 35¶

Create a data frame with the mean and median reading score by sex, as well as the number of students of that sex.

In [137]:
scores %>% 
  group_by(sex) %>% 
  summarize(mean_reading_score = mean(reading_score), 
            median_reading_score = median(reading_score), 
            n = n())
A tibble: 2 × 4
sexmean_reading_scoremedian_reading_scoren
<chr><dbl><dbl><int>
female4.404.05
male 2.251.54

Part 9: Combining verbs¶

Exercise 36¶

Select just the name and math_score columns.

Then create a new column math_score_ec that is a students math score plus 5 extra credit points.

Finally, arrange the data frame by math_score_ec from low to high.

In [139]:
scores %>% 
  select(name, math_score) %>% 
  mutate(math_score_ec = math_score + 5) %>% 
  arrange(math_score_ec)
A tibble: 9 × 3
namemath_scoremath_score_ec
<chr><dbl><dbl>
greg 2 7
carol 3 8
peter 3 8
mike 4 9
marcia4 9
jan 4 9
cindy 4 9
bobby 510
alice 510

Exercise 37¶

Select every column except the teacher column.

Create a new variabled called mean_score that is the mean of a student’s math and reading score.

Finally, arrange the data frame by mean_score from low to high.

In [141]:
scores %>% 
  select(-teacher) %>% 
  mutate(mean_score = (math_score + reading_score)/2) %>% 
  arrange(mean_score)
A tibble: 9 × 6
nameschoolsexmath_scorereading_scoremean_score
<chr><chr><chr><dbl><dbl><dbl>
greg southmale 222.0
mike southmale 412.5
bobby northmale 513.0
carol southfemale354.0
marciasouthfemale444.0
peter northmale 354.0
jan northfemale444.0
cindy southfemale454.5
alice southfemale544.5

Exercise 38¶

Remove any students with 'smith' as a teacher, then find the mean math_score by sex.

In [143]:
scores %>% 
  filter(teacher != "smith") %>% 
  group_by(sex) %>% 
  summarize(mean_math_score = mean(math_score))
A tibble: 2 × 2
sexmean_math_score
<chr><dbl>
female4
male 3

Exercise 39¶

Find the min, max, and median reading_score for female students at south school.

In [144]:
scores %>% 
  filter(sex == "female", school == "south") %>% 
  summarize(min_reading_score = min(reading_score),
            max_reading_score = max(reading_score),
            median_reading_score = median(reading_score))
A tibble: 1 × 3
min_reading_scoremax_reading_scoremedian_reading_score
<dbl><dbl><dbl>
454.5

Exercise 40¶

Inspect each of the following code blocks. They both do about the same thing. Which one do you think is preffered from a computer efficiency standpoint?

Code block 1

In [157]:
scores %>% 
  group_by(school, teacher) %>% 
  summarize(max_math_score = max(math_score)) %>% 
  filter(school == "south")
`summarise()` has grouped output by 'school'. You can override using the
`.groups` argument.
A grouped_df: 2 × 3
schoolteachermax_math_score
<chr><chr><dbl>
southjohnson4
southperry 5

Code block 2

In [146]:
scores %>% 
  filter(school == "south") %>% 
  group_by(teacher) %>% 
  summarize(max_math_score = max(math_score))
A tibble: 2 × 2
teachermax_math_score
<chr><dbl>
johnson4
perry 5

Answer: They both get the max math score by teacher for teachers at south school. The first block calculates the max_math_score for both north and south and then filters out north after that calculation. The second block filters out north right away. This is preferred because it prevents the computer from making unnecessary calculations.

Regarding the error, we could have done this in the first code block:

In [159]:
scores %>% 
  group_by(school, teacher) %>% 
  summarize(max_math_score = max(math_score), .groups = 'keep') %>% 
  filter(school == "south")
A grouped_df: 2 × 3
schoolteachermax_math_score
<chr><chr><dbl>
southjohnson4
southperry 5

Credits¶

This notebook is adapted largely from Stenhaug's "The 5 verbs of dplyr"