R Visualization Exercises¶
Programming for Data Science Bootcamp
Import Libraries¶
library(vctrs)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::data_frame() masks tibble::data_frame(), vctrs::data_frame()
✖ dplyr::filter()     masks stats::filter()
✖ dplyr::lag()        masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Get Data¶
head(mpg)
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <dbl> | <int> | <int> | <chr> | <chr> | <int> | <int> | <chr> | <chr> |
audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
audi | a4 | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | compact |
Exercise 1¶
Run mpg %>% ggplot(). What do you see?
mpg %>% ggplot()
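One way to see why the canvas is empty is to inspect the plot object itself. This is a minimal sketch, assuming ggplot2 is attached; the names p and p2 are only illustrative.

```r
library(ggplot2)

# ggplot() alone records the data but has no layers, so nothing is drawn
p <- ggplot(mpg)
length(p$layers)    # number of layers (geoms) added so far

# Adding a geom appends a layer, and the layer is what draws the marks
p2 <- p + geom_point(aes(x = cyl, y = hwy))
length(p2$layers)
```

The plot stays blank until at least one layer is added.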
Exercise 2¶
Make a scatter plot of hwy vs. cyl in the mpg data set.
mpg %>%
ggplot(aes(x = cyl, y = hwy)) +
geom_point()
aes is actually a value for the mapping argument.
mpg %>%
ggplot(mapping = aes(x = cyl, y = hwy)) +
geom_point()
aes can also be treated as a separate operation.
mpg %>%
ggplot() +
aes(x = cyl, y = hwy) +
geom_point()
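The three forms above should end up with the same plot-level mapping. A quick check, assuming ggplot2 is attached (p1, p2, and p3 are illustrative names):

```r
library(ggplot2)

p1 <- ggplot(mpg, aes(x = cyl, y = hwy)) + geom_point()
p2 <- ggplot(mpg, mapping = aes(x = cyl, y = hwy)) + geom_point()
p3 <- ggplot(mpg) + aes(x = cyl, y = hwy) + geom_point()

# Each plot stores a mapping for the same two aesthetics
lapply(list(p1, p2, p3), function(p) names(p$mapping))
```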
Exercise 3¶
Make a scatter plot of class vs. drv. Why is the plot not useful?
mpg %>%
  ggplot(aes(x = class, y = drv)) +
  geom_point()
The resulting scatter plot has only a few points.
A scatter plot is not a useful display of these variables since both drv and class are categorical variables.
👉 Categorical variables typically take a small number of values, so there are a limited number of unique combinations of $(x, y)$ values that can be displayed.
👉 In this data, drv takes $3$ values and class takes $7$ values, meaning that there are only $21$ positions that could be plotted on a scatter plot of drv vs. class.
👉 In this data, only $12$ of these (drv, class) combinations are observed, and every car with the same combination is drawn on top of the others.
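The counts can be checked directly. This is a sketch, assuming ggplot2 and dplyr are attached; combos is an illustrative name, and geom_count() is one possible alternative display rather than part of the original solution.

```r
library(ggplot2)
library(dplyr)

# One row per observed (class, drv) combination, with the number of cars in each
combos <- mpg %>% count(class, drv)
nrow(combos)                                   # observed combinations
n_distinct(mpg$class) * n_distinct(mpg$drv)    # possible combinations

# geom_count() sizes each point by how many observations overlap there
mpg %>%
  ggplot(aes(x = class, y = drv)) +
  geom_count()
```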
Exercise 4¶
Plot the mathematical function $\sin(x)/x$.
Hint: Use this to create your data and convert to a data frame or tibble:
x <- seq(-6 * pi, 6 * pi, length.out = 100)
x <- seq(-6 * pi, 6 * pi, length.out = 100)
dat <- data.frame(x = x, y = sin(x)/x)
head(dat)
 | x | y |
---|---|---|
 | <dbl> | <dbl> |
1 | -18.84956 | -3.898172e-17 |
2 | -18.46876 | -2.012385e-02 |
3 | -18.08796 | -3.815130e-02 |
4 | -17.70716 | -5.137086e-02 |
5 | -17.32636 | -5.765016e-02 |
6 | -16.94556 | -5.576687e-02 |
ggplot(data = dat,
       mapping = aes(x = x, y = y)) +
  geom_line()
dat %>%
ggplot(aes(x = x, y = y)) +
geom_line()
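The hint grid has an even number of points, so it never lands on $x = 0$, where the plain formula gives 0/0. If a grid does include zero, one option is to substitute the limit value $1$ there. The helper name sinc and the grid below are assumptions made for illustration.

```r
library(ggplot2)

# Hypothetical helper: sin(x)/x with the value at x = 0 replaced by its limit, 1
sinc <- function(x) ifelse(x == 0, 1, sin(x) / x)

# 101 points built from integers, so the middle point is exactly 0
x <- (-50:50) / 50 * 6 * pi
dat0 <- data.frame(x = x, y = sinc(x))

sin(0) / 0    # NaN: the plain formula is undefined at zero
sinc(0)       # the helper returns the limit instead

ggplot(dat0, aes(x = x, y = y)) +
  geom_line()
```

Without the substitution, geom_line() would leave a gap at the peak of the curve.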
Exercise 5¶
Plot the cars data set as a scatter plot using speed vs. dist.
cars %>%
ggplot(aes(x = speed, y = dist)) +
geom_point()
Exercise 6¶
Create the same plot, this time using color to distinguish data points whose stopping distance is greater than $80$.
head(cars)
 | speed | dist |
---|---|---|
 | <dbl> | <dbl> |
1 | 4 | 2 |
2 | 4 | 10 |
3 | 7 | 4 |
4 | 7 | 22 |
5 | 8 | 16 |
6 | 9 | 10 |
cars %>%
ggplot(aes(x = speed, y = dist)) +
geom_point(mapping = aes(color = dist > 80))
cars %>%
ggplot() +
aes(x = speed, y = dist) +
geom_point() +
aes(color = dist > 80)
Exercise 7¶
Change the plot so that values $> 80$ are in red and the others in blue.
Hint: Define the colors using a manual color scale in scale_color_manual().
cars %>%
ggplot(aes(x = speed, y = dist)) +
geom_point(mapping = aes(color = dist > 80)) +
scale_color_manual(values = c("blue", "red"))
Another way, using ifelse():
cars %>%
ggplot(aes(speed, dist)) +
geom_point(color = ifelse(cars$dist > 80, 'red', 'blue'))
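The unnamed vector c("blue", "red") works because FALSE is ordered before TRUE. A sketch of a more explicit variant, assuming ggplot2 is attached, names the colors after the values they apply to (p is an illustrative name):

```r
library(ggplot2)

p <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point(aes(color = dist > 80)) +
  # Names match the values of dist > 80, so the order of the vector no longer matters
  scale_color_manual(values = c("TRUE" = "red", "FALSE" = "blue"))
p

# Colors ggplot2 actually assigned to the points
table(layer_data(p)$colour)
```

Unlike the ifelse() version, which sets color outside aes() and so draws no legend, this keeps the legend.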
Exercise 8¶
Add a second geom that produces a smoothed line.
Use lm as your smoothing method.
Hint: Add geom_smooth() to your graphic.
cars %>%
ggplot(aes(x = speed, y = dist)) +
geom_point(aes(color = dist > 80)) +
scale_color_manual(values = c("black", "red")) +
geom_smooth(method = 'lm')
`geom_smooth()` using formula = 'y ~ x'
Smoothing methods (functions) that can be used include lm, glm, gam, loess, and rlm (from the MASS package).
loess: locally weighted smoothing
cars %>%
ggplot(aes(x = speed, y = dist)) +
geom_point(aes(color = dist > 80)) +
scale_color_manual(values = c("black", "red")) +
geom_smooth(method = 'loess')
`geom_smooth()` using formula = 'y ~ x'
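With method = 'lm' and the default formula y ~ x, the smoothed line is an ordinary least squares fit, which can also be computed directly. This is a sketch, assuming ggplot2 is attached; fit is an illustrative name.

```r
library(ggplot2)

# The same linear model that geom_smooth(method = 'lm') fits with formula = y ~ x
fit <- lm(dist ~ speed, data = cars)
coef(fit)    # intercept and slope of the fitted line

# Spelling out the formula silences the message; se = FALSE hides the confidence band
ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE)
```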
Exercise 9¶
Plot histograms for speed and dist in cars.
cars %>%
ggplot(aes(x = speed)) +
geom_histogram(bins = 10)
cars %>%
ggplot(aes(x = dist)) +
geom_histogram(bins = 10)
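The bins argument fixes the number of bins. An alternative is binwidth, which fixes the width in the units of the variable. The width of 5 and the name p_speed below are illustrative choices, not part of the exercise.

```r
library(ggplot2)

# binwidth is expressed in data units (mph for speed); boundary aligns bin edges to multiples of 5
p_speed <- ggplot(cars, aes(x = speed)) +
  geom_histogram(binwidth = 5, boundary = 0)
p_speed

# Every observation falls in exactly one bin
sum(layer_data(p_speed)$count)
```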
Exercise 10¶
Create a faceted scatter plot of hwy and cty from the mpg data set, with drv as rows and cyl as columns.
What do the empty cells mean?
mpg %>%
ggplot() +
geom_point(aes(x = hwy, y = cty)) +
facet_grid(drv ~ cyl)
The empty cells (facets) in this plot are combinations of drv and cyl that have no observations.
These are the same locations in the scatter plot of drv and cyl that have no points.
mpg %>%
ggplot() +
geom_point(aes(y = drv, x = cyl))
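The empty facets can also be listed rather than read off the plot. This is a sketch, assuming ggplot2, dplyr, and tidyr are attached; observed and empty are illustrative names.

```r
library(ggplot2)
library(dplyr)
library(tidyr)

# Observed (drv, cyl) combinations and how many cars each one has
observed <- mpg %>% count(drv, cyl)

# complete() adds the unobserved combinations with n = 0; these are the empty facets
empty <- observed %>%
  complete(drv, cyl, fill = list(n = 0)) %>%
  filter(n == 0)
empty
```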
Without faceting:
mpg %>%
  ggplot() +
  geom_point(aes(x = hwy, y = cty,
                 color = drv,
                 shape = as.factor(cyl)))
Exercise 11¶
Reproduce this graphic from the iris dataset:
Hint: This graphic uses two geometries, one title, and one theme function.
One of the geometries is geom_density2d() and the theme function is theme_light().
iris %>%
ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species, shape = Species)) +
geom_point() +
geom_density2d() +
ggtitle('IRIS') +
theme_light()
Exercise 12¶
Reproduce this graphic from the iris dataset:
Hints:
(1) Preprocess your data as follows:
iris %>%
  mutate(Species = 'ALL') %>%   # create a copy of iris where Species has only 'ALL'
  bind_rows(iris)               # concatenate with the original iris
(2) This graphic uses faceting with facet_wrap() and the theme function theme_bw().
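The preprocessing in hint (1) can be checked before plotting. This is a sketch, assuming dplyr is attached; iris_all is an illustrative name.

```r
library(dplyr)

# Stack a relabelled copy of iris on top of the original
iris_all <- iris %>%
  mutate(Species = 'ALL') %>%
  bind_rows(iris)

# Twice as many rows, with 'ALL' as an extra Species value for the fourth facet
nrow(iris_all)
count(iris_all, Species)
```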
iris %>%
mutate(Species = 'ALL') %>%
bind_rows(iris) %>%
ggplot(aes(x = Petal.Length, y = Petal.Width, color = Species)) +
geom_point() +
geom_smooth(method = 'loess') +
xlab('Petal Length') +
ylab('Petal Width') +
facet_wrap(~Species, scales = 'free') +
theme_bw()
`geom_smooth()` using formula = 'y ~ x'
Exercise 13¶
Reproduce this graphic from the mtcars dataset:
mtcars %>%
  rownames_to_column() %>%
  mutate(rowname = forcats::fct_reorder(rowname, mpg)) %>%
  ggplot(aes(rowname, mpg, label = rowname)) +
  geom_point() +
  geom_text(nudge_y = .3, hjust = 'left') +
  coord_flip() +
  ylab('Miles per gallon fuel consumption') +
  ylim(10, 40) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0, size = 16),
        axis.title.x = element_text(face = 'bold'),
        axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.line.y = element_blank())
Exercise 14¶
Reproduce this graphic using the mtcars dataset:
mtcars %>%
  ggplot(aes(x = mpg, y = qsec, size = disp, color = as.factor(am))) +
  geom_point() +
  # In mtcars, am is the transmission type: 0 = automatic, 1 = manual
  scale_colour_discrete(name = "Gear",
                        breaks = c("0", "1"),
                        labels = c("Automatic", "Manual")) +
  scale_size_continuous(name = 'Displacement') +
  xlab('Miles per gallon') +
  ylab('1/4 mile time') +
  theme_light()
Exercise 15¶
Reproduce this image using the diamonds dataset:
https://www.r-exercises.com/wp-content/uploads/2018/02/ggplot-exercises-5.png
Process the data.
diamonds2plot <- diamonds %>%
group_by(cut, color) %>%
# See note re .groups in the following
summarize(price = mean(price), .groups = 'drop') %>%
arrange(color, price) %>%
ungroup() %>%
mutate(id = row_number(),
angle = 90 - 360 * (id - 0.5) / n())
Note that we add the .groups = 'drop' argument to summarize() to avoid this message (it is informational, not an error):
`summarise()` has grouped output by 'cut'. You can override using the `.groups` argument.
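The effect of .groups can be seen by comparing the grouping of the two results. This is a sketch, assuming dplyr and ggplot2 are attached; by_default and dropped are illustrative names.

```r
library(dplyr)
library(ggplot2)   # provides the diamonds data set

# Default: only the last grouping variable (color) is dropped, so the result stays grouped by cut
by_default <- diamonds %>%
  group_by(cut, color) %>%
  summarize(price = mean(price))
group_vars(by_default)

# .groups = 'drop' returns an ungrouped tibble and prints no message
dropped <- diamonds %>%
  group_by(cut, color) %>%
  summarize(price = mean(price), .groups = 'drop')
group_vars(dropped)
```

Because the result is already ungrouped, the later ungroup() call in the pipeline is harmless but not strictly needed.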
Build the visualization.
diamonds2plot %>%
ggplot(aes(factor(id), price, fill = color, group = cut, label = cut)) +
geom_bar(stat = 'identity', position = 'dodge') +
geom_text(hjust = 0, angle = diamonds2plot$angle, alpha = .5) +
coord_polar() +
ggtitle('Mean diamond price') +
ylim(-3000, 7000) +
theme_void() +
theme(plot.title = element_text(hjust = 0.5, size = 16, face = 'bold'))