NB: Introducing Tibbles

Based on Wickham and Grolemund 2017

The Tidyverse

Tidyverse is a collection of essential R packages for data science.

The packages included in the Tidyverse are designed to support the pipeline of activities associated with data science, such as filtering, transforming, visualizing, etc.

Tidyverse was created by Hadley Wickham and his team with the aim of providing all these utilities to clean and work with data.

Here’s a graphic of the packages associated with the Tidyverse:

Dplyr

Dplyr introduces a new set of functions that make working with data more intuitive.

  • It does this by introducing a set of functions that work together well to produce pipelines of actions.

Just as important, it introduces a vocabulary for talking about data.

  • This makes it possible to imagine solutions verbally, and then to implement them in code.

To use the Tidyverse, we often import everything:

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.5     ✔ purrr   1.0.1
✔ tibble  3.2.1     ✔ dplyr   1.1.2
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Piping with %>%

Just a quick note about this odd-looking operator that you will start to see.

One of the key elements of the Tidyverse is the use of piping, or the ability to pass the return value of one function to another without having to nest functions.

For example, instead of something like this:

a <- "Hello"
b <- "World"

var1 <- c(a, b)
var2 <- paste(var1)
print(var2)
[1] "Hello" "World"

Or this:

print(paste(c(a,b)))
[1] "Hello" "World"

We can do:

c(a, b) %>%
  paste() %>%
  print()
[1] "Hello" "World"

Although the last pattern is longer than the preceding ones, it is much easier to read and write, especially when we are working with several connected functions.

This is similar to method chaining in Python, but is more pervasive.

  • In Python you can do it with individual objects that return themselves (as it were).
  • In Tidyverse, you can apply it to any two functions so long as it makes sense to pass the output of one as the input of another.
  • Basically, the output of one function becomes the first argument of the function following the %>%.
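
As a minimal sketch of that last point (the vector x is made up for illustration), a piped call and its nested equivalent give the same result, and any extra arguments simply follow the piped value:

```r
library(magrittr)

x <- c(4, 9, 16)

# x %>% f() is evaluated as f(x), so these two lines compute the same thing
piped  <- x %>% sqrt() %>% sum()
nested <- sum(sqrt(x))
identical(piped, nested)
#> [1] TRUE

# The piped value fills the first argument; other arguments come after it,
# so this is the same as paste(c("Hello", "World"), collapse = " ")
greeting <- c("Hello", "World") %>% paste(collapse = " ")
greeting
#> [1] "Hello World"
```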

It is similar to the pipe operator | in Unix shells.

By the way, the operator comes with the magrittr package, which is a central part of the Tidyverse. It is so central, in fact, that packages in the tidyverse load %>% automatically.

It provides a set of operators which make your code more readable.

Tibbles

Dplyr can work with different rectangular data structures:

  • Plain old Dataframes
  • Tibbles
  • Data.tables (see data.table)

The foundational data structure of the Tidyverse is the tibble.

Tibbles are data frames, but they tweak some older behaviors to make your life a little easier.

To learn more about tibbles, check out the vignette:

vignette("tibble")
starting httpd help server ... done

Creating tibbles

If you need to make a tibble “by hand”, you can use tibble() or tribble() (see below).

tibble() works column-wise, by assembling individual vectors:

x <- c(1, 2, 5)
y <- c("a", "b", "h")
tibble(x, y)
# A tibble: 3 × 2
      x y    
  <dbl> <chr>
1     1 a    
2     2 b    
3     5 h    

You can also optionally name the inputs, provide data inline with c(), and perform computation:

tibble(
  x1 = x,
  x2 = c(10, 15, 25),
  y = sqrt(x1^2 + x2^2)
)
# A tibble: 3 × 3
     x1    x2     y
  <dbl> <dbl> <dbl>
1     1    10  10.0
2     2    15  15.1
3     5    25  25.5

Every column in a data frame or tibble must be the same length, so you’ll get an error if the lengths are different.
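
A minimal sketch of such a mismatch is below. The column values are made up for illustration, and tryCatch() is used only so that the script keeps running and the message can be inspected; the exact wording of the message may vary between tibble versions.

```r
library(tibble)

# x has two values and y has three, so tibble() cannot line them up
result <- tryCatch(
  tibble(
    x = c(1, 5),
    y = c("a", "b", "c")
  ),
  error = function(e) conditionMessage(e)  # capture the message instead of stopping
)

# result now holds the error message as a character string
cat(result, "\n")
```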

As the error suggests, individual values will be recycled to the same length as everything else:

tibble(
  x = 1:5,
  y = "a",
  z = TRUE
)
# A tibble: 5 × 3
      x y     z    
  <int> <chr> <lgl>
1     1 a     TRUE 
2     2 a     TRUE 
3     3 a     TRUE 
4     4 a     TRUE 
5     5 a     TRUE 

Tribbles

Another way to create a tibble is with tribble(), which is short for transposed tibble.

tribble() is customized for data entry in code: column headings start with ~ and entries are separated by commas.

This makes it possible to lay out small amounts of data in an easy to read form:

tribble(
  ~x, ~y, ~z,
  "a", 2, 3.6,
  "b", 1, 8.5
)
# A tibble: 2 × 3
  x         y     z
  <chr> <dbl> <dbl>
1 a         2   3.6
2 b         1   8.5

Finally, if you have a regular data frame you can turn it into a tibble with as_tibble():

as_tibble(mtcars)
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ℹ 22 more rows

The inverse of as_tibble() is as.data.frame(); it converts a tibble back into a regular data.frame.
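
Here is a small sketch of the round trip, assuming the tibble package is installed. Note that as_tibble() drops row names by default; the rownames argument (here with the made-up column name model) keeps them as an ordinary column.

```r
library(tibble)

tb <- as_tibble(mtcars)
class(tb)
#> [1] "tbl_df"     "tbl"        "data.frame"

# Keep the car names by moving the row names into a column called "model"
tb_named <- as_tibble(mtcars, rownames = "model")
names(tb_named)[1]
#> [1] "model"

# Convert back to a plain data frame
df <- as.data.frame(tb)
class(df)
#> [1] "data.frame"
```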

Non-syntactic names

It’s possible for a tibble to have column names that are not valid R variable names, names that are non-syntactic.

For example, the variables might not start with a letter or they might contain unusual characters like a space.

To refer to these variables, you need to surround them with backticks, `:

tb <- tibble(
  `:)` = "smile", 
  ` ` = "space",
  `2000` = "number"
)
tb
# A tibble: 1 × 3
  `:)`  ` `   `2000`
  <chr> <chr> <chr> 
1 smile space number

You’ll also need the backticks when working with these variables in other packages, like ggplot2, dplyr, and tidyr.
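
As a hedged sketch, here is how the same tibble might be accessed with base R tools. With $ the name is written as code, so it needs backticks; with [[ the name is passed as a string, so ordinary quotes are enough.

```r
library(tibble)

tb <- tibble(
  `:)` = "smile",
  ` ` = "space",
  `2000` = "number"
)

# $ needs backticks because 2000 is not a valid variable name
tb$`2000`
#> [1] "number"

# [[ takes the name as a string, so no backticks are needed
tb[[":)"]]
#> [1] "smile"
```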

Tibbles vs. data.frame

There are two main differences in the usage of a tibble vs. a classic data.frame: printing and subsetting.

Tip

If these differences cause problems when working with older packages, you can turn a tibble back to a regular data frame with as.data.frame().

Printing

The print method for tibbles shows:

  • Only the first 10 rows
  • All the columns that fit on screen

This makes it much easier to work with large data.

tibble(
  a = lubridate::now() + runif(1e3) * 86400,
  b = lubridate::today() + runif(1e3) * 30,
  c = 1:1e3,
  d = runif(1e3),
  e = sample(letters, 1e3, replace = TRUE)
)
# A tibble: 1,000 × 5
   a                   b              c      d e    
   <dttm>              <date>     <int>  <dbl> <chr>
 1 2023-08-01 19:48:54 2023-08-12     1 0.564  v    
 2 2023-08-02 00:35:28 2023-08-17     2 0.0967 c    
 3 2023-08-02 13:42:30 2023-08-19     3 0.453  s    
 4 2023-08-02 05:56:08 2023-08-21     4 0.709  a    
 5 2023-08-02 15:22:58 2023-08-07     5 0.177  c    
 6 2023-08-01 22:38:30 2023-08-08     6 0.797  x    
 7 2023-08-02 01:17:24 2023-08-08     7 0.0214 b    
 8 2023-08-02 15:12:09 2023-08-30     8 0.264  m    
 9 2023-08-02 01:27:14 2023-08-27     9 0.705  c    
10 2023-08-02 03:44:58 2023-08-08    10 0.638  i    
# ℹ 990 more rows

Where possible, tibbles also use color to draw your eye to important differences.

One of the most important distinctions is between the string "NA" and the missing value, NA:

tibble(x = c("NA", NA))
# A tibble: 2 × 1
  x    
  <chr>
1 NA   
2 <NA> 

Tibbles are designed to avoid overwhelming your console when you print large data frames.

But sometimes you need more output than the default display.

There are a few options that can help.

First, you can explicitly print() the data frame and control the number of rows (n) and the width of the display. width = Inf will display all columns:

library(nycflights13)
flights %>%
  print(n = 10, width = Inf)
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
   arr_delay carrier flight tailnum origin dest  air_time distance  hour minute
       <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl>
 1        11 UA        1545 N14228  EWR    IAH        227     1400     5     15
 2        20 UA        1714 N24211  LGA    IAH        227     1416     5     29
 3        33 AA        1141 N619AA  JFK    MIA        160     1089     5     40
 4       -18 B6         725 N804JB  JFK    BQN        183     1576     5     45
 5       -25 DL         461 N668DN  LGA    ATL        116      762     6      0
 6        12 UA        1696 N39463  EWR    ORD        150      719     5     58
 7        19 B6         507 N516JB  EWR    FLL        158     1065     6      0
 8       -14 EV        5708 N829AS  LGA    IAD         53      229     6      0
 9        -8 B6          79 N593JB  JFK    MCO        140      944     6      0
10         8 AA         301 N3ALAA  LGA    ORD        138      733     6      0
   time_hour          
   <dttm>             
 1 2013-01-01 05:00:00
 2 2013-01-01 05:00:00
 3 2013-01-01 05:00:00
 4 2013-01-01 05:00:00
 5 2013-01-01 06:00:00
 6 2013-01-01 05:00:00
 7 2013-01-01 06:00:00
 8 2013-01-01 06:00:00
 9 2013-01-01 06:00:00
10 2013-01-01 06:00:00
# ℹ 336,766 more rows

You can also control the default print behavior by setting options:

  • options(tibble.print_max = n, tibble.print_min = m): if more than n rows, print only m rows.
  • Use options(tibble.print_min = Inf) to always show all rows.
  • Use options(tibble.width = Inf) to always print all columns, regardless of the width of the screen.

You can see a complete list of options by looking at the package help with package?tibble.
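
A minimal sketch of changing these options temporarily is shown below. The values 5 and 3 are arbitrary, and mtcars is used only because it has more than 5 rows. options() returns the previous settings, which makes it easy to restore them afterwards.

```r
library(tibble)

# Save the old settings while installing new ones
old <- options(tibble.print_max = 5, tibble.print_min = 3)

# mtcars has 32 rows, more than print_max, so only print_min rows should be shown
print(as_tibble(mtcars))

# Restore the previous settings
options(old)
```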

Using RStudio View()

A final option is to use RStudio’s built-in data viewer to get a scrollable view of the complete dataset. This is also often useful at the end of a long chain of manipulations.

flights %>%
  View()

Extracting variables

So far all the tools you’ve learned have worked with complete data frames.

If you want to pull out a single variable, you can use dplyr::pull():

tb <- tibble(
  id = LETTERS[1:5],
  x1  = 1:5,
  y1  = 6:10
)
tb %>%  
  pull(x1) # by name
[1] 1 2 3 4 5
tb %>%  
  pull(1)  # by position
[1] "A" "B" "C" "D" "E"

pull() also takes an optional name argument that specifies the column to be used as names for a named vector, which you’ll learn about in the chapter on vectors.

tb %>%  
  pull(x1, name = id)
A B C D E 
1 2 3 4 5 

You can also use the base R tools $ and [[. [[ can extract by name or position; $ only extracts by name but is a little less typing.

Extract by name:

tb$x1
[1] 1 2 3 4 5
tb[["x1"]]
[1] 1 2 3 4 5

Extract by position:

tb[[1]]
[1] "A" "B" "C" "D" "E"

Compared to a data frame, tibbles are more strict: they never do partial matching, and they will generate a warning if the column you are trying to access does not exist.

# Tibbles complain a lot:
tb$x
Warning: Unknown or uninitialised column: `x`.
NULL
tb$z
Warning: Unknown or uninitialised column: `z`.
NULL
# Data frames use partial matching and don't complain if a column doesn't exist
df <- as.data.frame(tb)
df$x
[1] 1 2 3 4 5
df$z
NULL

For this reason we sometimes joke that tibbles are lazy and surly: they do less and complain more.

Subsetting

Lastly, there are some important differences when using [.

With data.frames, [ sometimes returns a data.frame, and sometimes returns a vector.

  • This is a common source of bugs.

With tibbles, [ always returns another tibble.

  • This can sometimes cause problems when working with older code.
  • If you hit a function that does not work with a tibble, just use as.data.frame() to turn your tibble back to a data.frame.
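
The difference can be sketched with a small made-up data frame. The drop = FALSE argument shown at the end is the usual base R way to keep a data.frame from collapsing to a vector.

```r
library(tibble)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

# data.frame: selecting one column drops down to a plain vector
class(df[, "x"])
#> [1] "integer"

# data.frame: selecting two columns returns a data.frame
class(df[, c("x", "y")])
#> [1] "data.frame"

# tibble: [ returns a tibble even for a single column
class(tb[, "x"])
#> [1] "tbl_df"     "tbl"        "data.frame"

# To get a vector from a tibble, use [[ or $ instead
tb[["x"]]
#> [1] 1 2 3

# To keep a data.frame from a data.frame, ask for it explicitly
class(df[, "x", drop = FALSE])
#> [1] "data.frame"
```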