Programming for Data Science
Base R comes with several data structures: vectors, matrices, arrays, lists, and data frames.
Here are some quick definitions before going into more detail.
A vector is an ordered collection of elements in which every element has the same data type (integers, characters, reals, and so on).
Vectors are immutable.
Vectors are indexed beginning with the number \(1\).
\(1\)-based indexing is used throughout R.
A matrix is just a two-dimensional vector.
It too has one data type and is immutable.
An array in R is a generalization of the vector and matrix. It may have one or more dimensions.
So, an array with one dimension is almost the same as a vector.
An array with two dimensions is almost the same as a matrix.
An array with three or more dimensions is an n-dimensional array.
They are like NumPy arrays in Python.
A list can hold items of different types and the list size can be increased on the fly.
List contents can be accessed either by index (like mylist[[1]]) or by name (like mylist$age).
Lists are like lists in Python.
A data frame is called a table in many languages.
This is the workhorse of R.
Each column holds the same type, and the columns can have header names.
A data frame is essentially a kind of list: a list of vectors, each with the same length but possibly of different data types.
Structure | Dim | Data Type | Shape | Python |
---|---|---|---|---|
Vector | \(1\) | single | — | — |
Matrix | \(2\) | single | uniform | — |
Array | \(N\) | single | uniform | NumPy array |
List | \(1\) | multiple | ragged | List, Dict |
Data Frame | \(2\) | multiple | uniform | Pandas Data Frame |
These reflect the evolution of R.
We mainly use Vectors and Data Frames.
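As a rough illustration of the five structures summarized above, the sketch below builds one small instance of each and inspects its shape. All variable names and values are made up for this example.

```r
# One small instance of each structure (names and values are illustrative).
vec <- c(2, 3, 5)                          # vector: one type, one dimension
mat <- matrix(1:6, nrow = 2, ncol = 3)     # matrix: one type, two dimensions
arr <- array(1:24, dim = c(2, 3, 4))       # array: one type, N dimensions
lst <- list(name = "Ann", age = 30)        # list: mixed types
df  <- data.frame(name = c("Ann", "Bob"),  # data frame: equal-length columns,
                  age  = c(30, 25))        # each column with its own type

dim(mat)       # 2 3
dim(arr)       # 2 3 4
lst[[1]]       # access by index: "Ann"
lst$age        # access by name: 30
is.list(df)    # TRUE, since a data frame is a list of column vectors
```

Calling sapply(df, class) on the data frame reports the type of each column, which is a quick way to confirm that columns may differ in type while sharing a length.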
c()
Vectors may be created with the c() function (“c” stands for combine).
Here is a vector of three numeric values:
c(2, 3, 5)
And here is a vector of logical values:
c(TRUE, FALSE, TRUE, FALSE, FALSE)
A vector can also contain character strings:
c("aa", "bb", "cc", "dd", "ee")
Vectors may be created from sequences using :, seq(), and rep()
Vectors can be made out of sequences which may be generated in a few ways.
s1 <- 2:5
s1
The seq() function is like Python’s range(), except that the end value is included.
s2 <- seq(from=1, to=5, by=2)
s2
You can drop the argument names and write seq(1,5,2).
The rep() function will create a series of repeated values:
s3 <- rep(1, 5)
s3
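Both seq() and rep() accept further arguments that are often handy. The sketch below is only an illustration of a few of them; the variable names are invented for the example.

```r
# rep() can repeat a whole vector or each element in turn.
r_times <- rep(c(1, 2), times = 3)     # 1 2 1 2 1 2
r_each  <- rep(c(1, 2), each = 3)      # 1 1 1 2 2 2

# seq() can take a target length instead of a step size.
s_len <- seq(0, 1, length.out = 5)     # 0.00 0.25 0.50 0.75 1.00

# seq_len(n) is a safe way to write 1:n (it is empty when n is 0).
idx <- seq_len(4)                      # 1 2 3 4

r_times
r_each
s_len
idx
```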
length()
The number of members in a vector is given by the length() function.
length(c("aa", "bb", "cc", "dd", "ee"))
Vectors can be combined via the function c().
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc", "dd", "ee")
c(n, s)
Notice how the numeric values are being coerced into character strings when the two vectors are combined.
This is necessary so as to maintain the same primitive data type for members in the same vector.
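The coercion can be observed directly with class(). The following sketch uses small made-up vectors to show that mixing types moves the result toward the more general type (logical, then numeric, then character).

```r
n <- c(2, 3, 5)
s <- c("aa", "bb")

class(n)                      # "numeric"
combined <- c(n, s)
class(combined)               # "character": the numbers became strings
combined                      # "2" "3" "5" "aa" "bb"

mixed_num <- c(TRUE, FALSE, 2)   # logicals become numbers: 1 0 2
mixed_chr <- c(1, "a", TRUE)     # everything becomes character: "1" "a" "TRUE"
mixed_num
mixed_chr
```

Going the other way requires an explicit conversion such as as.numeric(), which yields NA (with a warning) for strings that do not look like numbers.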
Arithmetic operations of vectors are performed member-by-member, i.e., member-wise.
We called this ‘element-wise’ in the context of NumPy.
For example, suppose we have two vectors a and b.
a <- c(1, 3, 5, 7)
b <- c(1, 2, 4, 8)
If we multiply a by 5, we would get a vector with each of its members multiplied by 5.
5 * a
And if we add a and b together, the sum would be a vector whose members are the sum of the corresponding members from a and b.
a + b
Similarly for subtraction, multiplication and division, we get new vectors via member-wise operations.
a - b
a * b
a / b
If two vectors in an operation are of unequal length, the shorter one will be recycled in order to match the longer vector.
This is similar to broadcasting in NumPy and Pandas.
For example, the following vectors u and v have different lengths, and their sum is computed by recycling values of the shorter vector u.
u <- c(10, 20, 30)
v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
u + v
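To make the recycling rule concrete, the sketch below spells out what R does implicitly and shows the case where the longer length is not a multiple of the shorter one. The names recycled, explicit, and uneven are only for illustration.

```r
u <- c(10, 20, 30)
v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# Implicit recycling: u is reused three times to match the length of v.
recycled <- u + v
recycled                              # 11 22 33 14 25 36 17 28 39

# The same computation with the recycling written out by hand.
explicit <- rep(u, length.out = length(v)) + v
identical(recycled, explicit)         # TRUE

# When the lengths do not divide evenly, R still recycles but warns.
uneven <- c(1, 2, 3) + c(1, 2)        # warning about object lengths
uneven                                # 2 4 4
```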
We retrieve values from a vector by placing an index inside the single square bracket [] operator.
Vector indexes are \(1\)-based.
s <- c("aa", "bb", "cc", "dd", "ee")
s[3]
Unlike Python, if the index is negative, it will remove the member whose position has the same absolute value as the negative index.
It really does mean subtraction!
For example, the following creates a vector slice with the third member removed.
s[-3]
Values for out-of-range indexes are reported as NA.
s[10]
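A few more edge cases of numeric indexing are sketched below, using the same five-element character vector. The variable names are chosen just for this example.

```r
s <- c("aa", "bb", "cc", "dd", "ee")

# Several negative indexes drop several members at once.
dropped <- s[-c(1, 2)]      # "cc" "dd" "ee"
s[-(1:2)]                   # same result, using a sequence

# An out-of-range index yields NA rather than an error.
missing_value <- s[10]
is.na(missing_value)        # TRUE

# Index 0 selects nothing and returns an empty vector.
length(s[0])                # 0

# Positive and negative indexes cannot be mixed in one call;
# an expression such as s[c(-1, 2)] raises an error.
```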
A new vector can be sliced from a given vector with a numeric vector passed to the indexing operator.
Index vectors consist of member positions of the original vector to be retrieved.
Here we see how to retrieve a vector slice containing the second and third members of a given vector s.
s <- c("aa", "bb", "cc", "dd", "ee")
s[c(2, 3)]
The index vector allows duplicate values. Hence the following retrieves a member twice in one operation.
s[c(2, 3, 3)]
The index vector can even be out-of-order. Here is a vector slice with the order of first and second members reversed.
s[c(2, 1, 3)]
In Python, we called this “fancy indexing.”
To produce a vector slice between two indexes, we can use the colon sequence operator :.
This can be convenient for situations involving large vectors.
s[2:4]
A new vector can be sliced from a given vector with a logical index vector.
The logical vector must be the same length as the original vector.
Its members are TRUE if the corresponding members in the original vector are to be included in the slice, and FALSE otherwise.
This is what we called boolean indexing and masking in Python.
For example, consider the following vector s of length \(5\).
s <- c("aa", "bb", "cc", "dd", "ee")
To retrieve the second and fourth members of s, we define a logical vector L of the same length, and have its second and fourth members set as TRUE.
L = c(FALSE, TRUE, FALSE, TRUE, FALSE)
s[L]
The code can be abbreviated into a single line.
s[c(FALSE, TRUE, FALSE, TRUE, FALSE)]
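In practice the logical vector is rarely typed by hand; it usually comes from a comparison on the vector itself. The sketch below assumes a made-up numeric vector called scores.

```r
scores <- c(4, 9, 1, 7, 3)

# A comparison returns a logical vector of the same length.
mask <- scores > 3
mask                                   # TRUE TRUE FALSE TRUE FALSE

# Using the mask as an index keeps only the TRUE positions.
high <- scores[mask]                   # 4 9 7

# Conditions can be combined with & (and) and | (or).
middle <- scores[scores > 3 & scores < 9]   # 4 7

# which() converts a mask into the matching positions.
positions <- which(mask)               # 1 2 4

high
middle
positions
```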
names()
We can assign names to vector members, too.
v <- c("Mary", "Sue")
names(v) <- c("First", "Last")
v
Now we can retrieve the first member by name, much like a Python dictionary.
v["First"]
We can also reverse the order with a character string index vector.
v[c("Last", "First")]
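Names can also be supplied when the vector is created. The following sketch shows that shortcut along with a couple of related behaviors; the name "Middle" is deliberately one that does not exist in the vector.

```r
# Names given inline at creation time.
v <- c(First = "Mary", Last = "Sue")
names(v)                   # "First" "Last"

# Single brackets keep the name attached to the value.
v["Last"]

# Double brackets return the bare value without its name.
bare <- v[["Last"]]        # "Sue"

# unname() strips all names from the vector.
plain <- unname(v)         # "Mary" "Sue"

# Asking for a name that is not present gives NA rather than an error.
absent <- v["Middle"]
is.na(absent)
```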
A list is a generic vector containing other objects.
This is close to a Python list.
The following variable x is a list containing copies of three vectors n, s, b, and a numeric value \(3\).
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc", "dd", "ee")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
x <- list(n, s, b, 3) # x contains copies of n, s, b
x
We can call the print function to show how R represents these values internally.
print(x)
[[1]]
[1] 2 3 5
[[2]]
[1] "aa" "bb" "cc" "dd" "ee"
[[3]]
[1] TRUE FALSE TRUE FALSE FALSE
[[4]]
[1] 3
Note that odd bracket notation.
It indicates that each list member contains a vector, even if the length of the vector is \(1\).
We retrieve a list slice with the single square bracket [] operator.
The following is a slice containing the second member of x, which is a copy of s.
x[2]
With an index vector, we can retrieve a slice with multiple members.
Here is a slice containing the second and fourth members of x.
x[c(2, 4)]
[[]]
To reference a list member directly, we use the double square bracket [[]] operator.
The following object x[[2]] is the second member of x.
In other words, x[[2]] is a true copy of s, not a slice containing s or its copy.
print(x[2])
[[1]]
[1] "aa" "bb" "cc" "dd" "ee"
print(x[[2]])
[1] "aa" "bb" "cc" "dd" "ee"
We can modify its content directly.
x[[2]][1] <- "ta"
print(x[2])
[[1]]
[1] "ta" "bb" "cc" "dd" "ee"
And s is unaffected.
print(s)
[1] "aa" "bb" "cc" "dd" "ee"
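The difference between the single and double bracket operators is easiest to see by checking what each one returns. The sketch below uses a hypothetical named list called person, which also shows the $ shortcut for access by name.

```r
person <- list(name = "Ann", age = 30, tags = c("a", "b"))

# Single brackets return a list (a slice), even for one member.
class(person["tags"])       # "list"
length(person["tags"])      # 1

# Double brackets return the member itself.
class(person[["tags"]])     # "character"
length(person[["tags"]])    # 2

# For named members, $ is shorthand for [[ ]] with a name.
person$age                  # 30
identical(person$age, person[["age"]])   # TRUE

# Assigning to a new name grows the list on the fly.
person$city <- "Oslo"
names(person)               # "name" "age" "tags" "city"
```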
Note that this notation is opposite to Python’s syntax in NumPy and Pandas.
In Python, a double bracket containing a single value passed to a DataFrame would return a DataFrame, whereas a single bracket would return a Series.
This is because the double bracket means a list is being used to select data, whereas a single bracket means a scalar is being used.
df[[x]] # Returns a one column DataFrame
df[x] # Returns a Series
Finally, let’s take a quick look at categorical data.
Categorical data are called “factors” in R.
Categorical data consist of numbers or strings that form a finite set of possible values.
Examples include scales and disjoint labels:
Although it looks like a data type, a factor is actually a kind of data structure, since it organizes data types, i.e. characters and numbers.
Factors categorize data into levels.
Levels are distinct values, like sets in Python or LOVs in SQL.
Levels are stored alongside the vector in the factor object.
Levels constrain what can be added to the factor vector.
Levels are always characters, even when the data are numeric or boolean.
Factors are created with factor(), which takes a vector as input.
They are analogous to categories in Pandas.
Let’s look at an example.
Take a vector of integers.
v1 <- c(1,5,6,9,4,3,5,8,7,6,3,0,0,0,1,2,3,6,4,5,7,9)
Notice that it has no levels associated with it.
levels(v1)
NULL
We may convert the vector into a factor, like so:
f1 = factor(v1)
Now see that it has extracted a distinct list of items and converted them to strings.
print(levels(f1))
[1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"
Printing the factor shows that the object contains two structures:
print(f1)
[1] 1 5 6 9 4 3 5 8 7 6 3 0 0 0 1 2 3 6 4 5 7 9
Levels: 0 1 2 3 4 5 6 7 8 9
The object has properties accessible by functions.
print(nlevels(f1))
[1] 10
print(levels(f1))
[1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"
Note that levels act as a constraint on the factor vector.
If we try to assign a value that is not in the distinct list of levels, we get a warning and the element becomes NA.
f1[5] <- 20
Warning message in `[<-.factor`(`*tmp*`, 5, value = 20):
“invalid factor level, NA generated”
Note that we also blow away the original value!
print(f1)
[1] 1 5 6 9 <NA> 3 5 8 7 6 3 0 0 0 1
[16] 2 3 6 4 5 7 9
Levels: 0 1 2 3 4 5 6 7 8 9
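One way to add a new value without losing data is to extend the levels first. The sketch below uses a shorter made-up vector than the one above, and also shows the difference between a factor's integer codes and its labels.

```r
v1 <- c(1, 5, 6, 9, 4)
f1 <- factor(v1)
levels(f1)                              # "1" "4" "5" "6" "9"

# Extend the set of allowed levels, then assign the new value.
levels(f1) <- c(levels(f1), "20")
f1[5] <- "20"                           # no warning this time

labels_now <- as.character(f1)          # "1" "5" "6" "9" "20"

# as.integer() returns the internal level codes, not the original numbers.
codes <- as.integer(f1)                 # 1 3 4 5 6

# To recover the numbers, convert through character first.
values <- as.numeric(as.character(f1))  # 1 5 6 9 20

labels_now
codes
values
```

Converting a factor of numbers with as.integer() alone is a common source of bugs, because the codes only record each value's position in the level list.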