Programming for Data Science
In this notebook, we will dive deeper into Pandas.
We will look at some ways to transform, manipulate, and combine data in the process of conducting data analysis.
There is no way to cover all of Pandas, of course, so the goal here is not to have you memorize each function and method.
Rather, you should get a sense of how Pandas works and of the patterns by which certain problems are solved.
The best way to acquire competency in the use of Pandas, or any programming language or library, is to use it and have good documentation by your side.
Setting Up
Let’s work with the Iris dataset again.
import numpy as np
import pandas as pd
import seaborn as sns
iris = sns.load_dataset('iris')
iris.head()
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
Let’s configure the index once we load the dataset.
iris.index.name = 'obs_id'
iris = iris.reset_index().set_index(['species', 'obs_id'])
iris.head()
                sepal_length  sepal_width  petal_length  petal_width
species obs_id
setosa  0                5.1          3.5           1.4          0.2
        1                4.9          3.0           1.4          0.2
        2                4.7          3.2           1.3          0.2
        3                4.6          3.1           1.5          0.2
        4                5.0          3.6           1.4          0.2
You may wonder why we used .reset_index() in the previous line of code.
This method pops the index columns into the data column space, leaving the default sequence of offsets as the index.
iris.reset_index().head()
  species  obs_id  sepal_length  sepal_width  petal_length  petal_width
0  setosa       0           5.1          3.5           1.4          0.2
1  setosa       1           4.9          3.0           1.4          0.2
2  setosa       2           4.7          3.2           1.3          0.2
3  setosa       3           4.6          3.1           1.5          0.2
4  setosa       4           5.0          3.6           1.4          0.2
We do this so that we can use .set_index(), which converts data columns into index columns.
Getting Statistics by Axis
We saw that we can apply statistical methods, such as .mean(), to Series data.
We can also apply them to whole DataFrames, but we need to tell Pandas just how to apply the method.
If we use a statistical method on a DataFrame, Pandas defaults to performing the operation over rows for each column Series:
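iris.mean()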
sepal_length 5.843333
sepal_width 3.057333
petal_length 3.758000
petal_width 1.199333
dtype: float64
We can specify this explicitly with the axis parameter:
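iris.mean(axis=0)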
sepal_length 5.843333
sepal_width 3.057333
petal_length 3.758000
petal_width 1.199333
dtype: float64
If we wanted the applied method to get the mean over columns for each row, we’d set axis to \(1\):
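iris.mean(axis=1)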
species    obs_id
setosa     0         2.550
           1         2.375
           2         2.350
           3         2.350
           4         2.550
                      ...
virginica  145       4.300
           146       3.925
           147       4.175
           148       4.325
           149       3.950
Length: 150, dtype: float64
If we want the statistic over all the data, irrespective of row or column position, we set axis to None.
In effect, we are reshaping the \(2\)-D DataFrame into a \(1\)-D Series and getting the mean.
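For example, both of the following return the grand mean (a sketch; depending on your Pandas version, axis=None may raise a deprecation warning, in which case the explicit reshape does the same thing):
iris.mean(axis=None)
iris.stack().mean()  # reshape to a 1-D Series, then take its mean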
Applying Custom Functions with .apply()
Beyond the vectorized functions provided by Pandas, such as .mean(), we can apply custom functions to our DataFrame.
You can use regular functions here, but lambda functions are particularly appropriate.
The .apply() method should be used after you have established that you can’t use an existing vectorized function.
Here we create a new feature of the square of the sepal length.
iris['sepal_len_sq'] = iris.sepal_length.apply(lambda x: x**2)
iris.head(5)
                sepal_length  sepal_width  petal_length  petal_width  sepal_len_sq
species obs_id
setosa  0                5.1          3.5           1.4          0.2         26.01
        1                4.9          3.0           1.4          0.2         24.01
        2                4.7          3.2           1.3          0.2         22.09
        3                4.6          3.1           1.5          0.2         21.16
        4                5.0          3.6           1.4          0.2         25.00
To apply a function using multiple columns for each row operation, set axis=1.
Here, we compute the average of sepal_length and sepal_width:
iris['sepal_len_wid_avg'] = iris[['sepal_length','sepal_width']]\
    .apply(lambda x: (x.sepal_length + x.sepal_width) / 2, axis=1)
iris.head()
                sepal_length  sepal_width  petal_length  petal_width  sepal_len_sq  sepal_len_wid_avg
species obs_id
setosa  0                5.1          3.5           1.4          0.2         26.01               4.30
        1                4.9          3.0           1.4          0.2         24.01               3.95
        2                4.7          3.2           1.3          0.2         22.09               3.95
        3                4.6          3.1           1.5          0.2         21.16               3.85
        4                5.0          3.6           1.4          0.2         25.00               4.30
Note the use of the continuation operator \ so we can write out code over multiple lines.
Note also that the argument x to the lambda function stands for the given row to which it is being applied:
lambda x: (x.sepal_length + x.sepal_width) / 2, axis=1
It’s as if we are iterating through the DataFrame rows and using the function in the body of the loop.
In fact, Pandas lets you do this with such methods as .iterrows() and .itertuples(), like so:
[(x.sepal_length + x.sepal_width) / 2 for x in iris.itertuples()][:5]
[4.3, 3.95, 3.95, 3.8499999999999996, 4.3]
And, in this case, it’s faster.
%timeit [(x.sepal_length + x.sepal_width) / 2 for x in iris.itertuples()]
%timeit iris.apply(lambda x: (x.sepal_length + x.sepal_width) / 2, axis=1)
367 µs ± 1.76 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
1.53 ms ± 1.86 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Using Vectorized Operations
Whenever possible, you should see if your .apply() method call can be replaced by a simple vectorized operation.
These are typically much faster.
For example, we could achieve the same results as the above use of .apply() to square a feature.
Here are two ways to do this, along with the apply method for comparison:
%timeit iris['sepal_len_sq_2'] = iris.sepal_length**2
%timeit iris['sepal_len_sq_3'] = np.square(iris.sepal_length)
106 µs ± 617 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
104 µs ± 538 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
                 sepal_length  sepal_width  petal_length  petal_width  sepal_len_sq  sepal_len_wid_avg  sepal_len_sq_2  sepal_len_sq_3
species   obs_id
setosa    0               5.1          3.5           1.4          0.2         26.01               4.30           26.01           26.01
          1               4.9          3.0           1.4          0.2         24.01               3.95           24.01           24.01
          2               4.7          3.2           1.3          0.2         22.09               3.95           22.09           22.09
          3               4.6          3.1           1.5          0.2         21.16               3.85           21.16           21.16
          4               5.0          3.6           1.4          0.2         25.00               4.30           25.00           25.00
...                       ...          ...           ...          ...           ...                ...             ...             ...
virginica 145             6.7          3.0           5.2          2.3         44.89               4.85           44.89           44.89
          146             6.3          2.5           5.0          1.9         39.69               4.40           39.69           39.69
          147             6.5          3.0           5.2          2.0         42.25               4.75           42.25           42.25
          148             6.2          3.4           5.4          2.3         38.44               4.80           38.44           38.44
          149             5.9          3.0           5.1          1.8         34.81               4.45           34.81           34.81

150 rows × 8 columns
We can see that both versions are faster than using .apply().
%timeit iris['sepal_len_sq'] = iris.sepal_length.apply(lambda x: x**2)
189 µs ± 252 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Aggregation by Groups
Pandas offers a powerful set of tools to apply aggregate statistical functions to subsets of data.
The pattern is as follows:
Group, or “split,” the data by the distinctive values in a column or set of columns.
Apply a statistic to each group.
Combine the results in a table where each row stands for one of those unique values and the feature is the aggregate statistic.
This pattern is called split-apply-combine, which we will discuss more when we get to R.
.groupby()
The DataFrame method we use for this pattern is .groupby().
This method takes one or more column names, including index columns, and divides the table into separate groups based on the unique combinations of the values in these columns.
Then a statistical method, such as .mean(), is called.
Here we compute the mean of each feature grouped by species:
iris.groupby("species" ).mean()
            sepal_length  sepal_width  petal_length  petal_width  sepal_len_sq  sepal_len_wid_avg  sepal_len_sq_2  sepal_len_sq_3
species
setosa             5.006        3.428         1.462        0.246       25.1818              4.217         25.1818         25.1818
versicolor         5.936        2.770         4.260        1.326       35.4972              4.353         35.4972         35.4972
virginica          6.588        2.974         5.552        2.026       43.7980              4.781         43.7980         43.7980
We can also pass function names to the .agg() method:
iris.groupby("species" ).agg(['mean' , 'median' , 'sum' , 'max' , 'min' ]).T.head(10 )
species               setosa  versicolor  virginica
sepal_length mean      5.006       5.936      6.588
             median    5.000       5.900      6.500
             sum     250.300     296.800    329.400
             max       5.800       7.000      7.900
             min       4.300       4.900      4.900
sepal_width  mean      3.428       2.770      2.974
             median    3.400       2.800      3.000
             sum     171.400     138.500    148.700
             max       4.400       3.400      3.800
             min       2.300       2.000      2.200
pd.pivot_table()
We can also use the pivot table function and apply a function, aggfunc, to a selected value column, grouped by columns.
Here we compute mean sepal length for each species:
pd.pivot_table(iris, values="sepal_length", columns=["species"], aggfunc=np.mean)
species       setosa  versicolor  virginica
sepal_length   5.006       5.936      6.588
That produces the same result as this:
iris[['sepal_length']].groupby("species").mean().T
species       setosa  versicolor  virginica
sepal_length   5.006       5.936      6.588
Stacking and Unstacking
.unstack()
Another way to achieve these results, but by taking advantage of indexes, is unstacking.
Let’s look at what .unstack() does with another dataset from Seaborn’s collection.
attention = sns.load_dataset('attention' )
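Here’s a random sample of ten rows:
attention.sample(10)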
    Unnamed: 0  subject attention  solutions  score
55          55       16   focused          3    7.0
39          39       20   focused          2    6.0
26          26        7   divided          2    4.5
46          46        7   divided          3    6.0
18          18       19   focused          1    5.0
25          25        6   divided          2    5.0
21          21        2   divided          2    4.0
33          33       14   focused          2    8.0
22          22        3   divided          2    5.0
31          31       12   focused          2    9.0
This dataframe appears to record the results of an experiment on human attention.
Each row is a trial or observation in that experiment.
An analysis of the columns shows that score is a measured outcome, and that the subjects are probably individuals in a comparative study with two groups, those with attention divided and those with attention focused, who are subjected to three different solutions applied to the performance of some task.
Unnamed: 0 is just the row number as index.
The purpose of the test performed in each trial seems to be to see which solutions are best at overcoming divided attention in the performance of those tasks.
Let’s restructure our data to reflect these assumptions.
attention1 = attention.set_index(['attention','solutions','subject']).sort_index().drop('Unnamed: 0', axis=1)
attention1.sample(10)
                             score
attention solutions subject
focused   1         15         8.0
          2         11         5.0
                    16         8.0
divided   3         3          6.0
focused   3         17         6.0
                    12         8.0
                    16         7.0
divided   3         4          5.0
          1         2          3.0
focused   2         18         8.0
We can use .unstack() to provide a nice, synoptic view of these data.
The operation takes the right-most index column and projects it onto the column axis.
Note that it requires a MultiIndex of at least two index columns.
attention2 = attention1.unstack()
attention2.fillna('-')
subject                   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
attention solutions
divided   1             2.0  3.0  3.0  5.0  4.0  5.0  5.0  5.0  2.0  6.0    -    -    -    -    -    -    -    -    -    -
          2             4.0  4.0  5.0  7.0  5.0  5.0  4.5  7.0  3.0  5.0    -    -    -    -    -    -    -    -    -    -
          3             7.0  5.0  6.0  5.0  8.0  6.0  6.0  8.0  7.0  6.0    -    -    -    -    -    -    -    -    -    -
focused   1               -    -    -    -    -    -    -    -    -    -  6.0  8.0  6.0  8.0  8.0  6.0  7.0  7.0  5.0  6.0
          2               -    -    -    -    -    -    -    -    -    -  5.0  9.0  5.0  8.0  8.0  8.0  7.0  8.0  6.0  6.0
          3               -    -    -    -    -    -    -    -    -    -  6.0  8.0  9.0  7.0  7.0  7.0  6.0  6.0  6.0  5.0
Right away we can see that participants were matched to attention mode.
To complete the pattern, we can apply .mean() to the unstacked DataFrame.
ATTENTION = attention2.mean(axis=1).to_frame('score')
ATTENTION
                     score
attention solutions
divided   1           4.00
          2           4.95
          3           6.40
focused   1           6.70
          2           7.00
          3           6.70
We can apply .unstack() again to show the results more compactly:
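ATTENTION.unstack()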
solutions    1     2    3
divided    4.0  4.95  6.4
focused    6.7  7.00  6.7
It appears that solution 3 performed well in overcoming divided attention.
.stack()
Stack is the opposite of .unstack(), of course.
It will project column labels onto the values of a single column.
Let’s look at this with the taxis dataset.
taxis = sns.load_dataset('taxis' )
taxis.head()
                pickup              dropoff  passengers  distance  fare   tip  tolls  total   color      payment            pickup_zone           dropoff_zone pickup_borough dropoff_borough
0  2019-03-23 20:21:09  2019-03-23 20:27:24           1      1.60   7.0  2.15    0.0  12.95  yellow  credit card        Lenox Hill West    UN/Turtle Bay South      Manhattan       Manhattan
1  2019-03-04 16:11:55  2019-03-04 16:19:00           1      0.79   5.0  0.00    0.0   9.30  yellow         cash  Upper West Side South  Upper West Side South      Manhattan       Manhattan
2  2019-03-27 17:53:01  2019-03-27 18:00:25           1      1.37   7.5  2.36    0.0  14.16  yellow  credit card          Alphabet City           West Village      Manhattan       Manhattan
3  2019-03-10 01:23:59  2019-03-10 01:49:51           1      7.70  27.0  6.15    0.0  36.95  yellow  credit card              Hudson Sq         Yorkville West      Manhattan       Manhattan
4  2019-03-30 13:27:42  2019-03-30 13:37:14           3      2.16   9.0  1.10    0.0  13.40  yellow  credit card           Midtown East         Yorkville West      Manhattan       Manhattan
Let’s define our indexes to reflect the structure of the data.
We define the index, sort it, and then use it to perform a stacking operation.
taxis1 = taxis.set_index(['pickup','dropoff']).sort_index().stack().to_frame('val')
taxis1.index.names = ['pickup','dropoff','field']
taxis1.sample(10)
                                                                            val
pickup               dropoff              field
2019-03-09 19:47:10  2019-03-09 20:09:17  dropoff_borough              Manhattan
2019-03-20 23:19:55  2019-03-20 23:46:00  distance                          4.82
2019-03-03 07:48:48  2019-03-03 07:53:53  distance                          1.57
2019-03-04 08:30:51  2019-03-04 08:40:54  pickup_zone      Upper East Side South
2019-03-16 11:39:52  2019-03-16 11:50:36  distance                           1.8
2019-03-03 04:11:09  2019-03-03 04:17:14  distance                           0.8
2019-03-18 16:03:14  2019-03-18 16:10:36  pickup_borough               Manhattan
2019-03-31 12:03:39  2019-03-31 12:12:51  pickup_zone                  Chinatown
2019-03-12 16:00:34  2019-03-12 16:06:28  passengers                           1
2019-03-05 21:39:03  2019-03-05 21:49:12  pickup_zone               Clinton East
Here is the data for one observation:
taxis1.loc['2019-02-28 23:29:03']
                                                                  val
dropoff              field
2019-02-28 23:32:35  passengers                                     1
                     distance                                     0.9
                     fare                                          5.0
                     tip                                           0.0
                     tolls                                         0.0
                     total                                         6.3
                     color                                       green
                     payment                                      cash
                     pickup_zone                           Old Astoria
                     dropoff_zone         Long Island City/Queens Plaza
                     pickup_borough                             Queens
                     dropoff_borough                            Queens
Combining DataFrames
There are at least \(3\) ways to combine DataFrames in Pandas: by concatenating, merging, or joining.
pd.concat()
To concatenate two DataFrames, we use pd.concat().
Here’s an example.
df1 = pd.DataFrame(np.random.randn(3, 4))
df2 = pd.DataFrame(np.random.randn(3, 4))
          0         1         2         3
0  0.361659  0.802342 -0.986953  0.350928
1  0.258543 -0.106647  1.467904  1.238903
2  2.896572  0.739011  0.705529 -0.397988

          0         1         2         3
0  0.023824  1.013029  0.200236  0.274996
1  2.256482  0.982360  0.933824  0.774376
2 -0.203034  1.013550 -0.812464 -2.777192
The Pandas function takes a list of DataFrames to combine, an optional keys argument to create a MultiIndex, and an axis parameter to choose between combining by rows or columns.
To combine by rows, set axis to \(0\).
df3 = pd.concat([df1, df2], keys=['a','b'], axis=0)
df3
            0         1         2         3
a 0  0.361659  0.802342 -0.986953  0.350928
  1  0.258543 -0.106647  1.467904  1.238903
  2  2.896572  0.739011  0.705529 -0.397988
b 0  0.023824  1.013029  0.200236  0.274996
  1  2.256482  0.982360  0.933824  0.774376
  2 -0.203034  1.013550 -0.812464 -2.777192
We set keys because the indexes of the two source DataFrames are not unique.
Here we use the key to get back the second DataFrame.
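df3.loc['b']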
          0         1         2         3
0  0.023824  1.013029  0.200236  0.274996
1  2.256482  0.982360  0.933824  0.774376
2 -0.203034  1.013550 -0.812464 -2.777192
To combine by columns, we set axis to \(1\).
This assumes that the two DataFrames share an index, i.e. that the indexes represent the same events.
df4 = pd.concat([df1, df2], axis=1, keys=['a', 'b'])
df4
          a                                         b
          0         1         2         3           0         1         2         3
0  0.361659  0.802342 -0.986953  0.350928    0.023824  1.013029  0.200236  0.274996
1  0.258543 -0.106647  1.467904  1.238903    2.256482  0.982360  0.933824  0.774376
2  2.896572  0.739011  0.705529 -0.397988   -0.203034  1.013550 -0.812464 -2.777192
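As before, we can use the key to get back the second DataFrame, this time from the column axis:
df4['b']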
          0         1         2         3
0  0.023824  1.013029  0.200236  0.274996
1  2.256482  0.982360  0.933824  0.774376
2 -0.203034  1.013550 -0.812464 -2.777192
.merge()
Another way to combine DataFrames is with the Pandas function pd.merge().
It works similarly to a SQL join.
This function takes two DataFrames as its first and second arguments.
The on parameter specifies the columns on which to join.
The how parameter specifies the type of merge, i.e. left, right, outer, inner, or cross.
Here we create two tables, left and right.
We then right join them on key.
A right join means we include all records from the table on the right; the key is used for matching up the records.
left = pd.DataFrame({"key": ["jamie", "bill"], "lval": [15, 22]})
right = pd.DataFrame({"key": ["jamie", "bill", "asher"], "rval": [4, 5, 8]})
right
     key  rval
0  jamie     4
1   bill     5
2  asher     8
merged = pd.merge(left, right, on="key", how="right")
merged
     key  lval  rval
0  jamie  15.0     4
1   bill  22.0     5
2  asher   NaN     8
Notice the NaN inserted into the record with key='asher', since the left table didn’t contain that key.
In this next example, the value columns have the same name: val. Notice what happens to the column names.
left = pd.DataFrame({"key": ["jamie", "bill"], "val": [15, 22]})
right = pd.DataFrame({"key": ["jamie", "bill", "asher"], "val": [4, 5, 8]})
merged = pd.merge(left, right, on="key", how="right")
     key  val
0  jamie    4
1   bill    5
2  asher    8

     key  val_x  val_y
0  jamie   15.0      4
1   bill   22.0      5
2  asher    NaN      8
.join()
As the name implies, .join() is also a SQL-like joiner, but it takes advantage of indexes.
It assumes the DataFrames to be joined share index values.
Here we redefine our DataFrames with indexes.
left2 = left.set_index('key').copy()
right2 = right.set_index('key').copy()
right2
       val
key
jamie    4
bill     5
asher    8
Now we just use the method, and Pandas assumes the indexes are shared.
We define a right suffix rsuffix to ensure the column names are unique.
right2.join(left2, rsuffix='_r')
       val  val_r
key
jamie    4   15.0
bill     5   22.0
asher    8    NaN
This is an inner join, which keeps only the keys present in both tables:
right2.join(left2, how='inner', rsuffix='_r')
Things to Pay Attention To
Use join if you have shared indexes.
Use merge if you do not have shared indexes.
Use concat to combine based on shared indexes or columns.
Pay attention to the resulting DataFrame indexes and column names.
Categoricals
It is often necessary to convert categorical data into numeric form.
Categorical data are data in which the values are members of a set, such as the species names in the Iris data set.
In machine learning, we often want to project these onto the column axis where each row can only contain one True value.
pd.get_dummies()
This function will project the list of values in a column onto the column axis, using the distinct values as the column names.
This is similar to .unstack(), except that the values for each of the new columns are boolean.
This is also called “one-hot encoding,” because only one feature can be “hot,” i.e. True, for each row.
Here are some important parameters for the function:
prefix: append a prefix to the column names (a good idea for later use)
drop_first: remove the first level, as only k-1 variables are needed to represent k levels
Let’s look at an example.
Here we define a little DataFrame of cats, with a feature for breed.
cats = pd.DataFrame({'breed': ['persian', 'persian', 'siamese', 'himalayan', 'burmese']})
cats
       breed
0    persian
1    persian
2    siamese
3  himalayan
4    burmese
To convert to one-hot encoded form, we can do the following:
dummy_cats = pd.get_dummies(cats.breed, prefix='breed')
dummy_cats
   breed_burmese  breed_himalayan  breed_persian  breed_siamese
0          False            False           True          False
1          False            False           True          False
2          False            False          False           True
3          False             True          False          False
4           True            False          False          False
We can drop the first column, since it can be inferred.
This is done to improve the efficiency of training a model.
pd.get_dummies(cats.breed, drop_first=True, prefix='breed')
   breed_himalayan  breed_persian  breed_siamese
0            False           True          False
1            False           True          False
2            False          False           True
3             True          False          False
4            False          False          False
Notice burmese was dropped (the first level by alphabet), since it can be inferred.