NB: Deeper Into Pandas

Programming for Data Science

In this notebook, we will dive deeper into Pandas.

We will look at some ways to transform, manipulate, and combine data in the process of conducting data analysis.

There is no way to cover all of Pandas, of course, so the goal here is not to have you memorize each function and method.

Rather, you should come away with a sense of how Pandas works and the patterns by which certain problems are solved.

The best way to acquire competency in the use of Pandas, or any programming language or library, is to use it and have good documentation by your side.

Setting Up

Let’s work with the Iris dataset again.

import numpy as np
import pandas as pd
import seaborn as sns
iris = sns.load_dataset('iris')
iris.head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

Let’s configure the index once we load the dataset.

iris.index.name = 'obs_id'
iris = iris.reset_index().set_index(['species', 'obs_id'])
iris.head(5)
sepal_length sepal_width petal_length petal_width
species obs_id
setosa 0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2

You may wonder why we used .reset_index() in the previous line of code.

This method pops the index columns into the data column space, leaving the default sequence of offsets as the index.

iris.reset_index().head()
species obs_id sepal_length sepal_width petal_length petal_width
0 setosa 0 5.1 3.5 1.4 0.2
1 setosa 1 4.9 3.0 1.4 0.2
2 setosa 2 4.7 3.2 1.3 0.2
3 setosa 3 4.6 3.1 1.5 0.2
4 setosa 4 5.0 3.6 1.4 0.2

We do this so that we can use .set_index() which converts data columns into index columns.
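To see how the two methods relate, here is a minimal sketch (using a throwaway variable tmp) in which .reset_index() and .set_index() undo each other:

tmp = iris.reset_index()                    # 'species' and 'obs_id' become data columns again
tmp = tmp.set_index(['species', 'obs_id'])  # ... and .set_index() moves them back into the index
tmp.head()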

Getting Statistics by Axis

We saw that we can apply statistical methods, such as .mean(), to Series data.

We can also apply them to whole DataFrames. But we need to tell Pandas just how to apply the method.

If we use a statistical method on a DataFrame, Pandas defaults to performing the operation over rows for each column Series:

iris.mean()
sepal_length    5.843333
sepal_width     3.057333
petal_length    3.758000
petal_width     1.199333
dtype: float64

We can specify this explicitly with the axis parameter:

iris.mean(axis=0)
sepal_length    5.843333
sepal_width     3.057333
petal_length    3.758000
petal_width     1.199333
dtype: float64

If we wanted the applied method to get the mean over columns for each row, we’d set axis to \(1\).

iris.mean(axis=1)
species    obs_id
setosa     0         2.550
           1         2.375
           2         2.350
           3         2.350
           4         2.550
                     ...  
virginica  145       4.300
           146       3.925
           147       4.175
           148       4.325
           149       3.950
Length: 150, dtype: float64

If we want the statistic over all the data, irrespective of row or column position, we set axis to None.

In effect, we are reshaping the \(2\)-D DataFrame into a \(1\)-D Series and getting the mean.

iris.mean(axis=None)
3.4644999999999997
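Equivalently, we could flatten the values ourselves; this sketch (at this point iris has only the four numeric measurement columns) should produce the same number:

iris.to_numpy().mean()   # flatten all values into one array and take the overall mean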

Applying Custom Functions with .apply()

Beyond the vectorized functions provided by Pandas, such as .mean(), we can apply custom functions to our DataFrame.

You can use regular functions here, but lambda functions are particularly appropriate.

The apply() method should be used after you have established that you can’t use an existing vectorized function.

Here we create a new feature of the square of the sepal length.

iris['sepal_len_sq'] = iris.sepal_length.apply(lambda x: x**2)
iris.head(5)
sepal_length sepal_width petal_length petal_width sepal_len_sq
species obs_id
setosa 0 5.1 3.5 1.4 0.2 26.01
1 4.9 3.0 1.4 0.2 24.01
2 4.7 3.2 1.3 0.2 22.09
3 4.6 3.1 1.5 0.2 21.16
4 5.0 3.6 1.4 0.2 25.00
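A named function works just as well as a lambda here; this is a minimal sketch of the same operation:

def square(x):
    return x ** 2

iris['sepal_len_sq'] = iris.sepal_length.apply(square)   # same result as the lambda above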

To apply a function using multiple columns for each row operation, set axis=1.

Here, we compute the average of sepal_length and sepal_width:

iris['sepal_len_wid_avg'] = iris[['sepal_length','sepal_width']]\
    .apply(lambda x: (x.sepal_length + x.sepal_width) / 2, axis=1)
iris.head()
sepal_length sepal_width petal_length petal_width sepal_len_sq sepal_len_wid_avg
species obs_id
setosa 0 5.1 3.5 1.4 0.2 26.01 4.30
1 4.9 3.0 1.4 0.2 24.01 3.95
2 4.7 3.2 1.3 0.2 22.09 3.95
3 4.6 3.1 1.5 0.2 21.16 3.85
4 5.0 3.6 1.4 0.2 25.00 4.30

Note the use of the line-continuation character \ so we can write the code over multiple lines.

Note also that the argument x to the lambda function stands for the given row to which it is being applied.

lambda x: (x.sepal_length + x.sepal_width) / 2, axis=1

It’s as if we are iterating through the DataFrame rows and using the function in the body of the loop.

In fact, Pandas lets you do this with such methods as .iterrows() and .itertuples(), like so:

[(x.sepal_length + x.sepal_width) / 2 for x in iris.itertuples()][:5]
[4.3, 3.95, 3.95, 3.8499999999999996, 4.3]

And, in this case, it’s faster.

%timeit [(x.sepal_length + x.sepal_width) / 2 for x in iris.itertuples()]
%timeit iris.apply(lambda x: (x.sepal_length + x.sepal_width) / 2, axis=1)
367 µs ± 1.76 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
1.53 ms ± 1.86 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
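For completeness, .iterrows() follows the same pattern, though it yields (index, row) pairs where each row is a Series, and it is typically slower than .itertuples(). A sketch:

[(row.sepal_length + row.sepal_width) / 2 for idx, row in iris.iterrows()][:5]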

Using Vectorized Operations

Whenever possible, you should see if your .apply() method call can be replaced by a simple operation.

These are typically much faster.

For example, we can achieve the same result as the earlier use of .apply() to square a feature.

Here are two ways to do this; we compare them with the .apply() version below:

%timeit iris['sepal_len_sq_2'] = iris.sepal_length**2
%timeit iris['sepal_len_sq_3'] = np.square(iris.sepal_length)
106 µs ± 617 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
104 µs ± 538 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
iris
sepal_length sepal_width petal_length petal_width sepal_len_sq sepal_len_wid_avg sepal_len_sq_2 sepal_len_sq_3
species obs_id
setosa 0 5.1 3.5 1.4 0.2 26.01 4.30 26.01 26.01
1 4.9 3.0 1.4 0.2 24.01 3.95 24.01 24.01
2 4.7 3.2 1.3 0.2 22.09 3.95 22.09 22.09
3 4.6 3.1 1.5 0.2 21.16 3.85 21.16 21.16
4 5.0 3.6 1.4 0.2 25.00 4.30 25.00 25.00
... ... ... ... ... ... ... ... ... ...
virginica 145 6.7 3.0 5.2 2.3 44.89 4.85 44.89 44.89
146 6.3 2.5 5.0 1.9 39.69 4.40 39.69 39.69
147 6.5 3.0 5.2 2.0 42.25 4.75 42.25 42.25
148 6.2 3.4 5.4 2.3 38.44 4.80 38.44 38.44
149 5.9 3.0 5.1 1.8 34.81 4.45 34.81 34.81

150 rows × 8 columns

We can see that both versions are faster than using .apply().

%timeit iris['sepal_len_sq'] = iris.sepal_length.apply(lambda x: x**2)
189 µs ± 252 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Aggregation by Groups

Pandas offers a powerful set of tools to apply aggregate statistical functions to subsets of data.

The pattern is as follows:

  • Group, or “split,” the data by the distinctive values in a column or set of columns.
  • Apply a statistic to each group.
  • Combine the results in a table where each row stands for one of those unique values and the feature is the aggregate statistic.

This pattern is called split-apply-combine, which we will discuss more when we get to R.

.groupby()

The DataFrame method we use for this pattern is .groupby().

This method takes one or more column names, including index columns, and divides the table into separate groups based on the unique combinations of the values in these columns.

Then a statistical method, such as .mean(), is called.

Here we compute the mean of each feature grouped by species:

iris.groupby("species").mean()
sepal_length sepal_width petal_length petal_width sepal_len_sq sepal_len_wid_avg sepal_len_sq_2 sepal_len_sq_3
species
setosa 5.006 3.428 1.462 0.246 25.1818 4.217 25.1818 25.1818
versicolor 5.936 2.770 4.260 1.326 35.4972 4.353 35.4972 35.4972
virginica 6.588 2.974 5.552 2.026 43.7980 4.781 43.7980 43.7980
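This is exactly the split-apply-combine pattern described above. As a sketch, we could perform the apply and combine steps ourselves by iterating over the groups that .groupby() produces:

# Split into groups, apply a statistic to each, then combine into one Series.
parts = {name: grp.sepal_length.mean() for name, grp in iris.groupby('species')}
pd.Series(parts, name='mean_sepal_length')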

We can also pass function names to the .agg() method:

iris.groupby("species").agg(['mean', 'median', 'sum', 'max', 'min']).T.head(10)
species setosa versicolor virginica
sepal_length mean 5.006 5.936 6.588
median 5.000 5.900 6.500
sum 250.300 296.800 329.400
max 5.800 7.000 7.900
min 4.300 4.900 4.900
sepal_width mean 3.428 2.770 2.974
median 3.400 2.800 3.000
sum 171.400 138.500 148.700
max 4.400 3.400 3.800
min 2.300 2.000 2.200

pd.pivot_table()

We can also use the pivot-table function, applying an aggregation function (aggfunc) to a selected values column, grouped by the given columns.

Here we compute mean sepal length for each species:

pd.pivot_table(iris, values="sepal_length", columns=["species"], aggfunc = np.mean)
species setosa versicolor virginica
sepal_length 5.006 5.936 6.588

That produces the same result as this:

iris[['sepal_length']].groupby("species").mean().T
species setosa versicolor virginica
sepal_length 5.006 5.936 6.588

Stacking and Unstacking

.unstack()

Another way to achieve these results, this time by taking advantage of indexes, is unstacking.

Let’s look at what .unstack() does with another dataset from Seaborn’s collection.

attention = sns.load_dataset('attention')
attention.sample(10)
Unnamed: 0 subject attention solutions score
55 55 16 focused 3 7.0
39 39 20 focused 2 6.0
26 26 7 divided 2 4.5
46 46 7 divided 3 6.0
18 18 19 focused 1 5.0
25 25 6 divided 2 5.0
21 21 2 divided 2 4.0
33 33 14 focused 2 8.0
22 22 3 divided 2 5.0
31 31 12 focused 2 9.0

This dataframe appears to record the results of an experiment on human attention.

Each row is a trial or observation in that experiment.

An analysis of the columns in this dataframe shows that

  • score is a measured outcome,
  • subject identifies individuals in a comparative study in which two groups, those with attention divided and those with attention focused, are subject to three different solutions applied to the performance of some task, and
  • Unnamed: 0 is just the row number as an index.

The purpose of the test performed in each trial seems to be to see which solutions are best at overcoming divided attention in the performance of those tasks.

Let’s restructure our data to reflect these assumptions.

attention1 = attention.set_index(['attention','solutions','subject']).sort_index().drop('Unnamed: 0', axis=1)
attention1.sample(10)
score
attention solutions subject
focused 1 15 8.0
2 11 5.0
16 8.0
divided 3 3 6.0
focused 3 17 6.0
12 8.0
16 7.0
divided 3 4 5.0
1 2 3.0
focused 2 18 8.0

We can use .unstack() to provide a nice, synoptic view of these data.

The operation takes the right-most (innermost) index level and projects it onto the column axis.

Note that it requires a MultiIndex with at least two levels.

attention2 = attention1.unstack()
attention2.fillna('-')
score
subject 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
attention solutions
divided 1 2.0 3.0 3.0 5.0 4.0 5.0 5.0 5.0 2.0 6.0 - - - - - - - - - -
2 4.0 4.0 5.0 7.0 5.0 5.0 4.5 7.0 3.0 5.0 - - - - - - - - - -
3 7.0 5.0 6.0 5.0 8.0 6.0 6.0 8.0 7.0 6.0 - - - - - - - - - -
focused 1 - - - - - - - - - - 6.0 8.0 6.0 8.0 8.0 6.0 7.0 7.0 5.0 6.0
2 - - - - - - - - - - 5.0 9.0 5.0 8.0 8.0 8.0 7.0 8.0 6.0 6.0
3 - - - - - - - - - - 6.0 8.0 9.0 7.0 7.0 7.0 6.0 6.0 6.0 5.0

Right away we can see that participants were matched to attention mode.

To complete the pattern, we can apply .mean() to the unstacked DataFrame.

ATTENTION = attention2.mean(axis=1).to_frame('score')
ATTENTION
score
attention solutions
divided 1 4.00
2 4.95
3 6.40
focused 1 6.70
2 7.00
3 6.70

We can apply .unstack() again to show the results more compactly:

ATTENTION.unstack()
score
solutions 1 2 3
attention
divided 4.0 4.95 6.4
focused 6.7 7.00 6.7

It appears that solution 3 performed well in overcoming divided attention.

.stack()

Stack is the opposite of .unstack(), of course.

It projects the column labels onto the row index, leaving all the values in a single column.
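Here is a tiny sketch of the idea on a toy DataFrame, before we apply it to a larger dataset:

small = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
small.stack()   # a Series indexed by (row, column-name) pairs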

Let’s look at this with the taxis dataset.

taxis = sns.load_dataset('taxis')
taxis.head()
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
0 2019-03-23 20:21:09 2019-03-23 20:27:24 1 1.60 7.0 2.15 0.0 12.95 yellow credit card Lenox Hill West UN/Turtle Bay South Manhattan Manhattan
1 2019-03-04 16:11:55 2019-03-04 16:19:00 1 0.79 5.0 0.00 0.0 9.30 yellow cash Upper West Side South Upper West Side South Manhattan Manhattan
2 2019-03-27 17:53:01 2019-03-27 18:00:25 1 1.37 7.5 2.36 0.0 14.16 yellow credit card Alphabet City West Village Manhattan Manhattan
3 2019-03-10 01:23:59 2019-03-10 01:49:51 1 7.70 27.0 6.15 0.0 36.95 yellow credit card Hudson Sq Yorkville West Manhattan Manhattan
4 2019-03-30 13:27:42 2019-03-30 13:37:14 3 2.16 9.0 1.10 0.0 13.40 yellow credit card Midtown East Yorkville West Manhattan Manhattan

Let’s define our indexes to reflect the structure of the data.

We define the index, sort it, and then use it to perform a stacking operation.

taxis1 = taxis.set_index(['pickup','dropoff']).sort_index().stack().to_frame('val')
taxis1.index.names = ['pickup','dropoff','field']
taxis1.sample(10)
val
pickup dropoff field
2019-03-09 19:47:10 2019-03-09 20:09:17 dropoff_borough Manhattan
2019-03-20 23:19:55 2019-03-20 23:46:00 distance 4.82
2019-03-03 07:48:48 2019-03-03 07:53:53 distance 1.57
2019-03-04 08:30:51 2019-03-04 08:40:54 pickup_zone Upper East Side South
2019-03-16 11:39:52 2019-03-16 11:50:36 distance 1.8
2019-03-03 04:11:09 2019-03-03 04:17:14 distance 0.8
2019-03-18 16:03:14 2019-03-18 16:10:36 pickup_borough Manhattan
2019-03-31 12:03:39 2019-03-31 12:12:51 pickup_zone Chinatown
2019-03-12 16:00:34 2019-03-12 16:06:28 passengers 1
2019-03-05 21:39:03 2019-03-05 21:49:12 pickup_zone Clinton East

Here is the data for one observation:

taxis1.loc['2019-02-28 23:29:03']
val
dropoff field
2019-02-28 23:32:35 passengers 1
distance 0.9
fare 5.0
tip 0.0
tolls 0.0
total 6.3
color green
payment cash
pickup_zone Old Astoria
dropoff_zone Long Island City/Queens Plaza
pickup_borough Queens
dropoff_borough Queens

Combining DataFrames

There are at least \(3\) ways to combine DataFrames in Pandas: by concatenating, merging, or joining.

pd.concat()

To concatenate two DataFrames, we use pd.concat().

Here’s an example.

df1 = pd.DataFrame(np.random.randn(3, 4))
df2 = pd.DataFrame(np.random.randn(3, 4))
df1
0 1 2 3
0 0.361659 0.802342 -0.986953 0.350928
1 0.258543 -0.106647 1.467904 1.238903
2 2.896572 0.739011 0.705529 -0.397988
df2
0 1 2 3
0 0.023824 1.013029 0.200236 0.274996
1 2.256482 0.982360 0.933824 0.774376
2 -0.203034 1.013550 -0.812464 -2.777192

The Pandas function takes a list of DataFrames to combine, an optional keys argument to create a MultiIndex, and an axis parameter to choose between combining by rows or columns.

To combine by rows, set axis to \(0\).

df3 = pd.concat([df1, df2], keys=['a','b'], axis=0)
df3
0 1 2 3
a 0 0.361659 0.802342 -0.986953 0.350928
1 0.258543 -0.106647 1.467904 1.238903
2 2.896572 0.739011 0.705529 -0.397988
b 0 0.023824 1.013029 0.200236 0.274996
1 2.256482 0.982360 0.933824 0.774376
2 -0.203034 1.013550 -0.812464 -2.777192

We set keys because indexes of the two source DataFrames are not unique.

Here we use the key to get back the second DataFrame.

df3.loc['b']
0 1 2 3
0 0.023824 1.013029 0.200236 0.274996
1 2.256482 0.982360 0.933824 0.774376
2 -0.203034 1.013550 -0.812464 -2.777192

To combine by columns, we set the axis to \(1\).

This assumes that the two DataFrames share an index, i.e. that the indexes represent the same events.

df4 = pd.concat([df1, df2], axis=1, keys=['a', 'b'])
df4
a b
0 1 2 3 0 1 2 3
0 0.361659 0.802342 -0.986953 0.350928 0.023824 1.013029 0.200236 0.274996
1 0.258543 -0.106647 1.467904 1.238903 2.256482 0.982360 0.933824 0.774376
2 2.896572 0.739011 0.705529 -0.397988 -0.203034 1.013550 -0.812464 -2.777192
df4.b
0 1 2 3
0 0.023824 1.013029 0.200236 0.274996
1 2.256482 0.982360 0.933824 0.774376
2 -0.203034 1.013550 -0.812464 -2.777192

.merge()

Another way to combine DataFrames is with the Pandas function pd.merge().

It works similarly to a SQL join.

This function takes two DataFrames as its first and second argument.

The on parameter specifies the columns on which to join.

The how parameter specifies the type of merge, i.e. left, right, outer, inner, or cross.

Here we create two tables, left and right.

We then right join them on key.

A right join includes all records from the table on the right.

The key is used for matching up the records.

left = pd.DataFrame({"key": ["jamie", "bill"], "lval": [15, 22]})
right = pd.DataFrame({"key": ["jamie", "bill", "asher"], "rval": [4, 5, 8]})
left
key lval
0 jamie 15
1 bill 22
right
key rval
0 jamie 4
1 bill 5
2 asher 8
merged = pd.merge(left, right, on="key", how="right")
merged
key lval rval
0 jamie 15.0 4
1 bill 22.0 5
2 asher NaN 8

Notice the NaN inserted into the record with key='asher', since the left table didn’t contain the key.

In this next example, the value columns have the same name: val. Notice what happens to the column names.

left = pd.DataFrame({"key": ["jamie", "bill"], "val": [15, 22]})
right = pd.DataFrame({"key": ["jamie", "bill", "asher"], "val": [4, 5, 8]})
merged = pd.merge(left, right, on="key", how="right")
left
key val
0 jamie 15
1 bill 22
right
key val
0 jamie 4
1 bill 5
2 asher 8
merged
key val_x val_y
0 jamie 15.0 4
1 bill 22.0 5
2 asher NaN 8
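Pandas disambiguates the shared column name by appending the default suffixes _x and _y. If we want different names, the suffixes parameter controls this; a sketch:

pd.merge(left, right, on="key", how="right", suffixes=("_left", "_right"))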

.join()

As the name implies, .join() is also a SQL-like joiner, but it takes advantage of indexes.

It assumes the DataFrames to be joined share index values.

Here we redefine our DataFrames with indexes.

left2 = left.set_index('key').copy()
right2 = right.set_index('key').copy()
left2
val
key
jamie 15
bill 22
right2
val
key
jamie 4
bill 5
asher 8

Now we just use the method and Pandas assumes the indexes are shared.

We define a right suffix rsuffix to ensure the column names are unique.

right2.join(left2, rsuffix='_r') 
val val_r
key
jamie 4 15.0
bill 5 22.0
asher 8 NaN

This is an inner join:

right2.join(left2, how='inner', rsuffix='_r')
val val_r
key
jamie 4 15
bill 5 22

Things to Pay Attention To

  • Use join if you have shared indexes.
  • Use merge if you do not have shared indexes.
  • Use concat to combine based on shared indexes or columns.
  • Pay attention to the resulting DataFrame’s indexes and column names.
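As a quick sketch, using the left and right tables defined above, here is how the three approaches look:

left.merge(right, on='key', how='outer')                                        # merge: match on a shared column
left.set_index('key').join(right.set_index('key'), lsuffix='_l', rsuffix='_r')  # join: match on a shared index
pd.concat([left, right], ignore_index=True)                                     # concat: stack the rows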

Categoricals

It is often necessary to convert categorical data into numeric form.

Categorical data are data in which the values are members of a set, such as the species names in the Iris data set.

In machine learning, we often want to project these onto the column axis where each row can only contain one True value.

pd.get_dummies()

This function projects the values of a column onto the column axis, using the distinct values as the new column names.

This is similar to .unstack(), except that the values of each new column are boolean.

This is also called “one-hot encoding,” because only one feature can be “hot,” i.e. True, for each row.

Here are some important parameters for the function:

  • prefix: prepend a prefix to the new column names (a good idea for later use)
  • drop_first: drop the first level, since only k-1 variables are needed to represent k levels

Let’s look at an example.

Here we define a little DataFrame of cats, with a feature for breed.

cats = pd.DataFrame({'breed':['persian', 'persian', 'siamese', 'himalayan', 'burmese']})
cats
breed
0 persian
1 persian
2 siamese
3 himalayan
4 burmese

To convert to one-hot encoded form, we can do the following:

dummy_cats = pd.get_dummies(cats.breed, prefix='breed')
dummy_cats
breed_burmese breed_himalayan breed_persian breed_siamese
0 False False True False
1 False False True False
2 False False False True
3 False True False False
4 True False False False

We can drop the first column, since it can be inferred.

This is done to improve the efficiency of training a model.

pd.get_dummies(cats.breed, drop_first=True, prefix='breed')
breed_himalayan breed_persian breed_siamese
0 False True False
1 False True False
2 False False True
3 True False False
4 False False False

Notice that burmese (the first level alphabetically) was dropped, since it can be inferred from the remaining columns.
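In practice, the dummy columns are usually attached back onto the original DataFrame, for example with pd.concat(); a minimal sketch:

cats_encoded = pd.concat([cats, pd.get_dummies(cats.breed, prefix='breed', drop_first=True)], axis=1)
cats_encoded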