Programming for Data Science
In this notebook, we will dive deeper into Pandas.
We will look at some ways to transform, manipulate, and combine data in the process of conducting data analysis.
There is no way to cover all of Pandas, of course, so the goal here is not to have you memorize each function and method.
Rather, you should get a sense of how Pandas works and of the patterns by which certain problems are solved.
The best way to acquire competency in the use of Pandas, or any programming language or library, is to use it and have good documentation by your side.
Setting Up
Let’s work with the Iris dataset again.
import numpy as np
import pandas as pd
import seaborn as sns
iris = sns.load_dataset('iris')
iris.head()
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
Let’s configure the index once we load the dataset.
iris.index.name = 'obs_id'
iris = iris.reset_index().set_index(['species', 'obs_id'])
iris.head()
                sepal_length  sepal_width  petal_length  petal_width
species obs_id
setosa  0                5.1          3.5           1.4          0.2
        1                4.9          3.0           1.4          0.2
        2                4.7          3.2           1.3          0.2
        3                4.6          3.1           1.5          0.2
        4                5.0          3.6           1.4          0.2
You may wonder why we used .reset_index() in the previous line of code.
This method pops the index columns into the data column space, leaving the default sequence of offsets as the index.
iris.reset_index().head()
  species  obs_id  sepal_length  sepal_width  petal_length  petal_width
0  setosa       0           5.1          3.5           1.4          0.2
1  setosa       1           4.9          3.0           1.4          0.2
2  setosa       2           4.7          3.2           1.3          0.2
3  setosa       3           4.6          3.1           1.5          0.2
4  setosa       4           5.0          3.6           1.4          0.2
We do this so that we can use .set_index(), which converts data columns into index columns.
Getting Statistics by Axis
We saw that we can apply statistical methods, such as .mean(), to Series data.
We can also apply them to whole DataFrames, but we need to tell Pandas just how to apply the method.
If we use a statistical method on a DataFrame, Pandas defaults to performing the operation over rows for each column Series:
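iris.mean()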
sepal_length 5.843333
sepal_width 3.057333
petal_length 3.758000
petal_width 1.199333
dtype: float64
We can specify this explicitly with the axis parameter:
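iris.mean(axis=0)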
sepal_length 5.843333
sepal_width 3.057333
petal_length 3.758000
petal_width 1.199333
dtype: float64
If we wanted the applied method to get the mean over columns for each row, we’d set axis to \(1\):
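iris.mean(axis=1)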
species    obs_id
setosa     0         2.550
           1         2.375
           2         2.350
           3         2.350
           4         2.550
                      ...
virginica  145       4.300
           146       3.925
           147       4.175
           148       4.325
           149       3.950
Length: 150, dtype: float64
If we want the statistic over all the data, irrespective of row or column position, we set axis to None.
In effect, we are reshaping the \(2\)-D DataFrame into a \(1\)-D Series and getting the mean.
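For example, both of the following return the grand mean (a sketch; depending on your Pandas version, axis=None may raise a deprecation warning, in which case the explicit reshape does the same thing):
iris.mean(axis=None)
iris.stack().mean()  # reshape to a 1-D Series, then take its mean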
Applying Custom Functions with .apply()
Beyond the vectorized functions provided by Pandas, such as .mean(), we can apply custom functions to our DataFrame.
You can use regular functions here, but lambda functions are particularly appropriate.
The .apply() method should be used after you have established that you can’t use an existing vectorized function.
Here we create a new feature of the square of the sepal length.
iris['sepal_len_sq'] = iris.sepal_length.apply(lambda x: x**2)
iris.head(5)
                sepal_length  sepal_width  petal_length  petal_width  sepal_len_sq
species obs_id
setosa  0                5.1          3.5           1.4          0.2         26.01
        1                4.9          3.0           1.4          0.2         24.01
        2                4.7          3.2           1.3          0.2         22.09
        3                4.6          3.1           1.5          0.2         21.16
        4                5.0          3.6           1.4          0.2         25.00
To apply a function using multiple columns for each row operation, set axis=1.
Here, we compute the average of sepal_length and sepal_width:
iris['sepal_len_wid_avg'] = iris[['sepal_length','sepal_width']]\
    .apply(lambda x: (x.sepal_length + x.sepal_width) / 2, axis=1)
iris.head()
                sepal_length  sepal_width  petal_length  petal_width  sepal_len_sq  sepal_len_wid_avg
species obs_id
setosa  0                5.1          3.5           1.4          0.2         26.01               4.30
        1                4.9          3.0           1.4          0.2         24.01               3.95
        2                4.7          3.2           1.3          0.2         22.09               3.95
        3                4.6          3.1           1.5          0.2         21.16               3.85
        4                5.0          3.6           1.4          0.2         25.00               4.30
Note the use of the continuation operator \ so we can write out code over multiple lines.
Note also that the argument x to the lambda function stands for the given row to which it is being applied:
lambda x: (x.sepal_length + x.sepal_width) / 2, axis=1
It’s as if we are iterating through the DataFrame rows and using the function in the body of the loop.
In fact, Pandas lets you do this with such methods as .iterrows() and .itertuples(), like so:
[(x.sepal_length + x.sepal_width) / 2 for x in iris.itertuples()][:5]
[4.3, 3.95, 3.95, 3.8499999999999996, 4.3]
And, in this case, it’s faster.
%timeit [(x.sepal_length + x.sepal_width) / 2 for x in iris.itertuples()]
%timeit iris.apply(lambda x: (x.sepal_length + x.sepal_width) / 2, axis=1)
367 µs ± 1.76 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
1.53 ms ± 1.86 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Using Vectorized Operations
Whenever possible, you should see if your .apply() method call can be replaced by a simple vectorized operation.
These are typically much faster.
For example, we could achieve the same results as the above use of .apply() to square a feature.
Here are two ways to do this, along with the apply method for comparison:
%timeit iris['sepal_len_sq_2'] = iris.sepal_length**2
%timeit iris['sepal_len_sq_3'] = np.square(iris.sepal_length)
106 µs ± 617 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
104 µs ± 538 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
                 sepal_length  sepal_width  petal_length  petal_width  sepal_len_sq  sepal_len_wid_avg  sepal_len_sq_2  sepal_len_sq_3
species   obs_id
setosa    0               5.1          3.5           1.4          0.2         26.01               4.30           26.01           26.01
          1               4.9          3.0           1.4          0.2         24.01               3.95           24.01           24.01
          2               4.7          3.2           1.3          0.2         22.09               3.95           22.09           22.09
          3               4.6          3.1           1.5          0.2         21.16               3.85           21.16           21.16
          4               5.0          3.6           1.4          0.2         25.00               4.30           25.00           25.00
...                       ...          ...           ...          ...           ...                ...             ...             ...
virginica 145             6.7          3.0           5.2          2.3         44.89               4.85           44.89           44.89
          146             6.3          2.5           5.0          1.9         39.69               4.40           39.69           39.69
          147             6.5          3.0           5.2          2.0         42.25               4.75           42.25           42.25
          148             6.2          3.4           5.4          2.3         38.44               4.80           38.44           38.44
          149             5.9          3.0           5.1          1.8         34.81               4.45           34.81           34.81

150 rows × 8 columns
We can see that both versions are faster than using .apply().
%timeit iris['sepal_len_sq'] = iris.sepal_length.apply(lambda x: x**2)
189 µs ± 252 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Aggregation by Groups
Pandas offers a powerful set of tools to apply aggregate statistical functions to subsets of data.
The pattern is as follows:
Group, or “split,” the data by the distinctive values in a column or set of columns.
Apply a statistic to each group.
Combine the results in a table where each row stands for one of those unique values and the feature is the aggregate statistic.
This pattern is called split-apply-combine, which we will discuss more when we get to R.
.groupby()
The DataFrame method we use for this pattern is .groupby().
This method takes one or more column names, including index columns, and divides the table into separate groups based on the unique combinations of the values in these columns.
Then a statistical method, such as .mean(), is called.
Here we compute the mean of each feature grouped by species:
iris.groupby("species" ).mean()
            sepal_length  sepal_width  petal_length  petal_width  sepal_len_sq  sepal_len_wid_avg  sepal_len_sq_2  sepal_len_sq_3
species
setosa             5.006        3.428         1.462        0.246       25.1818              4.217         25.1818         25.1818
versicolor         5.936        2.770         4.260        1.326       35.4972              4.353         35.4972         35.4972
virginica          6.588        2.974         5.552        2.026       43.7980              4.781         43.7980         43.7980
We can also pass function names to the .agg() method:
iris.groupby("species" ).agg(['mean' , 'median' , 'sum' , 'max' , 'min' ]).T.head(10 )
species               setosa  versicolor  virginica
sepal_length mean      5.006       5.936      6.588
             median    5.000       5.900      6.500
             sum     250.300     296.800    329.400
             max       5.800       7.000      7.900
             min       4.300       4.900      4.900
sepal_width  mean      3.428       2.770      2.974
             median    3.400       2.800      3.000
             sum     171.400     138.500    148.700
             max       4.400       3.400      3.800
             min       2.300       2.000      2.200
pd.pivot_table()
We can also use the pivot table function and apply a function, aggfunc, to a selected value column, grouped by columns.
Here we compute mean sepal length for each species:
pd.pivot_table(iris, values="sepal_length", columns=["species"], aggfunc=np.mean)
species       setosa  versicolor  virginica
sepal_length   5.006       5.936      6.588
That produces the same result as this:
iris[['sepal_length']].groupby("species").mean().T
species       setosa  versicolor  virginica
sepal_length   5.006       5.936      6.588
Stacking and Unstacking
.unstack()
Another way to achieve these results, but by taking advantage of indexes, is unstacking.
Let’s look at what .unstack() does with another dataset from Seaborn’s collection.
attention = sns.load_dataset('attention' )
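Here’s a random sample of ten rows:
attention.sample(10)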
    Unnamed: 0  subject attention  solutions  score
55          55       16   focused          3    7.0
39          39       20   focused          2    6.0
26          26        7   divided          2    4.5
46          46        7   divided          3    6.0
18          18       19   focused          1    5.0
25          25        6   divided          2    5.0
21          21        2   divided          2    4.0
33          33       14   focused          2    8.0
22          22        3   divided          2    5.0
31          31       12   focused          2    9.0
This dataframe appears to record the results of an experiment on human attention.
Each row is a trial or observation in that experiment.
An analysis of the columns shows that score is a measured outcome, and that the subjects are probably individuals in a comparative study with two groups, those with attention divided and those with attention focused, who are subjected to three different solutions applied to the performance of some task.
Unnamed: 0 is just the row number as index.
The purpose of the test performed in each trial seems to be to see which solutions are best at overcoming divided attention in the performance of those tasks.
Let’s restructure our data to reflect these assumptions.
attention1 = attention.set_index(['attention','solutions','subject']).sort_index().drop('Unnamed: 0', axis=1)
attention1.sample(10)
                             score
attention solutions subject
focused   1         15         8.0
          2         11         5.0
                    16         8.0
divided   3         3          6.0
focused   3         17         6.0
                    12         8.0
                    16         7.0
divided   3         4          5.0
          1         2          3.0
focused   2         18         8.0
We can use .unstack() to provide a nice, synoptic view of these data.
The operation takes the right-most index column and projects it onto the column axis.
Note that it requires a MultiIndex of at least two index columns.
attention2 = attention1.unstack()
attention2.fillna('-')
subject                   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
attention solutions
divided   1             2.0  3.0  3.0  5.0  4.0  5.0  5.0  5.0  2.0  6.0    -    -    -    -    -    -    -    -    -    -
          2             4.0  4.0  5.0  7.0  5.0  5.0  4.5  7.0  3.0  5.0    -    -    -    -    -    -    -    -    -    -
          3             7.0  5.0  6.0  5.0  8.0  6.0  6.0  8.0  7.0  6.0    -    -    -    -    -    -    -    -    -    -
focused   1               -    -    -    -    -    -    -    -    -    -  6.0  8.0  6.0  8.0  8.0  6.0  7.0  7.0  5.0  6.0
          2               -    -    -    -    -    -    -    -    -    -  5.0  9.0  5.0  8.0  8.0  8.0  7.0  8.0  6.0  6.0
          3               -    -    -    -    -    -    -    -    -    -  6.0  8.0  9.0  7.0  7.0  7.0  6.0  6.0  6.0  5.0
Right away we can see that participants were matched to attention mode.
To complete the pattern, we can apply .mean() to the unstacked DataFrame.
ATTENTION = attention2.mean(axis=1).to_frame('score')
ATTENTION
                     score
attention solutions
divided   1           4.00
          2           4.95
          3           6.40
focused   1           6.70
          2           7.00
          3           6.70
We can apply .unstack() again to show the results more compactly:
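ATTENTION.unstack()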
solutions    1     2    3
divided    4.0  4.95  6.4
focused    6.7  7.00  6.7
It appears that solution 3 performed well in overcoming divided attention.
.stack()
Stack is the opposite of .unstack(), of course.
It will project column labels onto the values of a single column.
Let’s look at this with the taxis dataset.
taxis = sns.load_dataset('taxis' )
taxis.head()
                pickup              dropoff  passengers  distance  fare   tip  tolls  total   color      payment            pickup_zone           dropoff_zone pickup_borough dropoff_borough
0  2019-03-23 20:21:09  2019-03-23 20:27:24           1      1.60   7.0  2.15    0.0  12.95  yellow  credit card        Lenox Hill West    UN/Turtle Bay South      Manhattan       Manhattan
1  2019-03-04 16:11:55  2019-03-04 16:19:00           1      0.79   5.0  0.00    0.0   9.30  yellow         cash  Upper West Side South  Upper West Side South      Manhattan       Manhattan
2  2019-03-27 17:53:01  2019-03-27 18:00:25           1      1.37   7.5  2.36    0.0  14.16  yellow  credit card          Alphabet City           West Village      Manhattan       Manhattan
3  2019-03-10 01:23:59  2019-03-10 01:49:51           1      7.70  27.0  6.15    0.0  36.95  yellow  credit card              Hudson Sq         Yorkville West      Manhattan       Manhattan
4  2019-03-30 13:27:42  2019-03-30 13:37:14           3      2.16   9.0  1.10    0.0  13.40  yellow  credit card           Midtown East         Yorkville West      Manhattan       Manhattan
Let’s define our indexes to reflect the structure of the data.
We define the index, sort it, and then use it to perform a stacking operation.
taxis1 = taxis.set_index(['pickup','dropoff']).sort_index().stack().to_frame('val')
taxis1.index.names = ['pickup','dropoff','field']
taxis1.sample(10)
                                                                            val
pickup               dropoff              field
2019-03-09 19:47:10  2019-03-09 20:09:17  dropoff_borough              Manhattan
2019-03-20 23:19:55  2019-03-20 23:46:00  distance                          4.82
2019-03-03 07:48:48  2019-03-03 07:53:53  distance                          1.57
2019-03-04 08:30:51  2019-03-04 08:40:54  pickup_zone      Upper East Side South
2019-03-16 11:39:52  2019-03-16 11:50:36  distance                           1.8
2019-03-03 04:11:09  2019-03-03 04:17:14  distance                           0.8
2019-03-18 16:03:14  2019-03-18 16:10:36  pickup_borough               Manhattan
2019-03-31 12:03:39  2019-03-31 12:12:51  pickup_zone                  Chinatown
2019-03-12 16:00:34  2019-03-12 16:06:28  passengers                           1
2019-03-05 21:39:03  2019-03-05 21:49:12  pickup_zone               Clinton East
Here is the data for one observation:
taxis1.loc['2019-02-28 23:29:03']
                                                                  val
dropoff              field
2019-02-28 23:32:35  passengers                                     1
                     distance                                     0.9
                     fare                                          5.0
                     tip                                           0.0
                     tolls                                         0.0
                     total                                         6.3
                     color                                       green
                     payment                                      cash
                     pickup_zone                           Old Astoria
                     dropoff_zone         Long Island City/Queens Plaza
                     pickup_borough                             Queens
                     dropoff_borough                            Queens
Combining DataFrames
There are at least \(3\) ways to combine DataFrames in Pandas: by concatenating, merging, or joining.
pd.concat()
To concatenate two DataFrames, we use pd.concat().
Here’s an example.
df1 = pd.DataFrame(np.random.randn(3, 4))
df2 = pd.DataFrame(np.random.randn(3, 4))
          0         1         2         3
0  0.361659  0.802342 -0.986953  0.350928
1  0.258543 -0.106647  1.467904  1.238903
2  2.896572  0.739011  0.705529 -0.397988

          0         1         2         3
0  0.023824  1.013029  0.200236  0.274996
1  2.256482  0.982360  0.933824  0.774376
2 -0.203034  1.013550 -0.812464 -2.777192
The Pandas function takes a list of DataFrames to combine, an optional keys argument to create a MultiIndex, and an axis parameter to choose between combining by rows or columns.
To combine by rows, set axis to \(0\).
df3 = pd.concat([df1, df2], keys=['a','b'], axis=0)
df3
            0         1         2         3
a 0  0.361659  0.802342 -0.986953  0.350928
  1  0.258543 -0.106647  1.467904  1.238903
  2  2.896572  0.739011  0.705529 -0.397988
b 0  0.023824  1.013029  0.200236  0.274996
  1  2.256482  0.982360  0.933824  0.774376
  2 -0.203034  1.013550 -0.812464 -2.777192
We set keys because the indexes of the two source DataFrames are not unique.
Here we use the key to get back the second DataFrame.
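df3.loc['b']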
          0         1         2         3
0  0.023824  1.013029  0.200236  0.274996
1  2.256482  0.982360  0.933824  0.774376
2 -0.203034  1.013550 -0.812464 -2.777192
To combine by columns, we set axis to \(1\).
This assumes that the two DataFrames share an index, i.e. that the indexes represent the same events.
df4 = pd.concat([df1, df2], axis=1, keys=['a', 'b'])
df4
          a                                         b
          0         1         2         3           0         1         2         3
0  0.361659  0.802342 -0.986953  0.350928    0.023824  1.013029  0.200236  0.274996
1  0.258543 -0.106647  1.467904  1.238903    2.256482  0.982360  0.933824  0.774376
2  2.896572  0.739011  0.705529 -0.397988   -0.203034  1.013550 -0.812464 -2.777192
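As before, we can use the key to get back the second DataFrame, this time from the column axis:
df4['b']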
          0         1         2         3
0  0.023824  1.013029  0.200236  0.274996
1  2.256482  0.982360  0.933824  0.774376
2 -0.203034  1.013550 -0.812464 -2.777192
.merge()
Another way to combine DataFrames is with the Pandas function pd.merge().
It works similarly to a SQL join.
This function takes two DataFrames as its first and second arguments.
The on parameter specifies the columns on which to join.
The how parameter specifies the type of merge, i.e. left, right, outer, inner, or cross.
Here we create two tables, left and right.
We then right join them on key.
A right join means we include all records from the table on the right; the key is used for matching up the records.
left = pd.DataFrame({"key": ["jamie", "bill"], "lval": [15, 22]})
right = pd.DataFrame({"key": ["jamie", "bill", "asher"], "rval": [4, 5, 8]})
right
     key  rval
0  jamie     4
1   bill     5
2  asher     8
merged = pd.merge(left, right, on="key", how="right")
merged
     key  lval  rval
0  jamie  15.0     4
1   bill  22.0     5
2  asher   NaN     8
Notice the NaN inserted into the record with key='asher', since the left table didn’t contain that key.
In this next example, the value columns have the same name: val. Notice what happens to the column names.
left = pd.DataFrame({"key": ["jamie", "bill"], "val": [15, 22]})
right = pd.DataFrame({"key": ["jamie", "bill", "asher"], "val": [4, 5, 8]})
merged = pd.merge(left, right, on="key", how="right")
     key  val
0  jamie    4
1   bill    5
2  asher    8

     key  val_x  val_y
0  jamie   15.0      4
1   bill   22.0      5
2  asher    NaN      8
.join()
As the name implies, .join() is also a SQL-like joiner, but it takes advantage of indexes.
It assumes the DataFrames to be joined share index values.
Here we redefine our DataFrames with indexes.
left2 = left.set_index('key').copy()
right2 = right.set_index('key').copy()
right2
       val
key
jamie    4
bill     5
asher    8
Now we just use the method, and Pandas assumes the indexes are shared.
We define a right suffix rsuffix to ensure the column names are unique.
right2.join(left2, rsuffix='_r')
       val  val_r
key
jamie    4   15.0
bill     5   22.0
asher    8    NaN
This is an inner join, which keeps only the keys present in both tables:
right2.join(left2, how='inner', rsuffix='_r')
Things to Pay Attention To
Use join if you have shared indexes.
Use merge if you do not have shared indexes.
Use concat to combine based on shared indexes or columns.
Pay attention to the resulting DataFrame indexes and column names.
Categoricals
It is often necessary to convert categorical data into numeric form.
Categorical data are data in which the values are members of a set, such as the species names in the Iris data set.
In machine learning, we often want to project these onto the column axis where each row can only contain one True value.
pd.get_dummies()
This function will project the list of values in a column onto the column axis, using the distinct values as the column names.
This is similar to .unstack(), except that the values for each of the new columns are boolean.
This is also called “one-hot encoding,” because only one feature can be “hot,” i.e. True, for each row.
Here are some important parameters for the function:
prefix: append a prefix to the column names (a good idea for later use)
drop_first: remove the first level, as only k-1 variables are needed to represent k levels
Let’s look at an example.
Here we define a little DataFrame of cats, with a feature for breed.
cats = pd.DataFrame({'breed': ['persian', 'persian', 'siamese', 'himalayan', 'burmese']})
cats
       breed
0    persian
1    persian
2    siamese
3  himalayan
4    burmese
To convert to one-hot encoded form, we can do the following:
dummy_cats = pd.get_dummies(cats.breed, prefix='breed')
dummy_cats
   breed_burmese  breed_himalayan  breed_persian  breed_siamese
0          False            False           True          False
1          False            False           True          False
2          False            False          False           True
3          False             True          False          False
4           True            False          False          False
We can drop the first column, since it can be inferred.
This is done to improve the efficiency of training a model.
pd.get_dummies(cats.breed, drop_first=True, prefix='breed')
   breed_himalayan  breed_persian  breed_siamese
0            False           True          False
1            False           True          False
2            False          False           True
3             True          False          False
4            False          False          False
Notice burmese was dropped (the first level by alphabet), since it can be inferred.