NB: Introducing Pandas II
Set Up
import numpy as np
import pandas as pd
import seaborn as sns
iris = sns.load_dataset('iris')
iris.head(2)
sepal_length | sepal_width | petal_length | petal_width | species | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
import sys
sys.getsizeof(iris)
14764
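sys.getsizeof() gives a rough total; pandas' own memory_usage() breaks this down per column (deep=True also counts the Python string objects). A minimal sketch:
# Per-column memory in bytes
iris.memory_usage(deep=True)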
Apply Lambda Functions with .apply()
Apply a transformation to each record. Uses a lambda function.
The apply() method should be used after you have established that you can't use a vectorized function.
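For instance, a transformation with branching logic, like the hypothetical size label below, is a case where .apply() is a reasonable choice (though even a simple branch like this could be vectorized with np.where):
# Hypothetical example: label each flower by a threshold on sepal_length
iris.sepal_length.apply(lambda x: 'long' if x > 5.5 else 'short').head()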
iris['sepal_len_sq'] = iris.sepal_length.apply(lambda x: x**2)
iris.head(5)
sepal_length | sepal_width | petal_length | petal_width | species | sepal_len_sq | |
---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 26.01 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 24.01 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 22.09 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 21.16 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 25.00 |
Transformation involving multiple columns. Uses axis=1 to access columns.
Compute the average of sepal_length and sepal_width:
iris['sepal_len_wid_avg'] = iris[['sepal_length','sepal_width']]\
    .apply(lambda x: (x.sepal_length + x.sepal_width) / 2, axis=1)
iris.head()
sepal_length | sepal_width | petal_length | petal_width | species | sepal_len_sq | sepal_len_wid_avg | |
---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 26.01 | 4.30 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 24.01 | 3.95 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 22.09 | 3.95 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 21.16 | 3.85 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 25.00 | 4.30 |
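For comparison, the same averages can be computed with vectorized column arithmetic, with no .apply() at all (a minimal sketch; .mean(axis=1) on the two-column slice is equivalent):
# Vectorized equivalent of the apply() above
(iris.sepal_length + iris.sepal_width) / 2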
Vectorized Version
%time iris.sepal_length**2
CPU times: user 306 µs, sys: 18 µs, total: 324 µs
Wall time: 343 µs
0 26.01
1 24.01
2 22.09
3 21.16
4 25.00
...
145 44.89
146 39.69
147 42.25
148 38.44
149 34.81
Name: sepal_length, Length: 150, dtype: float64
Compare to .apply()
%time iris.sepal_length.apply(lambda x: x**2)
CPU times: user 830 µs, sys: 200 µs, total: 1.03 ms
Wall time: 1.23 ms
0 26.01
1 24.01
2 22.09
3 21.16
4 25.00
...
145 44.89
146 39.69
147 42.25
148 38.44
149 34.81
Name: sepal_length, Length: 150, dtype: float64
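A single %time run can be noisy; %timeit repeats the statement and reports an average (timings will vary by machine):
%timeit iris.sepal_length**2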
Aggregation
Involves one or more of:
- splitting the data into groups
- applying a function to each group
- combining results
.groupby()
Compute mean of each column, grouped (separately) by species
"species").mean() iris.groupby(
sepal_length | sepal_width | petal_length | petal_width | sepal_len_sq | sepal_len_wid_avg | |
---|---|---|---|---|---|---|
species | ||||||
setosa | 5.006 | 3.428 | 1.462 | 0.246 | 25.1818 | 4.217 |
versicolor | 5.936 | 2.770 | 4.260 | 1.326 | 35.4972 | 4.353 |
virginica | 6.588 | 2.974 | 5.552 | 2.026 | 43.7980 | 4.781 |
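Grouped objects also support .agg() for computing several statistics at once; a minimal sketch:
# Multiple aggregations per group in one pass
iris.groupby('species').agg({'sepal_length': ['mean', 'std'], 'petal_width': 'max'})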
pd.pivot_table()
Apply a function aggfunc to selected values grouped by columns.
Compute mean sepal length for each species:
pd.pivot_table(iris, values="sepal_length", columns=["species"], aggfunc=np.mean)
species | setosa | versicolor | virginica |
---|---|---|---|
sepal_length | 5.006 | 5.936 | 6.588 |
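pivot_table() can also place the grouping in the row index, apply several aggregation functions, and add an 'All' margin; a minimal sketch:
# Mean and max sepal length per species, plus an overall 'All' row
pd.pivot_table(iris, values='sepal_length', index='species', aggfunc=['mean', 'max'], margins=True)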
Stacking and Unstacking
Similar to pivoting, but requires – and takes advantage of – indexes.
.unstack()
Let's look at what unstack() does with a dataset from Seaborn's collection.
attention = sns.load_dataset('attention')
attention.sample(10)
Unnamed: 0 | subject | attention | solutions | score | |
---|---|---|---|---|---|
5 | 5 | 6 | divided | 1 | 5.0 |
40 | 40 | 1 | divided | 3 | 7.0 |
11 | 11 | 12 | focused | 1 | 8.0 |
12 | 12 | 13 | focused | 1 | 6.0 |
24 | 24 | 5 | divided | 2 | 5.0 |
9 | 9 | 10 | divided | 1 | 6.0 |
0 | 0 | 1 | divided | 1 | 2.0 |
37 | 37 | 18 | focused | 2 | 8.0 |
7 | 7 | 8 | divided | 1 | 5.0 |
22 | 22 | 3 | divided | 2 | 5.0 |
This dataframe appears to record the results of an experiment on human attention.
Each row is a trial or observation in that experiment.
An analysis of the columns shows that score is a measured outcome, and that the subjects are probably individuals in a comparative study: two groups, one with attention divided and one with attention focused, each apply three different solutions to the performance of some task. Unnamed: 0 is just the row number as index.
The purpose of the test performed in each trial seems to be to see which solutions are best at overcoming divided attention in the performance of those tasks.
Let’s restructure our data to reflect these assumptions.
attention1 = attention.set_index(['attention','solutions','subject']).sort_index().drop('Unnamed: 0', axis=1)
attention1.head(20)
score | |||
---|---|---|---|
attention | solutions | subject | |
divided | 1 | 1 | 2.0 |
2 | 3.0 | ||
3 | 3.0 | ||
4 | 5.0 | ||
5 | 4.0 | ||
6 | 5.0 | ||
7 | 5.0 | ||
8 | 5.0 | ||
9 | 2.0 | ||
10 | 6.0 | ||
2 | 1 | 4.0 | |
2 | 4.0 | ||
3 | 5.0 | ||
4 | 7.0 | ||
5 | 5.0 | ||
6 | 5.0 | ||
7 | 4.5 | ||
8 | 7.0 | ||
9 | 3.0 | ||
10 | 5.0 |
We can use .unstack() to provide a nice, synoptic view of these data.
attention2 = attention1.unstack()
attention2.fillna('')
score | |||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
subject | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | |
attention | solutions | ||||||||||||||||||||
divided | 1 | 2.0 | 3.0 | 3.0 | 5.0 | 4.0 | 5.0 | 5.0 | 5.0 | 2.0 | 6.0 | ||||||||||
2 | 4.0 | 4.0 | 5.0 | 7.0 | 5.0 | 5.0 | 4.5 | 7.0 | 3.0 | 5.0 | |||||||||||
3 | 7.0 | 5.0 | 6.0 | 5.0 | 8.0 | 6.0 | 6.0 | 8.0 | 7.0 | 6.0 | |||||||||||
focused | 1 | 6.0 | 8.0 | 6.0 | 8.0 | 8.0 | 6.0 | 7.0 | 7.0 | 5.0 | 6.0 | ||||||||||
2 | 5.0 | 9.0 | 5.0 | 8.0 | 8.0 | 8.0 | 7.0 | 8.0 | 6.0 | 6.0 | |||||||||||
3 | 6.0 | 8.0 | 9.0 | 7.0 | 7.0 | 7.0 | 6.0 | 6.0 | 6.0 | 5.0 |
We can now clearly see the data from the two attention groups, each consisting of 10 subjects, each employing three solutions.
By unstacking again, we can get a sense of which solution worked best.
attention2.mean(1).unstack().T.style.background_gradient(axis=None)
attention | divided | focused |
---|---|---|
solutions | ||
1 | 4.000000 | 6.700000 |
2 | 4.950000 | 7.000000 |
3 | 6.400000 | 6.700000 |
It appears that solution 3 performed well, especially for the divided-attention group.
.stack()
Stack is the opposite of .unstack(), of course. It pivots the column labels into the innermost level of the row index, leaving the values in a single column.
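A toy example first, using a small made-up frame:
# stack() pivots the columns (a, b) into the innermost index level
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['x', 'y'])
df.stack()
# ...and unstack() round-trips back to the original frame
df.stack().unstack()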
Let's look at this with the taxis dataset.
taxis = sns.load_dataset('taxis')
taxis.head()
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |
taxis1 = taxis.set_index(['pickup','dropoff']).sort_index().stack().to_frame('val')
taxis1.index.names = ['pickup','dropoff','field']
taxis1.sample(10)
val | |||
---|---|---|---|
pickup | dropoff | field | |
2019-03-23 13:13:20 | 2019-03-23 13:15:13 | dropoff_zone | Hudson Sq |
2019-03-21 01:30:10 | 2019-03-21 01:50:15 | tolls | 0.0 |
2019-03-04 14:13:31 | 2019-03-04 14:24:15 | tolls | 0.0 |
2019-03-06 21:00:00 | 2019-03-06 21:04:45 | passengers | 1 |
2019-03-08 21:43:33 | 2019-03-08 21:45:48 | tolls | 0.0 |
2019-03-07 14:27:58 | 2019-03-07 14:45:49 | pickup_borough | Queens |
2019-03-14 15:11:21 | 2019-03-14 15:21:02 | dropoff_zone | Upper West Side North |
2019-03-14 16:49:58 | 2019-03-14 17:07:29 | fare | 14.5 |
2019-03-13 16:16:05 | 2019-03-13 16:28:11 | dropoff_borough | Manhattan |
2019-03-10 16:13:48 | 2019-03-10 16:17:53 | total | 7.8 |
taxis1.loc['2019-02-28 23:29:03']
val | ||
---|---|---|
dropoff | field | |
2019-02-28 23:32:35 | passengers | 1 |
distance | 0.9 | |
fare | 5.0 | |
tip | 0.0 | |
tolls | 0.0 | |
total | 6.3 | |
color | green | |
payment | cash | |
pickup_zone | Old Astoria | |
dropoff_zone | Long Island City/Queens Plaza | |
pickup_borough | Queens | |
dropoff_borough | Queens |
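Because stack and unstack are inverses, the long format can be pivoted back to wide form (a sketch, assuming the (pickup, dropoff) pairs are unique; note that the round trip leaves every column with object dtype):
# Move the 'field' level back into columns
taxis_wide = taxis1['val'].unstack('field')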
Combining DataFrames
pd.concat()
Concatenate pandas objects along an axis.
Create two dfs and vertically stack them
df1 = pd.DataFrame(np.random.randn(3, 4))
df2 = pd.DataFrame(np.random.randn(3, 4))
df1
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | -0.334874 | -1.179623 | -0.369825 | 0.748533 |
1 | -0.376463 | -1.813195 | -0.342195 | -0.732275 |
2 | 0.184883 | -0.745494 | 0.503854 | 0.497544 |
df2
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.411402 | -0.543350 | -0.020294 | 0.789886 |
1 | 0.786837 | 0.960382 | -1.093336 | 0.551457 |
2 | 0.582368 | 1.606100 | 1.715443 | -0.481616 |
Concat rows
df3 = pd.concat([df1, df2], axis=0)
df3
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | -0.334874 | -1.179623 | -0.369825 | 0.748533 |
1 | -0.376463 | -1.813195 | -0.342195 | -0.732275 |
2 | 0.184883 | -0.745494 | 0.503854 | 0.497544 |
0 | 1.411402 | -0.543350 | -0.020294 | 0.789886 |
1 | 0.786837 | 0.960382 | -1.093336 | 0.551457 |
2 | 0.582368 | 1.606100 | 1.715443 | -0.481616 |
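Note the duplicated row labels 0, 1, 2 in df3. If the labels don't carry meaning, ignore_index renumbers the result (a minimal sketch):
# Renumber the rows 0..5 instead of keeping each frame's labels
pd.concat([df1, df2], axis=0, ignore_index=True)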
Concat columns
This assumes that the indexes represent IDs of specific things or events.
df4 = pd.concat([df1, df2], axis=1, keys=['foo', 'bar'])
df4
foo | bar | |||||||
---|---|---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 | |
0 | -0.334874 | -1.179623 | -0.369825 | 0.748533 | 1.411402 | -0.543350 | -0.020294 | 0.789886 |
1 | -0.376463 | -1.813195 | -0.342195 | -0.732275 | 0.786837 | 0.960382 | -1.093336 | 0.551457 |
2 | 0.184883 | -0.745494 | 0.503854 | 0.497544 | 0.582368 | 1.606100 | 1.715443 | -0.481616 |
df4.foo
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | -0.334874 | -1.179623 | -0.369825 | 0.748533 |
1 | -0.376463 | -1.813195 | -0.342195 | -0.732275 |
2 | 0.184883 | -0.745494 | 0.503854 | 0.497544 |
.merge()
SQL-style joining of tables (DataFrames) – although Pandas has a .join() method, too.
Important parameters include:
- how: type of merge {'left', 'right', 'outer', 'inner', 'cross'}, default 'inner'
- on: names to join on
Create two tables, left and right. Then right join them on key.
Right join means include all records from the table on the right.
The key is used for matching up the records.
= pd.DataFrame({"key": ["jamie", "bill"], "lval": [15, 22]})
left = pd.DataFrame({"key": ["jamie", "bill", "asher"], "rval": [4, 5, 8]})
right = pd.merge(left, right, on="key", how="right") merged
left
key | lval | |
---|---|---|
0 | jamie | 15 |
1 | bill | 22 |
right
key | rval | |
---|---|---|
0 | jamie | 4 |
1 | bill | 5 |
2 | asher | 8 |
merged
key | lval | rval | |
---|---|---|---|
0 | jamie | 15.0 | 4 |
1 | bill | 22.0 | 5 |
2 | asher | NaN | 8 |
Notice the NaN inserted into the record with key='asher', since the left table didn't contain the key.
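merge() can also flag where each row came from via the indicator parameter (a minimal sketch):
# Adds a _merge column with 'both', 'left_only', or 'right_only'
pd.merge(left, right, on="key", how="right", indicator=True)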
Matching column names
In this next example, the value columns have the same name: val. Notice what happens to the column names.
= pd.DataFrame({"key": ["jamie", "bill"], "val": [15, 22]})
left = pd.DataFrame({"key": ["jamie", "bill", "asher"], "val": [4, 5, 8]})
right = pd.merge(left, right, on="key", how="right") merged
left
key | val | |
---|---|---|
0 | jamie | 15 |
1 | bill | 22 |
right
key | val | |
---|---|---|
0 | jamie | 4 |
1 | bill | 5 |
2 | asher | 8 |
merged
key | val_x | val_y | |
---|---|---|---|
0 | jamie | 15.0 | 4 |
1 | bill | 22.0 | 5 |
2 | asher | NaN | 8 |
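The default _x/_y suffixes can be overridden with the suffixes parameter (a minimal sketch):
# Choose the disambiguating suffixes explicitly
pd.merge(left, right, on="key", how="right", suffixes=("_left", "_right"))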
.join()
An SQL-like joiner, but this one takes advantage of indexes.
Give our dataframes indexes. Note that both keep the column name val, so we'll need a suffix when joining.
left2 = left.set_index('key').copy()
right2 = right.set_index('key').copy()
left2
val | |
---|---|
key | |
jamie | 15 |
bill | 22 |
right2
val | |
---|---|
key | |
jamie | 4 |
bill | 5 |
asher | 8 |
right2.join(left2, rsuffix='_y')  # Defaults to how='left'
val | val_y | |
---|---|---|
key | ||
jamie | 4 | 15.0 |
bill | 5 | 22.0 |
asher | 8 | NaN |
right2.join(left2, rsuffix='_y', how='inner')
val | val_y | |
---|---|---|
key | ||
jamie | 4 | 15 |
bill | 5 | 22 |
Summary
- Use join if you have shared indexes
- Use merge if you do not have shared indexes
- Use concat to combine based on shared indexes or columns
- Pay attention to resulting dataframe indexes and column names
Reshape with .reshape()
Changes the object’s shape
We illustrate creating a pandas Series, extracting its values as an array of length 6, and reshaping them to a 3x2 array.
Create a series:
ser = pd.Series([1, 1, 2, 3, 5, 8])
Extract values:
vals = ser.values
vals
array([1, 1, 2, 3, 5, 8])
type(vals)
numpy.ndarray
vals.shape
(6,)
Reshaping a series:
reshaped_vals = vals.reshape((3, 2))
reshaped_vals
array([[1, 1],
[2, 3],
[5, 8]])
type(reshaped_vals)
numpy.ndarray
reshaped_vals.shape
(3, 2)
Including -1 as one of the dimensions tells numpy: infer this dimension from the data and the other dimensions.
Example: enforce 3 columns:
vals.reshape(-1, 3)
array([[1, 1, 2],
[3, 5, 8]])
Enforce 3 rows:
vals.reshape(3, -1)
array([[1, 1],
[2, 3],
[5, 8]])
Notice the shape of the original array: (6,).
This is a one-dimensional vector, which is different from a two-dimensional (6,1) array.
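To get the explicit two-dimensional column form, reshape with an inferred row count and one column (a minimal sketch):
# (6,) vector -> (6, 1) column array
vals.reshape(-1, 1)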
Categoricals
Categorical data takes discrete values where computation on the values does not make sense.
Zip code is a typical example.
To include categoricals in models, they often must be converted to numeric form.
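pandas also offers a dedicated category dtype, which stores each level once and represents values as integer codes; a minimal sketch:
# Convert to categorical; .cat.codes exposes the underlying integer codes
iris['species'].astype('category').cat.codes.head()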
get_dummies()
Dummy code categorical data.
Important parameters:
- prefix: append prefix to column names (a good idea for later use)
- drop_first: remove first level, as only k-1 variables are needed to represent k levels
cats = pd.DataFrame({'breed':['persian', 'persian', 'siamese', 'himalayan', 'burmese']})
cats
breed | |
---|---|
0 | persian |
1 | persian |
2 | siamese |
3 | himalayan |
4 | burmese |
dummy_cats = pd.get_dummies(cats.breed, prefix='breed')
dummy_cats
breed_burmese | breed_himalayan | breed_persian | breed_siamese | |
---|---|---|---|---|
0 | 0 | 0 | 1 | 0 |
1 | 0 | 0 | 1 | 0 |
2 | 0 | 0 | 0 | 1 |
3 | 0 | 1 | 0 | 0 |
4 | 1 | 0 | 0 | 0 |
pd.get_dummies(cats.breed, drop_first=True, prefix='breed')
breed_himalayan | breed_persian | breed_siamese | |
---|---|---|---|
0 | 0 | 1 | 0 |
1 | 0 | 1 | 0 |
2 | 0 | 0 | 1 |
3 | 1 | 0 | 0 |
4 | 0 | 0 | 0 |
Notice burmese was dropped (first level by alphabet) since it can be inferred.
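The dummy columns are often concatenated back onto the original frame (a sketch using the cats frame above):
# Attach the dummy columns alongside the original breed column
pd.concat([cats, dummy_cats], axis=1)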
Let’s try it on the iris
dataset.
iris
sepal_length | sepal_width | petal_length | petal_width | species | sepal_len_sq | sepal_len_wid_avg | |
---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 26.01 | 4.30 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 24.01 | 3.95 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 22.09 | 3.95 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 21.16 | 3.85 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 25.00 | 4.30 |
... | ... | ... | ... | ... | ... | ... | ... |
145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica | 44.89 | 4.85 |
146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica | 39.69 | 4.40 |
147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica | 42.25 | 4.75 |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica | 38.44 | 4.80 |
149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica | 34.81 | 4.45 |
150 rows × 7 columns
Calling get_dummies() on the whole dataframe will handle all categoricals for you.
Look at what happened to the species column.
pd.get_dummies(iris)
sepal_length | sepal_width | petal_length | petal_width | sepal_len_sq | sepal_len_wid_avg | species_setosa | species_versicolor | species_virginica | |
---|---|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | 26.01 | 4.30 | 1 | 0 | 0 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | 24.01 | 3.95 | 1 | 0 | 0 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | 22.09 | 3.95 | 1 | 0 | 0 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | 21.16 | 3.85 | 1 | 0 | 0 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | 25.00 | 4.30 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
145 | 6.7 | 3.0 | 5.2 | 2.3 | 44.89 | 4.85 | 0 | 0 | 1 |
146 | 6.3 | 2.5 | 5.0 | 1.9 | 39.69 | 4.40 | 0 | 0 | 1 |
147 | 6.5 | 3.0 | 5.2 | 2.0 | 42.25 | 4.75 | 0 | 0 | 1 |
148 | 6.2 | 3.4 | 5.4 | 2.3 | 38.44 | 4.80 | 0 | 0 | 1 |
149 | 5.9 | 3.0 | 5.1 | 1.8 | 34.81 | 4.45 | 0 | 0 | 1 |
150 rows × 9 columns
You can call it on numeric columns, too.
pd.get_dummies(iris.sepal_length).sum().plot.bar()
<Axes: >