Python Pandas Exercises
Programming for Data Science Bootcamp
Set Up
import pandas as pd
import numpy as np
import seaborn as sns
np.__version__
'1.26.4'
pd.__version__
'2.2.2'
Exercise 1
Create a data frame called df by passing a dictionary of inputs with the following keys and values:
feature containing a list of $4$ floats.
labels containing a list of $4$ integers between $0$ and $2$ inclusive.
Hint: Pass the dict to pd.DataFrame().
Then show the data frame.
df = pd.DataFrame({
'feature': [0.2, -1.1, 1.6, 5.4],
'labels': [1, 1, 0, 2]
})
df
| | feature | labels |
|---|---|---|
| 0 | 0.2 | 1 |
| 1 | -1.1 | 1 |
| 2 | 1.6 | 0 |
| 3 | 5.4 | 2 |
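As a sketch, the same frame can also be built row-wise from a list of dicts ("records") rather than a dict of columns; the values here just mirror the solution above.

```python
import pandas as pd

# Equivalent construction: one dict per row instead of one list per column.
records = [
    {'feature': 0.2, 'labels': 1},
    {'feature': -1.1, 'labels': 1},
    {'feature': 1.6, 'labels': 0},
    {'feature': 5.4, 'labels': 2},
]
df_records = pd.DataFrame(records)
```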
Exercise 2
Rename the labels column in df to label.
df = df.rename(columns={'labels': 'label'})
df
| | feature | label |
|---|---|---|
| 0 | 0.2 | 1 |
| 1 | -1.1 | 1 |
| 2 | 1.6 | 0 |
| 3 | 5.4 | 2 |
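A small sketch of two equivalent renaming routes, on a copy of the frame above: `.rename` maps only the listed names, while reassigning `.columns` replaces every name at once.

```python
import pandas as pd

df = pd.DataFrame({'feature': [0.2, -1.1, 1.6, 5.4], 'labels': [1, 1, 0, 2]})

# Option 1: .rename returns a new frame and leaves unlisted columns untouched
renamed = df.rename(columns={'labels': 'label'})

# Option 2: reassigning .columns replaces all names in place
df.columns = ['feature', 'label']
```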
Exercise 3
Import the iris dataset from Seaborn and then sort by species, descending.
Hint:
iris = sns.load_dataset('iris')
iris = sns.load_dataset('iris')
iris.sort_values('species', ascending=False)
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
| 111 | 6.4 | 2.7 | 5.3 | 1.9 | virginica |
| 122 | 7.7 | 2.8 | 6.7 | 2.0 | virginica |
| 121 | 5.6 | 2.8 | 4.9 | 2.0 | virginica |
| 120 | 6.9 | 3.2 | 5.7 | 2.3 | virginica |
| ... | ... | ... | ... | ... | ... |
| 31 | 5.4 | 3.4 | 1.5 | 0.4 | setosa |
| 30 | 4.8 | 3.1 | 1.6 | 0.2 | setosa |
| 29 | 4.7 | 3.2 | 1.6 | 0.2 | setosa |
| 28 | 5.2 | 3.4 | 1.4 | 0.2 | setosa |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
150 rows × 5 columns
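`sort_values` also accepts multiple keys with per-key directions; a minimal sketch on a toy frame standing in for iris (assumed values), sorting species descending and then sepal_length ascending within each species:

```python
import pandas as pd

# Toy stand-in for iris with just the columns needed for the sort.
toy = pd.DataFrame({
    'species': ['setosa', 'virginica', 'setosa', 'virginica'],
    'sepal_length': [5.1, 6.3, 4.9, 5.8],
})
ordered = toy.sort_values(['species', 'sepal_length'],
                          ascending=[False, True])
```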
Exercise 4
Use .value_counts() to find out how many records of each species the dataset has.
iris.species.value_counts().to_frame('n')
| | n |
|---|---|
| species | |
| setosa | 50 |
| versicolor | 50 |
| virginica | 50 |
iris
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
150 rows × 5 columns
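A related option worth knowing: `value_counts(normalize=True)` returns proportions that sum to 1 instead of raw counts. A sketch on a Series standing in for `iris.species`:

```python
import pandas as pd

# Stand-in for iris.species: 50 records of each of three species.
species = pd.Series(['setosa'] * 50 + ['versicolor'] * 50 + ['virginica'] * 50)

# normalize=True converts counts to proportions
props = species.value_counts(normalize=True)
```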
Exercise 5
Show the number of records in iris where petal_length $\leq 1.4$ or petal_length $\geq 1.6$.
Hint: Remember how to implement "or" in Pandas.
Hint: Subset using boolean indexing, and count the number of resulting records with len() or .shape[0].
len(iris[(iris.petal_length <= 1.4) | (iris.petal_length >= 1.6)])
137
iris[(iris.petal_length <= 1.4) | (iris.petal_length >= 1.6)].shape[0]
137
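The condition "$\leq 1.4$ or $\geq 1.6$" is the complement of the open interval $(1.4, 1.6)$, so negating `.between` with `inclusive='neither'` (available in pandas ≥ 1.3) gives the same count as the two ORed comparisons. A sketch on toy values:

```python
import pandas as pd

# Toy petal lengths (assumed values) straddling the interval boundaries.
df = pd.DataFrame({'petal_length': [1.3, 1.4, 1.5, 1.6, 1.7]})

# The two ORed comparisons, as in the solution above
n_or = len(df[(df.petal_length <= 1.4) | (df.petal_length >= 1.6)])

# Equivalent: negate strict membership in (1.4, 1.6)
n_between = int((~df.petal_length.between(1.4, 1.6, inclusive='neither')).sum())
```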
Exercise 6
Use .apply() to append a new column that displays the minimum of (petal_length, petal_width) for each observation.
Display the head and tail of the new data frame to check that things look correct.
Hint: Use a lambda function to compute the minimum.
iris['min_petal'] = iris.apply(lambda x: min(x.petal_length, x.petal_width), axis=1)
iris.head()
| | sepal_length | sepal_width | petal_length | petal_width | species | min_petal |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 0.2 |
iris.tail()
| | sepal_length | sepal_width | petal_length | petal_width | species | min_petal |
|---|---|---|---|---|---|---|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica | 2.3 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica | 1.8 |
We can see that in all cases petal length is greater than petal width.
sum(iris.petal_length <= iris.petal_width)
0
iris[['petal_length','petal_width']].idxmin(axis=1).value_counts()
petal_width    150
Name: count, dtype: int64
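Row-wise `.apply` is slow on large frames; the same per-row minimum can be vectorized with `DataFrame.min(axis=1)` or `np.minimum`. A sketch on a two-row toy frame:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two petal columns.
df = pd.DataFrame({'petal_length': [1.4, 5.1], 'petal_width': [0.2, 1.8]})

# Row-wise apply, as in the solution above
via_apply = df.apply(lambda x: min(x.petal_length, x.petal_width), axis=1)

# Vectorized equivalents
via_min = df[['petal_length', 'petal_width']].min(axis=1)
via_np = np.minimum(df.petal_length, df.petal_width)
```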
Exercise 7
Use a pivot table to compute the following statistics on sepal_width and petal_width grouped by species:
- median
- mean
These can be computed together in a single call to pd.pivot_table().
pd.pivot_table(iris,
values = ["sepal_width", "petal_width"],
columns = ["species"],
aggfunc = {'mean', 'median'}) # A set groups the stats into rows
| | | setosa | versicolor | virginica |
|---|---|---|---|---|
| petal_width | mean | 0.246 | 1.326 | 2.026 |
| | median | 0.200 | 1.300 | 2.000 |
| sepal_width | mean | 3.428 | 2.770 | 2.974 |
| | median | 3.400 | 2.800 | 3.000 |
pd.pivot_table(iris,
values = ["sepal_width", "petal_width"],
columns = ["species"],
aggfunc = ['mean', 'median']) # A list groups the stats into columns
| | mean | | | median | | |
|---|---|---|---|---|---|---|
| species | setosa | versicolor | virginica | setosa | versicolor | virginica |
| petal_width | 0.246 | 1.326 | 2.026 | 0.2 | 1.3 | 2.0 |
| sepal_width | 3.428 | 2.770 | 2.974 | 3.4 | 2.8 | 3.0 |
pd.pivot_table(iris,
values = ["sepal_width", "petal_width"],
columns = ["species"],
aggfunc = ['mean', 'median']).T # Transposed, so the stats group rows
| | | petal_width | sepal_width |
|---|---|---|---|
| | species | | |
| mean | setosa | 0.246 | 3.428 |
| | versicolor | 1.326 | 2.770 |
| | virginica | 2.026 | 2.974 |
| median | setosa | 0.200 | 3.400 |
| | versicolor | 1.300 | 2.800 |
| | virginica | 2.000 | 3.000 |
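The same statistics can be computed with `groupby().agg()`, which puts species on the rows and a (column, stat) MultiIndex on the columns. A sketch on a toy stand-in for iris (assumed values):

```python
import pandas as pd

# Toy frame: two species, two records each.
toy = pd.DataFrame({
    'species': ['a', 'a', 'b', 'b'],
    'sepal_width': [3.0, 3.4, 2.6, 2.8],
    'petal_width': [0.2, 0.4, 1.2, 1.4],
})

# One call computes both stats for both columns, grouped by species
stats = toy.groupby('species')[['sepal_width', 'petal_width']].agg(['mean', 'median'])
```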
Exercise 8
Create two data frames with a common index column of first names and a single feature column.
In the first, name the feature column age and have two observations.
In the second, name the feature column height and have three observations.
Make sure two of the index values in the two data frames are shared.
Also, remember that indexes should have unique values.
Hint: Pass dictionaries to the data frame constructor.
left = pd.DataFrame({"key": ["jamie", "bill"], "age": [15, 22]})
left = left.set_index('key')
left
| | age |
|---|---|
| key | |
| jamie | 15 |
| bill | 22 |
right = pd.DataFrame({"key": ["jamie", "bill", "asher"], "height": [6, 5, 7]})
right = right.set_index('key')
right
| | height |
|---|---|
| key | |
| jamie | 6 |
| bill | 5 |
| asher | 7 |
Exercise 9
With the data frames you just created, use pd.merge() to join the tables using an inner join.
Then use df.join() to join the tables using an outer join.
inner = pd.merge(left, right, on="key", how="inner")
inner
| | age | height |
|---|---|---|
| key | ||
| jamie | 15 | 6 |
| bill | 22 | 5 |
outer = left.join(right, how='outer')
outer
| | age | height |
|---|---|---|
| key | ||
| asher | NaN | 7 |
| bill | 22.0 | 5 |
| jamie | 15.0 | 6 |
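`pd.merge` can also perform the outer join directly on the indexes, and `indicator=True` adds a `_merge` column recording which side each row came from, which is handy for auditing joins. A sketch using the same two frames:

```python
import pandas as pd

left = pd.DataFrame({'key': ['jamie', 'bill'], 'age': [15, 22]}).set_index('key')
right = pd.DataFrame({'key': ['jamie', 'bill', 'asher'],
                      'height': [6, 5, 7]}).set_index('key')

# Outer join on the indexes; _merge marks 'left_only', 'right_only', or 'both'
outer = pd.merge(left, right, left_index=True, right_index=True,
                 how='outer', indicator=True)
```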
Exercise 10
Create a series with the data $[1, 1, 2, 3, 5, 8]$.
Extract the data from the series in a NumPy array and reshape to $2 \times 3$.
Print both the reshaped data and its dimensions.
ser = pd.Series([1, 1, 2, 3, 5, 8])
vals = ser.values
resh = vals.reshape(2,3)
resh
array([[1, 1, 2],
[3, 5, 8]])
resh.shape
(2, 3)
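A variation worth knowing: `.to_numpy()` is the recommended modern accessor (equivalent to `.values` here), and passing `-1` to `reshape` lets NumPy infer one dimension from the array's length.

```python
import pandas as pd

ser = pd.Series([1, 1, 2, 3, 5, 8])

# -1 means "infer this dimension": 6 elements / 2 rows = 3 columns
resh = ser.to_numpy().reshape(2, -1)
```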
Exercise 11
The data frame below contains two categoricals.
Dummify each of them, giving them an appropriate prefix and dropping the first level from each.
cats = pd.DataFrame({
'breed':['persian','persian','siamese','himalayan','burmese'],
'color':['calico','white','seal point','cream','sable']
})
Hint: You can pass a list to the prefix parameter to apply prefixes to more than one column.
cats = pd.DataFrame({
'breed':['persian', 'persian', 'siamese', 'himalayan', 'burmese'],
'color':['calico', 'white', 'seal point', 'cream', 'sable']
})
cats
| | breed | color |
|---|---|---|
| 0 | persian | calico |
| 1 | persian | white |
| 2 | siamese | seal point |
| 3 | himalayan | cream |
| 4 | burmese | sable |
cats_combo = pd.get_dummies(cats, prefix=['b','c'], drop_first=True)
cats_combo
| | b_himalayan | b_persian | b_siamese | c_cream | c_sable | c_seal point | c_white |
|---|---|---|---|---|---|---|---|
| 0 | False | True | False | False | False | False | False |
| 1 | False | True | False | False | False | False | True |
| 2 | False | False | True | False | False | True | False |
| 3 | True | False | False | True | False | False | False |
| 4 | False | False | False | False | True | False | False |
For comparison, the default behavior keeps every level and uses the original column names as prefixes:
pd.get_dummies(cats)
| | breed_burmese | breed_himalayan | breed_persian | breed_siamese | color_calico | color_cream | color_sable | color_seal point | color_white |
|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | True | False | True | False | False | False | False |
| 1 | False | False | True | False | False | False | False | False | True |
| 2 | False | False | False | True | False | False | False | True | False |
| 3 | False | True | False | False | False | True | False | False | False |
| 4 | True | False | False | False | False | False | True | False | False |
pd.get_dummies(cats).T
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| breed_burmese | False | False | False | False | True |
| breed_himalayan | False | False | False | True | False |
| breed_persian | True | True | False | False | False |
| breed_siamese | False | False | True | False | False |
| color_calico | True | False | False | False | False |
| color_cream | False | False | False | True | False |
| color_sable | False | False | False | False | True |
| color_seal point | False | False | True | False | False |
| color_white | False | True | False | False | False |
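If downstream code expects numeric indicators, `dtype=int` yields 0/1 columns instead of booleans. A sketch combining it with the prefix and drop-first options:

```python
import pandas as pd

cats = pd.DataFrame({
    'breed': ['persian', 'persian', 'siamese', 'himalayan', 'burmese'],
    'color': ['calico', 'white', 'seal point', 'cream', 'sable'],
})

# drop_first removes one level per categorical; dtype=int gives 0/1 values
dummies = pd.get_dummies(cats, prefix=['b', 'c'], drop_first=True, dtype=int)
```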
Exercise 12
Create two Series of floats of unequal length, with letters for index names.
In the first, s1, use these arguments in the constructor pd.Series():
[7.3, -2.5, 3.4, 1.5] and index=['a', 'c', 'd', 'e']
In the second, s2, use these:
[-2.1, 3.6, -1.5, 4, 3.1] and index=['a', 'c', 'e', 'f', 'g']
Then add the two together with the plus + operator. What do you see?
s1 = pd.Series([7.3, -2.5, 3.4, 1.5],
               index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
               index=['a', 'c', 'e', 'f', 'g'])
s1
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64
s2
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64
s1 + s2
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
Note that Pandas aligns on the index: it assumes the indexes belong to a common domain, so labels missing from either series yield NaN.
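If NaN is not what you want, the `.add` method with `fill_value` treats a label missing from one side as that value, so every label present in either series gets a number. A sketch with the same two series:

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

# Labels present on only one side are summed against 0 instead of NaN
total = s1.add(s2, fill_value=0)
```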
Demonstration
We demonstrate the same principle of index alignment with data frames.
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)),
                   columns=list('bcd'), index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                   columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1
| | b | c | d |
|---|---|---|---|
| Ohio | 0.0 | 1.0 | 2.0 |
| Texas | 3.0 | 4.0 | 5.0 |
| Colorado | 6.0 | 7.0 | 8.0 |
df2
| | b | d | e |
|---|---|---|---|
| Utah | 0.0 | 1.0 | 2.0 |
| Ohio | 3.0 | 4.0 | 5.0 |
| Texas | 6.0 | 7.0 | 8.0 |
| Oregon | 9.0 | 10.0 | 11.0 |
df1 + df2
| | b | c | d | e |
|---|---|---|---|---|
| Colorado | NaN | NaN | NaN | NaN |
| Ohio | 3.0 | NaN | 6.0 | NaN |
| Oregon | NaN | NaN | NaN | NaN |
| Texas | 9.0 | NaN | 12.0 | NaN |
| Utah | NaN | NaN | NaN | NaN |
(df1 + df2).fillna(0).style\
.background_gradient(axis=None, cmap='YlGnBu')
| | b | c | d | e |
|---|---|---|---|---|
| Colorado | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| Ohio | 3.000000 | 0.000000 | 6.000000 | 0.000000 |
| Oregon | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| Texas | 9.000000 | 0.000000 | 12.000000 | 0.000000 |
| Utah | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
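A subtler alternative to the blanket `fillna(0)`: `df1.add(df2, fill_value=0)` substitutes 0 only where a cell exists in exactly one frame, so cells absent from both (e.g. Colorado, column 'e') remain NaN. A sketch with the same two frames:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)),
                   columns=list('bcd'), index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                   columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Colorado/'b' exists only in df1, so it is summed against 0;
# Colorado/'e' exists in neither frame, so it stays NaN.
summed = df1.add(df2, fill_value=0)
```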
Exercise 13
Using the penguins dataset from Seaborn, generate a correlation table of the numeric features.
Then create a thin table using .unstack().
Name the index columns f1 and f2 and the data column r.
As a bonus, display only rows where f1 and f2 are not identical and where there are no reversals: for example, if ('a', 'b') exists in the index, then ('b', 'a') does not.
penguins = sns.load_dataset('penguins')
W = penguins.corr(numeric_only=True)
W
| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| bill_length_mm | 1.000000 | -0.235053 | 0.656181 | 0.595110 |
| bill_depth_mm | -0.235053 | 1.000000 | -0.583851 | -0.471916 |
| flipper_length_mm | 0.656181 | -0.583851 | 1.000000 | 0.871202 |
| body_mass_g | 0.595110 | -0.471916 | 0.871202 | 1.000000 |
N = W.unstack().to_frame('r')
N.index.names = ['f1','f2']
N
| | | r |
|---|---|---|
| f1 | f2 | |
| bill_length_mm | bill_length_mm | 1.000000 |
| | bill_depth_mm | -0.235053 |
| | flipper_length_mm | 0.656181 |
| | body_mass_g | 0.595110 |
| bill_depth_mm | bill_length_mm | -0.235053 |
| | bill_depth_mm | 1.000000 |
| | flipper_length_mm | -0.583851 |
| | body_mass_g | -0.471916 |
| flipper_length_mm | bill_length_mm | 0.656181 |
| | bill_depth_mm | -0.583851 |
| | flipper_length_mm | 1.000000 |
| | body_mass_g | 0.871202 |
| body_mass_g | bill_length_mm | 0.595110 |
| | bill_depth_mm | -0.471916 |
| | flipper_length_mm | 0.871202 |
| | body_mass_g | 1.000000 |
Show unique and non-identical pairs.
X = N.query("f1 < f2").sort_values('r', ascending=False)
X
| | | r |
|---|---|---|
| f1 | f2 | |
| body_mass_g | flipper_length_mm | 0.871202 |
| bill_length_mm | flipper_length_mm | 0.656181 |
| | body_mass_g | 0.595110 |
| bill_depth_mm | bill_length_mm | -0.235053 |
| | body_mass_g | -0.471916 |
| | flipper_length_mm | -0.583851 |
X.r.plot.barh();
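An equivalent route to the thin table uses `.stack()` plus `reset_index`, filtering on ordinary columns instead of index levels. A sketch on a random toy frame (the column names x, y, z are illustrative, not from penguins):

```python
import numpy as np
import pandas as pd

# Random toy data standing in for the penguins numeric columns.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(50, 3)), columns=['x', 'y', 'z'])
W = data.corr()

# Stack the correlation matrix into a long frame of (f1, f2, r) rows
thin = W.stack().reset_index()
thin.columns = ['f1', 'f2', 'r']

# Keep one copy of each off-diagonal pair: no self-pairs, no reversals
thin = thin[thin.f1 < thin.f2]
```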