import numpy as np
NB: NumPy Continued
Quick Refresher on Shape
The shape of an array is represented as a tuple, e.g. (n, m) for a 2D array.
- The length of the tuple is the number of dimensions (i.e. axes).
- The values of the tuple are the number of elements in each dimension (axis).
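Both points can be checked directly in code. A minimal sketch, using a made-up 3D array:
a = np.zeros((2, 3, 4))  # hypothetical example array
a.ndim   # 3 -> the length of the shape tuple is the number of axes
a.shape  # (2, 3, 4) -> elements per axis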
Consider the array foo:
foo = np.ones((6,4))
foo
array([[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.]])
It was created with \(2\) dimensions.
- The first dimension, axis 0, has \(6\) elements. In this case, these elements are arrays.
- The second dimension, axis 1, has \(4\) elements. Its elements are scalars (numbers in this case).
The shape of foo[0] is \((4,)\). It contains \(1\) axis with \(4\) elements.
foo[0], foo[0].shape
(array([1., 1., 1., 1.]), (4,))
It has a shape of \((4,)\) and not \(4 \times 1\) because it is a vector, not a matrix.
So, there is a difference between a vector and a 1-column matrix.
Reshaping
If we want to make it into a 1-column matrix, we need to reshape it using np.reshape().
Note that the first value of the shape argument is \(-1\). This tells NumPy to infer that dimension from the length of the vector passed in.
## np.reshape?
foo2 = np.reshape(foo[0], (-1, 1))
foo2, foo2.shape
(array([[1.],
[1.],
[1.],
[1.]]),
(4, 1))
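As an aside, an equivalent way to get a 1-column matrix (not shown in the original example) is to index with np.newaxis:
foo[0][:, np.newaxis]        # same (4, 1) result as np.reshape(foo[0], (-1, 1))
foo[0][:, np.newaxis].shape  # (4, 1)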
When indexing an array, think of the positions in the comma-delimited tuple as the axes. The values are the element offsets or slices along each axis.
foo[2:, 2:]
##  ^   ^
##  0   1 <- AXIS NUMBER
array([[1., 1.],
[1., 1.],
[1., 1.],
[1., 1.]])
Example: The Normal Equation
Sometimes, you need to convert vectors into 1-column matrices to make certain linear algebraic functions work.
Consider two random variables, \(x\) and \(y\).
x = np.random.randn(100)
y = np.random.randn(100)
We can fit a regression line using the normal equation, which appears in this week’s homework.
\(\begin{aligned} \hat\beta_i=(x^Tx)^{-1}x^Ty \end{aligned}\)
Here is a standard way of expressing it in NumPy:
get_beta1 = lambda x, y: (np.linalg.inv(x.T.dot(x))) * (x.T.dot(y))
However, it will fail if we pass it our two variables, x and y.
The reason is that it expects x to be a matrix, since it is designed to handle n-dimensional predictor variables, usually represented as \(\textbf{X}\).
get_beta1(x, y)
LinAlgError: 0-dimensional array given. Array must be at least two-dimensional
The revised function will work with a vector as x:
\(\hat\beta_i = \large\frac{1}{x^Tx} \small(x^Ty)\)
get_beta2 = lambda x, y: (1 / x.T.dot(x)) * (x.T.dot(y))
get_beta2(x, y)
0.025226057106433126
We can fix the problem in the general case by converting our vector into a matrix using np.reshape().
x_as_matrix = np.reshape(x, (-1, 1))
get_beta1(x_as_matrix, y)
array([[0.02522606]])
get_beta1(x_as_matrix, y)[0][0]
0.025226057106433126
get_beta1(x_as_matrix, y)[0][0] == get_beta2(x, y)
True
One take-away here is that there is a difference between a scalar value and a \(1 \times 1\) array.
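For the general, multi-predictor case, the normal equation is usually written with matrix multiplication rather than element-wise *. A sketch, assuming a made-up second predictor column stacked next to x:
X = np.column_stack([x, np.random.randn(100)])  # hypothetical 2-column design matrix
beta_hat = np.linalg.inv(X.T @ X) @ (X.T @ y)   # normal equation via matrix multiplication
beta_hat.shape                                  # (2,) -- one coefficient per predictor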
Broadcasting
What happens when you try to perform an element-wise operation on two arrays of different shape?
NumPy will convert a low-dimensional array into a high-dimensional array to allow the operation to take place.
This is called broadcasting.
Let’s look at our array foo:
foo
array([[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.]])
If we multiply it by 5, the scalar is converted into an array of the same shape as foo
with the value 5 broadcast to populate the entire array.
foo * 5
array([[5., 5., 5., 5.],
[5., 5., 5., 5.],
[5., 5., 5., 5.],
[5., 5., 5., 5.],
[5., 5., 5., 5.],
[5., 5., 5., 5.]])
If we want to multiply an array by a vector, the vector is broadcast to become a 2D array.
foo * np.array([5, 10, 6, 8])
array([[ 5., 10., 6., 8.],
[ 5., 10., 6., 8.],
[ 5., 10., 6., 8.],
[ 5., 10., 6., 8.],
[ 5., 10., 6., 8.],
[ 5., 10., 6., 8.]])
Note that NumPy can’t always make the adjustment:
foo * np.array([5, 10])
ValueError: operands could not be broadcast together with shapes (6,4) (2,)
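Broadcasting compares shapes from the trailing axis backwards: each pair of dimensions must be equal, or one of them must be \(1\). So a \((6, 1)\) column vector does broadcast against the \((6, 4)\) array. A sketch with arbitrary values:
col = np.array([5, 10, 6, 8, 2, 3]).reshape(-1, 1)  # shape (6, 1)
foo * col  # each row of foo is scaled by the matching value in col -> shape (6, 4)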
Array-Oriented Programming
Using NumPy arrays enables you to express many kinds of data processing tasks as concise array expressions without writing loops.
This practice of replacing explicit loops with array expressions is referred to by some people as vectorization.
Vectorized array operations are often significantly faster than their pure Python equivalents.
They are also visually concise and elegant, although loops have the virtue of visualizing what’s under the hood in an algorithm.
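As a rough illustration of the speed difference (the array name and size here are arbitrary), we can time the same doubling operation both ways:
big = np.random.randn(1_000_000)
%timeit [v * 2 for v in big]   # explicit Python loop
%timeit big * 2                # vectorized NumPy expression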
Expressing Conditional Logic as Array Operations
np.where()
The np.where function is a vectorized version of the ternary expression x if condition else y.
Suppose we had a boolean array and two arrays of values:
xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])
result = [(x if c else y) for x, y, c in zip(xarr, yarr, cond)]
result
[1.1, 2.2, 1.3, 1.4, 2.5]
Here is the vectorized version:
result = np.where(cond, xarr, yarr)
result
array([1.1, 2.2, 1.3, 1.4, 2.5])
arr = np.random.randn(4, 4)
arr
array([[-0.07558452, 0.06060905, 0.08395713, -0.50584862],
[-1.02703874, -2.08227842, -0.70058386, 0.42922211],
[ 1.65264793, -0.0081504 , -0.88014697, 0.53720441],
[ 0.98719467, -2.26285173, 0.51148161, -0.1798714 ]])
arr > 0
array([[False, True, True, False],
[False, False, False, True],
[ True, False, False, True],
[ True, False, True, False]])
np.where(arr > 0, 2, -2)
array([[-2, 2, 2, -2],
[-2, -2, -2, 2],
[ 2, -2, -2, 2],
[ 2, -2, 2, -2]])
np.where(arr > 0, 2, arr)  # set only positive values to 2
array([[-0.07558452, 2. , 2. , -0.50584862],
[-1.02703874, -2.08227842, -0.70058386, 2. ],
[ 2. , -0.0081504 , -0.88014697, 2. ],
[ 2. , -2.26285173, 2. , -0.1798714 ]])
Mathematical and Statistical Methods
Statistical computations are aggregate functions applied to vectors within an array.
In a 2D array, they can be applied to rows or columns, i.e. axis \(0\) or axis \(1\).
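A quick concrete sketch with a made-up \(2 \times 3\) array, so the two axes are easy to tell apart:
tiny = np.array([[1., 2., 3.],
                 [4., 5., 6.]])
tiny.mean(axis=0)  # column means: array([2.5, 3.5, 4.5])
tiny.mean(axis=1)  # row means: array([2., 5.])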
Let’s create an array of random values. We can think of it as a table of observations and random variables.
arr = np.random.randn(5, 4)
arr
.mean()
arr.mean()
np.mean(arr)
arr.mean(axis=1)
arr.mean(axis=0)
.sum()
arr.sum()
Row-wise aggregation (over axis 0):
arr.sum(axis=0)
Column-wise aggregation (over axis 1):
arr.mean(axis=1)
.cumsum()
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7])
arr.cumsum()
arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
arr
arr.cumsum(axis=0)
arr.cumprod(axis=1)
Methods for Boolean Arrays
.sum()
Since booleans are \(0\)s and \(1\)s, we can sum them to get a total truth count.
arr = np.random.randn(100)
(arr > 0).sum()  # Number of positive values
bools = np.array([False, False, True, False])
.any()
bools.any()
.all()
bools.all()
Sorting
.sort()
arr = np.random.randn(6)
arr
arr.sort()
arr
arr = np.random.randn(5, 3)
arr
arr.sort(1)
arr
large_arr = np.random.randn(1000)
large_arr.sort()
large_arr[int(0.05 * len(large_arr))]  # 5% quantile
0.05 * len(large_arr)
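One detail worth noting: arr.sort() sorts the array in place and returns None, while np.sort(arr) returns a sorted copy and leaves the original untouched. A small sketch (the variable name is arbitrary):
unsorted = np.random.randn(5)
np.sort(unsorted)  # sorted copy
unsorted           # original order unchanged
unsorted.sort()    # sorts in place
unsorted           # now sorted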
Unique and Other Set Logic
np.unique()
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
np.unique(names)
sorted(set(names))
ints = np.array([3, 3, 3, 2, 2, 1, 1, 4, 4])
np.unique(ints)
np.in1d()
Tests whether each element of a 1-D array is also present in a second array.
values = np.array([6, 0, 0, 3, 2, 5, 6])
np.in1d(values, [2, 3, 6])
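Newer NumPy releases recommend np.isin, which performs the same membership test (and also works on arrays of any shape):
np.isin(values, [2, 3, 6])  # same boolean result as np.in1d above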
File Input and Output with Arrays
np.save()
Save an array to a binary file in NumPy .npy format.
Automatically adds the .npy file extension.
arr = np.arange(10)
np.save('some_array', arr)
np.load()
np.load('some_array.npy')
np.savetxt()
Save an array to a text file.
## np.savetxt?
np.savetxt('some_array.txt', arr)
## !more some_array.txt
np.savez()
Save several arrays into a single file in uncompressed .npz format.
np.savez('array_archive.npz', a=arr, b=arr)
arch = np.load('array_archive.npz')
arch['b']
np.savez_compressed('arrays_compressed.npz', a=arr, b=arr)
Clean up …
!rm some_array.npy
!rm some_array.txt
!rm array_archive.npz
!rm arrays_compressed.npz
Linear Algebra
.dot()
Dot product of two arrays. Specifically,
- If both a and b are 1-D arrays, it is the inner product of vectors (without complex conjugation).
- If both a and b are 2-D arrays, it is matrix multiplication, but using matmul() or a @ b is preferred.
- If either a or b is 0-D (scalar), it is equivalent to multiply() and using numpy.multiply(a, b) or a * b is preferred.
- If a is an N-D array and b is a 1-D array, it is a sum product over the last axis of a and b.
- If a is an N-D array and b is an M-D array (where M >= 2), it is a sum product over the last axis of a and the second-to-last axis of b:
dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m])
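A quick sketch of the first two rules, using made-up arrays:
u = np.array([1., 2., 3.])
v = np.array([4., 5., 6.])
u.dot(v)                       # 1-D with 1-D: inner product -> 32.0
A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])
np.allclose(A.dot(B), A @ B)   # 2-D with 2-D: same as matrix multiplication -> True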
x = np.array([[1., 2., 3.], [4., 5., 6.]])
y = np.array([[6., 23.], [-1, 7], [8, 9]])
x, y
y
x.dot(y)
np.dot(x, y)
np.dot(x, np.ones(3))
In NumPy, the @ operator means matrix multiplication.
x @ np.ones(3)
np.linalg.inv()
from numpy.linalg import inv, qr
X = np.random.randn(5, 5)
mat = X.T.dot(X)
inv(mat)
mat.dot(inv(mat))
q, r = qr(mat)
r
Pseudorandom Number Generation
np.random.normal()
samples = np.random.normal(size=(4, 4))
samples
from random import normalvariate
N = 1000000
%timeit samples = [normalvariate(0, 1) for _ in range(N)]
%timeit np.random.normal(size=N)
np.random.seed(1234)
rng = np.random.RandomState(1234)
rng.randn(10)
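As an aside, newer NumPy code typically uses the Generator interface returned by np.random.default_rng rather than the legacy RandomState; a minimal sketch:
gen = np.random.default_rng(1234)  # Generator object
gen.standard_normal(10)            # analogous to rng.randn(10)
gen.integers(0, 2, size=5)         # analogous to np.random.randint(0, 2, size=5)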
Example: Random Walks
Let’s simulate a random walk. The walk will be represented as a vector.
We’ll do it first with a loop, then with vectorization.
Loops
import random
position = 0
walk = [position]  # initialize the walk
steps = 1000
for i in range(steps):
    step = 1 if random.randint(0, 1) else -1  # Coin toss
    position += step
    walk.append(position)
import matplotlib.pyplot as plt
plt.figure()
plt.plot(walk[:100]);
Vectors
np.random.seed(12345)
nsteps = 1000
draws = np.random.randint(0, 2, size=nsteps)
steps = np.where(draws > 0, 1, -1)
walk2 = steps.cumsum()
plt.plot(walk2[:100]);
walk2.min(), walk2.max()
Simulating Many Random Walks at Once
nwalks = 5000
nsteps = 1000
draws = np.random.randint(0, 2, size=(nwalks, nsteps))  # 0 or 1
steps = np.where(draws > 0, 1, -1)
walks = steps.cumsum(1)
draws
walks
walks.max(), walks.min()
hits30 = (np.abs(walks) >= 30).any(1)
hits30
hits30.sum()  # Number that hit 30 or -30
crossing_times = (np.abs(walks[hits30]) >= 30).argmax(1)
crossing_times.mean()
steps = np.random.normal(loc=0, scale=0.25, size=(nwalks, nsteps))
Feel free to experiment with distributions for the steps other than equal-sized coin flips. You need only use a different random generator method, like standard_normal, to generate normally distributed steps with some mean and standard deviation:
draws = 0.25 * rng.standard_normal((nwalks, nsteps))  # Walks: observations, Steps: features
sample_walk = np.random.choice(len(draws))
plt.plot(draws[sample_walk])
plt.title(f"Walk #{sample_walk}");