import numpy as npNB: NumPy Continued
Quick Refresher on Shape
The shape of an array is represented a tuple, e.g. (n, m) for a 2D array.
- The length of the tuple is number of dimensions (i.e. axes).
- The values of the tuple are the number of elements in each dimension (axis).
Consider the array foo:
foo = np.ones((6,4))fooarray([[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.]])
It was created with the \(2\) dimensions. * The first dimension, axis 0, has \(6\) elements. In this case, these elements are arrays. * The second dimension, axis 1, has \(4\) elements. Its elements are scalars (numbers in this case).
The shape of foo[0] is \(4\). It contains \(1\) axis with \(4\) elements.
foo[0], foo[0].shape(array([1., 1., 1., 1.]), (4,))
It has a shape of \(1\) and not \(4 \times 1\) because it is a vector, not a matrix.
SO, there is a difference between a vector and a 1-column matrix.
Reshaping
If we want to make it into a 1-column matrix, we need to reshape it using np.reshape().
Note that the first value of the shape argument is \(-1\). This means use the length of the vector that is passed to it.
## np.reshape?foo2 = np.reshape(foo[0], (-1, 1))foo2, foo2.shape(array([[1.],
[1.],
[1.],
[1.]]),
(4, 1))
When indexing an array, think of the positions of the comma-delimitted tuple as the axis.
The values are the element offsets in the containing array. The
foo[2:, 2:]
## ^ ^
## 0 1 <- AXIS NUMBERarray([[1., 1.],
[1., 1.],
[1., 1.],
[1., 1.]])
Example: The Normal Equation
Sometimes, you need to convert vectors into 1-column matrices to make certain linear algebraic functions work.
Consider two random variables, \(x\) and \(y\).
x = np.random.randn(100)
y = np.random.randn(100)We can fit a regression line using the normal equation, which appears in this week’s homework.
\(\begin{aligned} \hat\beta_i=(x^Tx)^{−1}x^Ty \end{aligned}\)
Here is a standward way of expressing it in NumPy:
get_beta1 = lambda x, y: (np.linalg.inv(x.T.dot(x))) * (x.T.dot(y))However, it will fail if we pass it our two variables, x and y.
The reason is that it expects x to be a matrix, since it is designed to handle n-dimension predictor variables, usually represented as \(\textbf{X}\).
get_beta1(x, y)LinAlgError: 0-dimensional array given. Array must be at least two-dimensional
The revised function will work with a vector as x:
\(\hat\beta_i = \large\frac{1}{x^Tx} \small(x^Ty)\)
get_beta2 = lambda x, y: (1 / x.T.dot(x)) * (x.T.dot(y))get_beta2(x, y)0.025226057106433126
We can fix the problem in the general case by converting our vector into a matrix using np.reshape().
x_as_matrix = np.reshape(x, (-1, 1))get_beta1(x_as_matrix, y)array([[0.02522606]])
get_beta1(x_as_matrix, y)[0][0]0.025226057106433126
get_beta1(x_as_matrix, y)[0][0] == get_beta2(x, y)True
One take-away here is that there is a difference betweek a scalar value and a 1 x 1 array.
Broadcasting
What happens when you try to perform an element-wise operation on two arrays of different shape?
NumPy will convert a low-dimensional array into a high-dimensional array to allow the operation to take place.
This is called broadcasting.
Let’s look at at our array foo:
fooarray([[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.]])
If we multiply it by 5, the scalar is converted into an array of the same shape as foo with the value 5 broadcast to populate the entire array.
foo * 5array([[5., 5., 5., 5.],
[5., 5., 5., 5.],
[5., 5., 5., 5.],
[5., 5., 5., 5.],
[5., 5., 5., 5.],
[5., 5., 5., 5.]])
If we want to multiply an array by a vector, the vector is broadcast to become a 2D array.
foo * np.array([5, 10, 6, 8])array([[ 5., 10., 6., 8.],
[ 5., 10., 6., 8.],
[ 5., 10., 6., 8.],
[ 5., 10., 6., 8.],
[ 5., 10., 6., 8.],
[ 5., 10., 6., 8.]])
Note that NumPy can’t always make the adjustment:
foo * np.array([5, 10])ValueError: operands could not be broadcast together with shapes (6,4) (2,)
Array-Oriented Programming
Using NumPy arrays enables you to express many kinds of data processing tasks as concise array expressions without writing loops.
This practice of replacing explicit loops with array expressions is referred to by some people as vectorization.
Vectorized array operations are often significantly faster than their pure Python equivalents.
They are also visually concise and elegant, although loops have the virtue of visualizing what’s under the hood in an algorithm.
Expressing Conditional Logic as Array Operations
np.where()
The np.where function is a vectorized version of the ternary expression x if condition else y.
Suppose we had a boolean array and two arrays of values:
xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])result = [(x if c else y) for x, y, c in zip(xarr, yarr, cond)]result[1.1, 2.2, 1.3, 1.4, 2.5]
Here is the vectorized version:
result = np.where(cond, xarr, yarr)resultarray([1.1, 2.2, 1.3, 1.4, 2.5])
arr = np.random.randn(4, 4)arrarray([[-0.07558452, 0.06060905, 0.08395713, -0.50584862],
[-1.02703874, -2.08227842, -0.70058386, 0.42922211],
[ 1.65264793, -0.0081504 , -0.88014697, 0.53720441],
[ 0.98719467, -2.26285173, 0.51148161, -0.1798714 ]])
arr > 0array([[False, True, True, False],
[False, False, False, True],
[ True, False, False, True],
[ True, False, True, False]])
np.where(arr > 0, 2, -2)array([[-2, 2, 2, -2],
[-2, -2, -2, 2],
[ 2, -2, -2, 2],
[ 2, -2, 2, -2]])
np.where(arr > 0, 2, arr) # set only positive values to 2array([[-0.07558452, 2. , 2. , -0.50584862],
[-1.02703874, -2.08227842, -0.70058386, 2. ],
[ 2. , -0.0081504 , -0.88014697, 2. ],
[ 2. , -2.26285173, 2. , -0.1798714 ]])
Mathematical and Statistical Methods
Statistical computations are aggregate functions apply to vectors within an array.
In a 2D array, they can be applied to rows or columns, i.e. axis \(0\) or axis \(1\).
Let’s create an array of random values. We can think of it is a table of observations and random variables.
arr = np.random.randn(5, 4)arr.mean()
arr.mean()np.mean(arr)arr.mean(axis=1)arr.mean(axis=0).sum()
arr.sum()Row wise aggregration
arr.sum(axis=0)arr.mean(axis=1)Column-wise aggregration
.cumsum()
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7])arr.cumsum()arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])arrarr.cumsum(axis=0)arr.cumprod(axis=1)Methods for Boolean Arrays
.sum()
Since booleans are \(0\)s and \(1\), we can sum them to get a total truth count.
arr = np.random.randn(100)
(arr > 0).sum() # Number of positive valuesbools = np.array([False, False, True, False]).any()
bools.any().all()
bools.all()Sorting
.sort()
arr = np.random.randn(6)arrarr.sort()arrarr = np.random.randn(5, 3)arrarr.sort(1)arrlarge_arr = np.random.randn(1000)
large_arr.sort()
large_arr[int(0.05 * len(large_arr))] # 5% quantile0.05 * len(large_arr)Unique and Other Set Logic
np.unique()
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
np.unique(names)sorted(set(names))ints = np.array([3, 3, 3, 2, 2, 1, 1, 4, 4])
np.unique(ints)np.in1d()
Tests whether each element of a 1-D array is also present in a second array.
values = np.array([6, 0, 0, 3, 2, 5, 6])
np.in1d(values, [2, 3, 6])File Input and Output with Arrays
np.save()
Save an array to a binary file in NumPy .npy format.
Automatically adds the .npy file extension.
arr = np.arange(10)
np.save('some_array', arr)np.load()
np.load('some_array.npy')np.savetxt()
Save an array to a text file.
## np.savetxt?np.savetxt('some_array.txt', arr)## !more some_array.txtnp.savez()
Save several arrays into a single file in uncompressed .npz format.
np.savez('array_archive.npz', a=arr, b=arr)arch = np.load('array_archive.npz')
arch['b']np.savez_compressed('arrays_compressed.npz', a=arr, b=arr)Clean up …
!rm some_array.npy
!rm some_array.txt
!rm array_archive.npz
!rm arrays_compressed.npzLinear Algebra
.dot()
Dot product of two arrays. Specifically, - If both a and b are 1-D arrays, it is inner product of vectors (without complex conjugation). - If both a and b are 2-D arrays, it is matrix multiplication, but using matmul() or a @ b is preferred. - If either a or b is 0-D (scalar), it is equivalent to multiply() and using numpy.multiply(a, b) or a * b is preferred. - If a is an N-D array and b is a 1-D array, it is a sum product over the last axis of a and b. - If a is an N-D array and b is an M-D array (where M>=2), it is a sum product over the last axis of a and the second-to-last axis of b:
dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m])
x = np.array([[1., 2., 3.], [4., 5., 6.]])
y = np.array([[6., 23.], [-1, 7], [8, 9]])x, yyx.dot(y)np.dot(x, y)np.dot(x, np.ones(3))In NumPy, the @ operator means matrix multiplication.
x @ np.ones(3)np.linalg.inv()
from numpy.linalg import inv, qr
X = np.random.randn(5, 5)
mat = X.T.dot(X)
inv(mat)mat.dot(inv(mat))
q, r = qr(mat)rPseudorandom Number Generation
np.random.normal()
samples = np.random.normal(size=(4, 4))
samples
from random import normalvariate
N = 1000000
%timeit samples = [normalvariate(0, 1) for _ in range(N)]
%timeit np.random.normal(size=N)np.random.seed(1234)rng = np.random.RandomState(1234)rng.randn(10)Example: Random Walks
Let simulate a random walk. The walk will be represented as a vector.
We’ll do it first as loop, then with vectorization.
Loops
import random
position = 0
walk = [position] # initialize the walk
steps = 1000
for i in range(steps):
step = 1 if random.randint(0, 1) else -1 # Coin toss
position += step
walk.append(position)import matplotlib.pyplot as pltplt.figure()
plt.plot(walk[:100]);Vectors
np.random.seed(12345)nsteps = 1000
draws = np.random.randint(0, 2, size=nsteps)
steps = np.where(draws > 0, 1, -1)
walk2 = steps.cumsum()plt.plot(walk[:100]);walk2.min(), walk2.max()Simulating Many Random Walks at Once
nwalks = 5000
nsteps = 1000draws = np.random.randint(0, 2, size=(nwalks, nsteps)) # 0 or 1
steps = np.where(draws > 0, 1, -1)
walks = steps.cumsum(1)drawswalkswalks.max(), walks.min()hits30 = (np.abs(walks) >= 30).any(1)
hits30hits30.sum() # Number that hit 30 or -30crossing_times = (np.abs(walks[hits30]) >= 30).argmax(1)
crossing_times.mean()steps = np.random.normal(loc=0, scale=0.25, size=(nwalks, nsteps))Feel free to experiment with other distributions for the steps other than equal-sized coin flips. You need only use a different random generator method, like standard_normal to generate normally distributed steps with some mean and standard deviation:
draws = 0.25 * rng.standard_normal((nwalks, nsteps)) # Walks: observations, Steps: featuressample_walk = np.random.choice(len(draws))
plt.plot(draws[sample_walk])
plt.title(f"Walk #{sample_walk}");