NB: NumPy Continued

Quick Refresher on Shape

The shape of an array is represented a tuple, e.g. (n, m) for a 2D array.

  • The length of the tuple is number of dimensions (i.e. axes).
  • The values of the tuple are the number of elements in each dimension (axis).

Consider the array foo:

import numpy as np
foo = np.ones((6,4))
foo
array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

It was created with the \(2\) dimensions. * The first dimension, axis 0, has \(6\) elements. In this case, these elements are arrays. * The second dimension, axis 1, has \(4\) elements. Its elements are scalars (numbers in this case).

The shape of foo[0] is \(4\). It contains \(1\) axis with \(4\) elements.

foo[0], foo[0].shape
(array([1., 1., 1., 1.]), (4,))

It has a shape of \(1\) and not \(4 \times 1\) because it is a vector, not a matrix.

SO, there is a difference between a vector and a 1-column matrix.

Reshaping

If we want to make it into a 1-column matrix, we need to reshape it using np.reshape().

Note that the first value of the shape argument is \(-1\). This means use the length of the vector that is passed to it.

## np.reshape?
foo2 = np.reshape(foo[0], (-1, 1))
foo2, foo2.shape
(array([[1.],
        [1.],
        [1.],
        [1.]]),
 (4, 1))

When indexing an array, think of the positions of the comma-delimitted tuple as the axis.

The values are the element offsets in the containing array. The

foo[2:, 2:]
##   ^   ^ 
##   0   1  <- AXIS NUMBER
array([[1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.]])

Example: The Normal Equation

Sometimes, you need to convert vectors into 1-column matrices to make certain linear algebraic functions work.

Consider two random variables, \(x\) and \(y\).

x = np.random.randn(100)
y = np.random.randn(100)

We can fit a regression line using the normal equation, which appears in this week’s homework.

\(\begin{aligned} \hat\beta_i=(x^Tx)^{−1}x^Ty \end{aligned}\)

Here is a standward way of expressing it in NumPy:

get_beta1 = lambda x, y: (np.linalg.inv(x.T.dot(x))) * (x.T.dot(y))

However, it will fail if we pass it our two variables, x and y.

The reason is that it expects x to be a matrix, since it is designed to handle n-dimension predictor variables, usually represented as \(\textbf{X}\).

get_beta1(x, y)
LinAlgError: 0-dimensional array given. Array must be at least two-dimensional

The revised function will work with a vector as x:

\(\hat\beta_i = \large\frac{1}{x^Tx} \small(x^Ty)\)

get_beta2 = lambda x, y: (1 / x.T.dot(x)) * (x.T.dot(y))
get_beta2(x, y)
0.025226057106433126

We can fix the problem in the general case by converting our vector into a matrix using np.reshape().

x_as_matrix = np.reshape(x, (-1, 1))
get_beta1(x_as_matrix, y)
array([[0.02522606]])
get_beta1(x_as_matrix, y)[0][0]
0.025226057106433126
get_beta1(x_as_matrix, y)[0][0] == get_beta2(x, y)
True

One take-away here is that there is a difference betweek a scalar value and a 1 x 1 array.

Broadcasting

What happens when you try to perform an element-wise operation on two arrays of different shape?

NumPy will convert a low-dimensional array into a high-dimensional array to allow the operation to take place.

This is called broadcasting.

Let’s look at at our array foo:

foo
array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

If we multiply it by 5, the scalar is converted into an array of the same shape as foo with the value 5 broadcast to populate the entire array.

foo * 5
array([[5., 5., 5., 5.],
       [5., 5., 5., 5.],
       [5., 5., 5., 5.],
       [5., 5., 5., 5.],
       [5., 5., 5., 5.],
       [5., 5., 5., 5.]])

If we want to multiply an array by a vector, the vector is broadcast to become a 2D array.

foo * np.array([5, 10, 6, 8])
array([[ 5., 10.,  6.,  8.],
       [ 5., 10.,  6.,  8.],
       [ 5., 10.,  6.,  8.],
       [ 5., 10.,  6.,  8.],
       [ 5., 10.,  6.,  8.],
       [ 5., 10.,  6.,  8.]])

Note that NumPy can’t always make the adjustment:

foo * np.array([5, 10])
ValueError: operands could not be broadcast together with shapes (6,4) (2,) 

Array-Oriented Programming

Using NumPy arrays enables you to express many kinds of data processing tasks as concise array expressions without writing loops.

This practice of replacing explicit loops with array expressions is referred to by some people as vectorization.

Vectorized array operations are often significantly faster than their pure Python equivalents.

They are also visually concise and elegant, although loops have the virtue of visualizing what’s under the hood in an algorithm.

Expressing Conditional Logic as Array Operations

np.where()

The np.where function is a vectorized version of the ternary expression x if condition else y.

Suppose we had a boolean array and two arrays of values:

xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])
result = [(x if c else y) for x, y, c in zip(xarr, yarr, cond)]
result
[1.1, 2.2, 1.3, 1.4, 2.5]

Here is the vectorized version:

result = np.where(cond, xarr, yarr)
result
array([1.1, 2.2, 1.3, 1.4, 2.5])
arr = np.random.randn(4, 4)
arr
array([[-0.07558452,  0.06060905,  0.08395713, -0.50584862],
       [-1.02703874, -2.08227842, -0.70058386,  0.42922211],
       [ 1.65264793, -0.0081504 , -0.88014697,  0.53720441],
       [ 0.98719467, -2.26285173,  0.51148161, -0.1798714 ]])
arr > 0
array([[False,  True,  True, False],
       [False, False, False,  True],
       [ True, False, False,  True],
       [ True, False,  True, False]])
np.where(arr > 0, 2, -2)
array([[-2,  2,  2, -2],
       [-2, -2, -2,  2],
       [ 2, -2, -2,  2],
       [ 2, -2,  2, -2]])
np.where(arr > 0, 2, arr) # set only positive values to 2
array([[-0.07558452,  2.        ,  2.        , -0.50584862],
       [-1.02703874, -2.08227842, -0.70058386,  2.        ],
       [ 2.        , -0.0081504 , -0.88014697,  2.        ],
       [ 2.        , -2.26285173,  2.        , -0.1798714 ]])

Mathematical and Statistical Methods

Statistical computations are aggregate functions apply to vectors within an array.

In a 2D array, they can be applied to rows or columns, i.e. axis \(0\) or axis \(1\).

Let’s create an array of random values. We can think of it is a table of observations and random variables.

arr = np.random.randn(5, 4)
arr

.mean()

arr.mean()
np.mean(arr)
arr.mean(axis=1)
arr.mean(axis=0)

.sum()

arr.sum()

Row wise aggregration

arr.sum(axis=0)
arr.mean(axis=1)

Column-wise aggregration

.cumsum()

arr = np.array([0, 1, 2, 3, 4, 5, 6, 7])
arr.cumsum()
arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
arr
arr.cumsum(axis=0)
arr.cumprod(axis=1)

Methods for Boolean Arrays

.sum()

Since booleans are \(0\)s and \(1\), we can sum them to get a total truth count.

arr = np.random.randn(100)
(arr > 0).sum() # Number of positive values
bools = np.array([False, False, True, False])

.any()

bools.any()

.all()

bools.all()

Sorting

.sort()

arr = np.random.randn(6)
arr
arr.sort()
arr
arr = np.random.randn(5, 3)
arr
arr.sort(1)
arr
large_arr = np.random.randn(1000)
large_arr.sort()
large_arr[int(0.05 * len(large_arr))] # 5% quantile
0.05 * len(large_arr)

Unique and Other Set Logic

np.unique()

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
np.unique(names)
sorted(set(names))
ints = np.array([3, 3, 3, 2, 2, 1, 1, 4, 4])
np.unique(ints)

np.in1d()

Tests whether each element of a 1-D array is also present in a second array.

values = np.array([6, 0, 0, 3, 2, 5, 6])
np.in1d(values, [2, 3, 6])

File Input and Output with Arrays

np.save()

Save an array to a binary file in NumPy .npy format.

Automatically adds the .npy file extension.

arr = np.arange(10)
np.save('some_array', arr)

np.load()

np.load('some_array.npy')

np.savetxt()

Save an array to a text file.

## np.savetxt?
np.savetxt('some_array.txt', arr)
## !more some_array.txt

np.savez()

Save several arrays into a single file in uncompressed .npz format.

np.savez('array_archive.npz', a=arr, b=arr)
arch = np.load('array_archive.npz')
arch['b']
np.savez_compressed('arrays_compressed.npz', a=arr, b=arr)

Clean up …

!rm some_array.npy
!rm some_array.txt
!rm array_archive.npz
!rm arrays_compressed.npz

Linear Algebra

.dot()

Dot product of two arrays. Specifically, - If both a and b are 1-D arrays, it is inner product of vectors (without complex conjugation). - If both a and b are 2-D arrays, it is matrix multiplication, but using matmul() or a @ b is preferred. - If either a or b is 0-D (scalar), it is equivalent to multiply() and using numpy.multiply(a, b) or a * b is preferred. - If a is an N-D array and b is a 1-D array, it is a sum product over the last axis of a and b. - If a is an N-D array and b is an M-D array (where M>=2), it is a sum product over the last axis of a and the second-to-last axis of b:

    dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m])
x = np.array([[1., 2., 3.], [4., 5., 6.]])
y = np.array([[6., 23.], [-1, 7], [8, 9]])
x, y
y
x.dot(y)
np.dot(x, y)
np.dot(x, np.ones(3))

In NumPy, the @ operator means matrix multiplication.

x @ np.ones(3)

np.linalg.inv()

from numpy.linalg import inv, qr
X = np.random.randn(5, 5)
mat = X.T.dot(X)
inv(mat)
mat.dot(inv(mat))
q, r = qr(mat)
r

Pseudorandom Number Generation

np.random.normal()

samples = np.random.normal(size=(4, 4))
samples


from random import normalvariate
N = 1000000
%timeit samples = [normalvariate(0, 1) for _ in range(N)]
%timeit np.random.normal(size=N)
np.random.seed(1234)
rng = np.random.RandomState(1234)
rng.randn(10)

Example: Random Walks

Let simulate a random walk. The walk will be represented as a vector.

We’ll do it first as loop, then with vectorization.

Loops

import random
position = 0
walk = [position] # initialize the walk
steps = 1000
for i in range(steps):
    step = 1 if random.randint(0, 1) else -1  # Coin toss
    position += step
    walk.append(position)
import matplotlib.pyplot as plt
plt.figure()
plt.plot(walk[:100]);

Vectors

np.random.seed(12345)
nsteps = 1000
draws = np.random.randint(0, 2, size=nsteps)
steps = np.where(draws > 0, 1, -1)
walk2 = steps.cumsum()
plt.plot(walk[:100]);
walk2.min(), walk2.max()

Simulating Many Random Walks at Once

nwalks = 5000
nsteps = 1000
draws = np.random.randint(0, 2, size=(nwalks, nsteps)) # 0 or 1
steps = np.where(draws > 0, 1, -1)
walks = steps.cumsum(1)
draws
walks
walks.max(), walks.min()
hits30 = (np.abs(walks) >= 30).any(1)
hits30
hits30.sum() # Number that hit 30 or -30
crossing_times = (np.abs(walks[hits30]) >= 30).argmax(1)
crossing_times.mean()
steps = np.random.normal(loc=0, scale=0.25, size=(nwalks, nsteps))

Feel free to experiment with other distributions for the steps other than equal-sized coin flips. You need only use a different random generator method, like standard_normal to generate normally distributed steps with some mean and standard deviation:

draws = 0.25 * rng.standard_normal((nwalks, nsteps)) # Walks: observations, Steps: features
sample_walk = np.random.choice(len(draws))
plt.plot(draws[sample_walk])
plt.title(f"Walk #{sample_walk}");